8. Text Classification

Dev.yeon 2021. 7. 23. 21:35

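Text classification is the task of taking a sentence as input and assigning it to one of a set of predefined classes. It has many applications, including sentiment analysis, spam detection, user intent classification, and category classification. On the preprocessing side, there is no universal answer to whether lemmatization or stemming should be applied before classifying. In the deep learning era, dimensionality reduction became practical (words are represented as dense, low-dimensional embeddings rather than sparse one-hot vectors), so the sparsity problem has been largely alleviated and lemmatization and stemming are now often skipped. A reasonable approach is to start without them and try them later only if a shortage of corpus data turns out to be the cause of poor performance.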
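To make the sparsity argument concrete, here is a minimal sketch of how stemming shrinks a vocabulary. The `crude_stem` function is a hypothetical toy suffix stripper written only for this illustration; it is not a real stemmer, and the tiny corpus is made up.

```python
def crude_stem(token):
    """Toy suffix stripper (hypothetical, for illustration only)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least 3 characters of the token remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Made-up corpus in which one verb appears in four surface forms.
corpus = "she walks he walked they are walking we walk".split()

raw_vocab = set(corpus)
stemmed_vocab = {crude_stem(token) for token in corpus}

print(len(raw_vocab), len(stemmed_vocab))
```

With one-hot or count-based features, every surface form gets its own dimension, so merging forms reduces sparsity. With learned embeddings, related forms can end up close to each other anyway, which is why the merging step matters less when enough data is available.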

1. Text Classification with an RNN

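The code is written in PyTorch, and dropout is applied between the stacked LSTM layers. Because the model is optimized with the NLL (negative log-likelihood) loss, it uses the log-softmax function, which returns log-probabilities, instead of the usual softmax.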

import torch.nn as nn 

class RNNClassifier(nn.Module):

  def __init__(self,
               input_size,
               word_vec_dim,
               hidden_size,
               n_classes,
               n_layers=4,
               dropout_p=0.3):
    self.input_size=input_size
    self.word_vec_dim=word_vec_dim
    self.hidden_size=hidden_size
    self.n_classes=n_classes
    self.n_layers=n_layers
    self.dropout_p=dropout_p

    super().__init__()

    self.emb=nn.Embedding(input_size,word_vec_dim)
    self.rnn=nn.LSTM(input_size=word_vec_dim,
                     hidden_size=hidden_size,
                     num_layers=n_layers,
                     dropout=dropout_p,
                     batch_first=True,
                     bidirectional=True)
    self.generator=nn.Linear(hidden_size*2, n_classes)
    self.activation=nn.LogSoftmax(dim=1)

  def forward(self, x):
    # |x| = (batch_size, length)
    x = self.emb(x)
    # |x| = (batch_size, length, word_vec_dim)
    x, _ = self.rnn(x)
    # |x| = (batch_size, length, hidden_size * 2)
    # Classify from the output at the last time step.
    y = self.activation(self.generator(x[:, -1]))
    # |y| = (batch_size, n_classes)

    return y
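The reason log-softmax is paired with the NLL loss can be checked by hand. The sketch below uses plain Python instead of PyTorch so that it runs without dependencies; the function names and the example scores are illustrative assumptions. In PyTorch the same pairing is `nn.LogSoftmax` followed by `nn.NLLLoss`.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]


def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]


logits = [2.0, 0.5, -1.0]  # hypothetical scores for 3 classes
target = 0                 # index of the correct class

log_probs = log_softmax(logits)
loss = nll_loss(log_probs, target)

# Reference value: plain softmax first, then the log of the target probability.
exps = [math.exp(z) for z in logits]
probs = [e / sum(exps) for e in exps]
reference = -math.log(probs[target])

print(loss, reference)
```

Both routes give the same loss. Working in log space avoids taking the log of a very small softmax output, which is the practical reason to have the model return log-probabilities directly.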

2. Text Classification with a CNN

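A CNN runs a forward pass through convolution operations and then uses backpropagation to find better convolution filters. The output of a convolution is smaller than its input by an amount that depends on the filter size, so padding must be added to keep the size unchanged. CNNs were designed for image processing, but they are also used for text classification. Each one-hot vector is converted into a word embedding, which is a one-dimensional vector, and stacking the embedding vectors of every time step in the sentence yields a two-dimensional matrix. Running convolutions over this matrix is what lets a CNN be applied to text.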

import torch
import torch.nn as nn


class CNNClassifier(nn.Module):

    def __init__(
        self,
        input_size,
        word_vec_size,
        n_classes,
        use_batch_norm=False,
        dropout_p=.5,
        window_sizes=(3, 4, 5),
        n_filters=(100, 100, 100),
    ):
        self.input_size = input_size 
        self.word_vec_size = word_vec_size
        self.n_classes = n_classes
        self.use_batch_norm = use_batch_norm
        self.dropout_p = dropout_p
        self.window_sizes = window_sizes
        self.n_filters = n_filters

        super().__init__()

        self.emb = nn.Embedding(input_size, word_vec_size)
        self.feature_extractors = nn.ModuleList()
        for window_size, n_filter in zip(window_sizes, n_filters):
            self.feature_extractors.append(
                nn.Sequential(
                    nn.Conv2d(
                        in_channels=1, # We only use one embedding layer.
                        out_channels=n_filter,
                        kernel_size=(window_size, word_vec_size),
                    ),
                    nn.ReLU(),
                    nn.BatchNorm2d(n_filter) if use_batch_norm else nn.Dropout(dropout_p),
                )
            )

        self.generator = nn.Linear(sum(n_filters), n_classes)
        self.activation = nn.LogSoftmax(dim=-1)

    def forward(self, x):
        # |x| = (batch_size, length)
        x = self.emb(x)
        # |x| = (batch_size, length, word_vec_size)
        min_length = max(self.window_sizes)
        if min_length > x.size(1):
            pad = x.new(x.size(0), min_length - x.size(1), self.word_vec_size).zero_()
            # |pad| = (batch_size, min_length - length, word_vec_size)
            x = torch.cat([x, pad], dim=1)
            # |x| = (batch_size, min_length, word_vec_size)

        x = x.unsqueeze(1)
        # |x| = (batch_size, 1, length, word_vec_size)

        cnn_outs = []
        for block in self.feature_extractors:
            cnn_out = block(x)
            # |cnn_out| = (batch_size, n_filter, length - window_size + 1, 1)

            cnn_out = nn.functional.max_pool1d(
                input=cnn_out.squeeze(-1),
                kernel_size=cnn_out.size(-2)
            ).squeeze(-1)
            
            # |cnn_out| = (batch_size, n_filter)
            cnn_outs += [cnn_out]
        cnn_outs = torch.cat(cnn_outs, dim=-1)
        # |cnn_outs| = (batch_size, sum(n_filters))
        y = self.activation(self.generator(cnn_outs))
        # |y| = (batch_size, n_classes)

        return y
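The shrinking output length, the effect of padding, and the max pooling over time steps used in the classifier can be traced on a tiny example. This is a dependency-free sketch in plain Python, assuming stride 1; the helper names, the sentence matrix, and the kernel values are all made up for illustration.

```python
def conv_output_length(length, window_size, padding=0):
    """Output length of a stride-1 convolution along the time axis."""
    return length + 2 * padding - window_size + 1


def conv_over_time(sentence, kernel):
    """Slide a (window_size x dim) kernel over a (length x dim) sentence matrix."""
    window_size = len(kernel)
    outputs = []
    for start in range(len(sentence) - window_size + 1):
        window = sentence[start:start + window_size]
        # Element-wise product of the window and the kernel, summed to one scalar.
        outputs.append(sum(
            w * k
            for w_row, k_row in zip(window, kernel)
            for w, k in zip(w_row, k_row)
        ))
    return outputs


# Hypothetical 5-token sentence with 2-dimensional word embeddings.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 1.0]]
kernel = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # window_size = 3

feature_map = conv_over_time(sentence, kernel)
pooled = max(feature_map)  # max pooling over all time steps

print(feature_map, pooled)
print(conv_output_length(5, 3), conv_output_length(5, 3, padding=1))
```

Because the kernel spans the full embedding dimension, each filter produces one value per window position, and max pooling reduces that to a single feature per filter regardless of sentence length. That is why the final linear layer in the classifier takes `sum(n_filters)` inputs.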

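※ This post is based on 「김기현의 자연어처리 딥러닝캠프-파이토치편」 (Kim Ki-hyun's Natural Language Processing Deep Learning Camp: PyTorch Edition). The full code is available on Kim Ki-hyun's GitHub.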

https://github.com/kh-kim/simple-ntc/blob/master/simple_ntc/models/rnn.py

