Text Generation with an RNN

1) Generating text with an RNN

* Suppose we want to train on three sentences: '경마장에 있는 말이 뛰고 있다', '그의 말이 법이다', and '가는 말이 고와야 오는 말이 곱다'.

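* To train on these three sentences, we have to split them into X data and y data. How should we split them? Take each word as y, and let X be the words that come before it in the same sentence. For example, from '경마장에 있는 말이 뛰고 있다', X = '경마장에' gives y = '있는', and X = '경마장에 있는' gives y = '말이'.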
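The prefix-to-next-word split described above can be sketched in plain Python, without any Keras dependency. This is only an illustration of the idea; the helper name `make_pairs` is made up for this example and is not part of the original post.

```python
# Illustrative sketch: turn each sentence into (context words, next word) pairs.
sentences = [
    "경마장에 있는 말이 뛰고 있다",
    "그의 말이 법이다",
    "가는 말이 고와야 오는 말이 곱다",
]

def make_pairs(sentence):
    """Hypothetical helper: every prefix of the sentence predicts the word that follows it."""
    words = sentence.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]

pairs = [pair for s in sentences for pair in make_pairs(s)]

for context, target in pairs:
    print(" ".join(context), "->", target)
```

A sentence with k words yields k - 1 pairs, so the three sentences (5, 3, and 6 words) give 4 + 2 + 5 = 11 training samples.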

* To build a dataset of this form, we go through the following steps.

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

text = """경마장에 있는 말이 뛰고 있다\n
그의 말이 법이다\n
가는 말이 고와야 오는 말이 곱다\n"""

# Build the vocabulary
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
word_index = tokenizer.word_index
sorted(word_index.items(), key=lambda x: x[1])

# [('말이', 1),
#  ('경마장에', 2),
#  ('있는', 3),
#  ('뛰고', 4),
#  ('있다', 5),
#  ('그의', 6),
#  ('법이다', 7),
#  ('가는', 8),
#  ('고와야', 9),
#  ('오는', 10),
#  ('곱다', 11)]

* We created a tokenizer object and fit it on text, which holds the three example sentences. This stores the words of the examples in the tokenizer's vocabulary and assigns an index to each of them.
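The index assignment seen in the output above (the most frequent word '말이' gets index 1, and the remaining words follow in order of first appearance) can be imitated with a `Counter`. This is a rough sketch that reproduces the observed ordering for this particular text, not the Tokenizer's actual implementation, which by default also lowercases the input and filters punctuation.

```python
from collections import Counter

text = "경마장에 있는 말이 뛰고 있다\n그의 말이 법이다\n가는 말이 고와야 오는 말이 곱다\n"

# Count whitespace-separated words.
counts = Counter(text.split())

# most_common() sorts by count (descending); words with equal counts keep
# their first-seen order. Indices start at 1 because 0 is left for padding.
word_index = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

for word, index in word_index.items():
    print(index, word)
```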

sequences = []
for line in text.split('\n'):
    # Encode one line as a list of word indices (empty lines produce no samples)
    encoded = tokenizer.texts_to_sequences([line])[0]
    # Every prefix of two or more words becomes one training sample
    for i in range(1, len(encoded)):
        sequences.append(encoded[:i + 1])

print('Number of training samples: %d \n' % len(sequences))
for sequence in tokenizer.sequences_to_texts(sequences):
    print(sequence)

# Number of training samples: 11 

# 경마장에 있는
# 경마장에 있는 말이
# 경마장에 있는 말이 뛰고
# 경마장에 있는 말이 뛰고 있다
# 그의 말이
# 그의 말이 법이다
# 가는 말이
# 가는 말이 고와야
# 가는 말이 고와야 오는
# 가는 말이 고와야 오는 말이
# 가는 말이 고와야 오는 말이 곱다

* We stored each sentence prefix in sequences as a list of word indices. To check that they were stored correctly, we converted them back to text with the sequences_to_texts() method.

# Pad all samples to the same length
max_len = max(len(l) for l in sequences)
sequences = pad_sequences(sequences, maxlen=max_len, padding='pre')
sequences

# array([[ 0,  0,  0,  0,  2,  3],
#        [ 0,  0,  0,  2,  3,  1],
#        [ 0,  0,  2,  3,  1,  4],
#        [ 0,  2,  3,  1,  4,  5],
#        [ 0,  0,  0,  0,  6,  1],
#        [ 0,  0,  0,  6,  1,  7],
#        [ 0,  0,  0,  0,  8,  1],
#        [ 0,  0,  0,  8,  1,  9],
#        [ 0,  0,  8,  1,  9, 10],
#        [ 0,  8,  1,  9, 10,  1],
#        [ 8,  1,  9, 10,  1, 11]], dtype=int32)

* Because the sequences have different lengths, we zero-padded them at the front so that they all have length max_len.
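What `padding='pre'` does can be sketched with plain lists. The function `pad_pre` below is a hypothetical stand-in written for this example; unlike `pad_sequences`, it does not truncate sequences longer than `maxlen`.

```python
def pad_pre(seqs, maxlen, value=0):
    """Hypothetical helper: left-pad each sequence with `value` up to `maxlen`."""
    return [[value] * (maxlen - len(s)) + list(s) for s in seqs]

# Three of the index sequences from the example above.
sequences = [[2, 3], [2, 3, 1], [6, 1, 7]]
max_len = max(len(s) for s in sequences)

padded = pad_pre(sequences, max_len)
for row in padded:
    print(row)
```

Padding at the front keeps the last word of every sample in the last column, which is what makes it possible to take the target word by slicing off that column.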

X = sequences[:, :-1]  # everything but the last word: the input context
y = sequences[:, -1]   # the last word: the prediction target
X

# array([[ 0,  0,  0,  0,  2],
#        [ 0,  0,  0,  2,  3],
#        [ 0,  0,  2,  3,  1],
#        [ 0,  2,  3,  1,  4],
#        [ 0,  0,  0,  0,  6],
#        [ 0,  0,  0,  6,  1],
#        [ 0,  0,  0,  0,  8],
#        [ 0,  0,  0,  8,  1],
#        [ 0,  0,  8,  1,  9],
#        [ 0,  8,  1,  9, 10],
#        [ 8,  1,  9, 10,  1]], dtype=int32)

* We split the data into X and y: the last column of each row becomes y, and the remaining columns become X.

# +1 because index 0 is reserved for padding
vocab_size = len(tokenizer.word_index) + 1
print(y)
y = to_categorical(y, num_classes=vocab_size)
print(y)

# [ 3  1  4  5  1  7  1  9 10  1 11]
# [[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
#  [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]

* We converted y into one-hot (categorical) vectors.
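As a rough illustration of the conversion, the same one-hot encoding can be written by hand. The function name `one_hot` is made up for this sketch, and it returns plain lists instead of the NumPy array that `to_categorical` returns.

```python
def one_hot(labels, num_classes):
    """Hypothetical helper: one row per label, with 1.0 at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

vocab_size = 12      # 11 words + 1 for the padding index 0
y = [3, 1, 4]        # the first three targets from the example above

encoded = one_hot(y, vocab_size)
for row in encoded:
    print(row)
```

Each row has vocab_size entries, so it lines up with the softmax output of the model, which has one probability per vocabulary index.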

2) Designing the model

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, SimpleRNN

embedding_dim = 10
hidden_units = 32

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim))  # input vocabulary size, embedding vector size
model.add(SimpleRNN(hidden_units))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=200, verbose=1)

# Epoch 1/200
# 1/1 [==============================] - 3s 3s/step - loss: 2.5130 - accuracy: 0.0000e+00
# Epoch 2/200
# 1/1 [==============================] - 0s 11ms/step - loss: 2.4974 - accuracy: 0.0909
# ... (omitted)
# Epoch 199/200
# 1/1 [==============================] - 0s 16ms/step - loss: 0.0968 - accuracy: 1.0000
# Epoch 200/200
# 1/1 [==============================] - 0s 19ms/step - loss: 0.0952 - accuracy: 1.0000
# <keras.callbacks.History at 0x7f822579cf50>

* We trained a model made up of an Embedding layer, a vanilla RNN, and a Dense layer for 200 epochs.
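To get a feel for the size of this model, the number of trainable parameters per layer can be worked out by hand. The sketch below assumes the standard layouts: an embedding table of vocab_size x embedding_dim, a simple RNN with input weights, recurrent weights, and a bias, and a dense layer with weights and a bias. The result can be compared against `model.summary()`.

```python
vocab_size = 12      # 11 words + padding index 0
embedding_dim = 10
hidden_units = 32

# Embedding: one embedding_dim-sized vector per vocabulary index.
embedding_params = vocab_size * embedding_dim

# SimpleRNN: input weights + recurrent weights + bias.
rnn_params = embedding_dim * hidden_units + hidden_units * hidden_units + hidden_units

# Dense: weights from the hidden state to each vocabulary index + bias.
dense_params = hidden_units * vocab_size + vocab_size

total = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total)
# → 120 1376 396 1892
```

With only 11 training samples and nearly two thousand parameters, it is not surprising that the model reaches a training accuracy of 1.0.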

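def sentence_generation(model, tokenizer, current_word, n):
    """Generate n words after current_word, always taking the most likely next word."""
    init_word = current_word
    sentence = ''

    for _ in range(n):
        # Encode the text so far and pad it to the training input length (max_len - 1 = 5)
        encoded = tokenizer.texts_to_sequences([current_word])[0]
        encoded = pad_sequences([encoded], maxlen=5, padding='pre')

        # Pick the index with the highest predicted probability
        result = model.predict(encoded, verbose=0)
        result = int(np.argmax(result, axis=1)[0])

        # Look up the word that belongs to the predicted index
        word = ''
        for candidate, index in tokenizer.word_index.items():
            if index == result:
                word = candidate
                break

        # Append the predicted word to the model input and to the output
        current_word = current_word + ' ' + word
        sentence = sentence + ' ' + word

    sentence = init_word + sentence
    return sentence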
    
print(sentence_generation(model, tokenizer, '그의', 2))
print(sentence_generation(model, tokenizer, '경마장에', 4))
print(sentence_generation(model, tokenizer, '가는', 5))

# 그의 말이 법이다
# 경마장에 있는 말이 뛰고 있다
# 가는 말이 고와야 오는 말이 곱다

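* We wrote a function that takes a starting word and returns the predicted continuation. For all three starting words, the model reproduces the corresponding training sentence.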
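The generation loop itself (predict the next word, append it, feed the longer text back in) does not depend on Keras. The sketch below shows the same loop with a hypothetical stand-in for the trained model: a lookup table `next_word` that maps each training prefix to the word that follows it. This is an assumption made purely for illustration; the real model outputs a probability distribution and the function takes its argmax.

```python
sentences = [
    "경마장에 있는 말이 뛰고 있다",
    "그의 말이 법이다",
    "가는 말이 고와야 오는 말이 곱다",
]

# Hypothetical stand-in for the trained model: prefix -> next word.
next_word = {}
for s in sentences:
    words = s.split()
    for i in range(1, len(words)):
        next_word[tuple(words[:i])] = words[i]

def generate(seed, n):
    """Append up to n predicted words to the seed, one word at a time."""
    words = seed.split()
    for _ in range(n):
        prediction = next_word.get(tuple(words))
        if prediction is None:  # prefix never seen in training: stop early
            break
        words.append(prediction)
    return " ".join(words)

print(generate("그의", 2))
print(generate("경마장에", 4))
print(generate("가는", 5))
```

The whole prefix is used as the key, not just the last word, because '말이' is followed by a different word in each sentence. This is the same reason the RNN needs the preceding context to predict correctly.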

https://wikidocs.net/45101
