16_1_Generating_Shakespearean_Text_Using_a_Char_RNN

In [16]:

import sklearn
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

16.1 Creating the Training Dataset¶

In [17]:

shakespear_url="https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
filepath=keras.utils.get_file('shakespear.txt', shakespear_url)
with open(filepath) as f:
  shakespear_text=f.read()

Downloading data from https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
1122304/1115394 [==============================] - 0s 0us/step
1130496/1115394 [==============================] - 0s 0us/step

In [19]:

print(shakespear_text[:148])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

In [20]:

"".join(sorted(set(shakespear_text.lower())))

Out[20]:

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

In [21]:

# Create the tokenizer object.
# char_level=True encodes at the character level rather than the word level.
tokenizer=keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespear_text)
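
To make the character-level encoding concrete, here is a minimal plain-Python sketch of roughly what such a tokenizer does: lowercase the text, give each distinct character an ID starting at 1 (most frequent first), and map text to IDs and back. The helper names `build_char_index`, `encode`, and `decode` are hypothetical and only illustrate the idea; they are not part of Keras, and tie-breaking between equally frequent characters may differ from the real Tokenizer.

```python
from collections import Counter


def build_char_index(text):
    """Map each lowercase character to an ID starting at 1, most frequent first."""
    counts = Counter(text.lower())
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}


def encode(text, char_index):
    """Convert text to 0-based IDs (the same shift as the '- 1' used in this notebook)."""
    return [char_index[ch] - 1 for ch in text.lower()]


def decode(ids, char_index):
    """Convert 0-based IDs back to (lowercased) text."""
    index_char = {i: ch for ch, i in char_index.items()}
    return "".join(index_char[i + 1] for i in ids)


sample = "First Citizen"
char_index = build_char_index(sample)
ids = encode(sample, char_index)
print(ids)
print(decode(ids, char_index))
```

Because the text is lowercased before encoding, decoding returns the lowercase form of the input rather than the original capitalization.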

In [22]:

max_id=len(tokenizer.word_index) # number of distinct characters
dataset_size=tokenizer.document_count # total number of characters

In [23]:

# Subtract 1 so that the IDs run from 0 to 38 instead of 1 to 39.
[encoded]=np.array(tokenizer.texts_to_sequences([shakespear_text])) - 1
train_size=dataset_size * 90 // 100
# Build a dataset from the first 90% of the data, to be used as training data.
dataset=tf.data.Dataset.from_tensor_slices(encoded[:train_size])

Chopping the Sequential Dataset into Multiple Windows¶

The window() method lets us cut the long sequence into much shorter sequences to train on.
window() produces a nested dataset (a dataset of datasets). However, the model expects tensors as input, so we additionally have to flatten it with the flat_map() method.
flat_map() takes as its argument a function to apply to each nested dataset, e.g. lambda ds: ds.batch(2).
In our dataset, the target is the input window shifted by one step.
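
The windowing and the one-step shift between input and target can be sketched in plain Python. This is only an analogue of what window(), flat_map(), and the later map() call produce, not the tf.data implementation; the `sliding_windows` helper and the toy `toy_ids` list are assumptions made for illustration.

```python
def sliding_windows(sequence, window_length, shift=1):
    """Return every full window of `window_length` items; partial windows are dropped."""
    last_start = len(sequence) - window_length
    return [sequence[i:i + window_length] for i in range(0, last_start + 1, shift)]


toy_ids = list(range(10))  # stand-in for the encoded character IDs
toy_n_steps = 3
windows = sliding_windows(toy_ids, window_length=toy_n_steps + 1)

# Input = every item but the last, target = every item but the first.
pairs = [(w[:-1], w[1:]) for w in windows]
for X, Y in pairs[:2]:
    print(X, "->", Y)
```

Each window holds n_steps + 1 items so that, after splitting, both the input and the target contain exactly n_steps items.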

In [24]:

# Set each window to contain 101 characters.
n_steps = 100
window_length=n_steps+1
dataset=dataset.window(window_length, shift=1, drop_remainder=True) # drop_remainder=True makes every window the same length

In [25]:

# Flatten the nested dataset: turn each window into a single tensor.
dataset=dataset.flat_map(lambda window: window.batch(window_length))

In [26]:

batch_size=32
dataset=dataset.shuffle(10000).batch(batch_size)
dataset=dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

In [27]:

#X_batch one-hot encoding
dataset=dataset.map(
    lambda X_batch, Y_batch : (tf.one_hot(X_batch, depth=max_id), Y_batch))

In [28]:

dataset=dataset.prefetch(1)

In [29]:

for X_batch, Y_batch in dataset.take(1):
  print(X_batch.shape, Y_batch.shape)

(32, 100, 39) (32, 100)
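
The last dimension of X_batch equals max_id because every character ID is expanded into a one-hot vector, while Y_batch keeps the raw integer IDs. Below is a small plain-Python sketch of that expansion on toy data; the `one_hot` helper is a hypothetical stand-in for tf.one_hot, and the toy sizes (2 sequences, 3 steps, depth 4) are assumptions.

```python
def one_hot(ids, depth):
    """Expand each integer ID into a vector of length `depth` with a single 1.0."""
    return [[1.0 if j == i else 0.0 for j in range(depth)] for i in ids]


toy_batch = [[0, 2, 1], [1, 1, 0]]  # 2 sequences, 3 time steps each
toy_one_hot = [one_hot(seq, depth=4) for seq in toy_batch]

# (batch size, time steps, depth), analogous to (32, 100, 39) above
shape = (len(toy_one_hot), len(toy_one_hot[0]), len(toy_one_hot[0][0]))
print(shape)  # (2, 3, 4)
```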

Building and Training the Char-RNN Model¶

In [ ]:

model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id],
                     #dropout=0.2, recurrent_dropout=0.2),
                     dropout=0.2),
    keras.layers.GRU(128, return_sequences=True,
                     #dropout=0.2, recurrent_dropout=0.2),
                     dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id,
                                                    activation="softmax"))
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
history = model.fit(dataset, epochs=10)

Epoch 1/10
31368/31368 [==============================] - 376s 12ms/step - loss: 1.6206
Epoch 2/10
31368/31368 [==============================] - 351s 11ms/step - loss: 1.5369
Epoch 3/10
31368/31368 [==============================] - 347s 11ms/step - loss: 1.5171
Epoch 4/10
31368/31368 [==============================] - 344s 11ms/step - loss: 1.5053
Epoch 5/10
31368/31368 [==============================] - 346s 11ms/step - loss: 1.4980
Epoch 6/10
31368/31368 [==============================] - 344s 11ms/step - loss: 1.4927
Epoch 7/10
31368/31368 [==============================] - 344s 11ms/step - loss: 1.4891
Epoch 8/10
31368/31368 [==============================] - 347s 11ms/step - loss: 1.4864
Epoch 9/10
31368/31368 [==============================] - 345s 11ms/step - loss: 1.4842
Epoch 10/10
31368/31368 [==============================] - 346s 11ms/step - loss: 1.4821
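
The 31368 steps per epoch in the log can be checked by hand. The sketch below assumes the dataset holds 1,115,394 characters (the size reported in the download log above) and reuses the window length and batch size set earlier. The last batch is smaller than 32 because batch() was called without drop_remainder.

```python
import math

total_chars = 1_115_394              # assumption: size taken from the download log
n_train = total_chars * 90 // 100    # first 90% used for training
win_len = 101                        # n_steps + 1
per_batch = 32

# With shift=1 and drop_remainder=True, one full window starts at each
# position that still has win_len characters after it.
n_windows = n_train - win_len + 1
steps_per_epoch = math.ceil(n_windows / per_batch)
print(n_train, n_windows, steps_per_epoch)  # 1003854 1003754 31368
```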

Using the Char-RNN Model¶

In [37]:

# Preprocessing function: encode the given texts as 0-based IDs, then one-hot encode them.
def preprocess(texts):
  X=np.array(tokenizer.texts_to_sequences(texts)) - 1
  return tf.one_hot(X, max_id)

In [ ]:

X_new = preprocess(["How are yo"])
#Y_pred = model.predict_classes(X_new)
Y_pred = np.argmax(model(X_new), axis=-1)
tokenizer.sequences_to_texts(Y_pred + 1)[0][-1] # 1st sentence, last char

Out[ ]:

'u'

Generating Fake Shakespearean Text¶

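To generate text with the Char-RNN model, we feed it some initial text, append the predicted character to the input, and ask for the next prediction, over and over. However, if we always pick the most likely character, the same words often end up being repeated again and again.
To avoid this, we use the tf.random.categorical() function to pick the next character at random, based on the probabilities estimated by the model.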
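
The effect of dividing the log-probabilities by a temperature before sampling can be sketched without TensorFlow. This is an illustrative plain-Python analogue, not what tf.random.categorical() does internally. The helpers `apply_temperature` and `sample_id` and the toy distribution are assumptions, and the sketch requires strictly positive probabilities (log(0) is undefined).

```python
import math
import random


def apply_temperature(probas, temperature):
    """Divide log-probabilities by the temperature, then renormalize with a softmax."""
    scaled = [math.log(p) / temperature for p in probas]  # requires p > 0
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_id(probas, temperature=1.0, rng=random):
    """Draw one index at random according to the rescaled probabilities."""
    weights = apply_temperature(probas, temperature)
    return rng.choices(range(len(probas)), weights=weights, k=1)[0]


toy_probas = [0.6, 0.3, 0.1]  # toy next-character distribution
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(toy_probas, t)])
```

A temperature below 1 sharpens the distribution toward the most likely character, a temperature of 1 leaves it unchanged, and a temperature above 1 flattens it toward uniform.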

In [39]:

# The higher the temperature, the closer all characters come to being equally likely.
def next_char(text, temperature=1):
  X_new=preprocess([text])
  y_proba=model(X_new)[0, -1:, :]
  rescaled_logits=tf.math.log(y_proba) / temperature
  char_id=tf.random.categorical(rescaled_logits, num_samples=1) + 1
  return tokenizer.sequences_to_texts(char_id.numpy())[0]

In [40]:

def complete_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

In [ ]:

tf.random.set_seed(42)

print(complete_text("t", temperature=0.2))

the maid in padua for my father is a stood
and so m

In [ ]:

print(complete_text("t", temperature=1))

toke on advised in sobel countryman,
and signior gr

In [ ]:

print(complete_text("t", temperature=2))

tpeniomently!
well maze: yet 'pale deficuruli-faeem

16.1.7 Stateful RNN¶

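Stateful RNN: normally, the hidden state updated at each time step is discarded at the end of a training batch and is not used by the next batch. If the next batch instead starts from the final hidden state of the previous batch, the model can learn long-term patterns.
For this to work, each input sequence must start exactly where the corresponding sequence in the previous batch ended, so the sequences have to be sequential and non-overlapping.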
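
One way to get sequential, non-overlapping input sequences is to shift each window by n_steps instead of by 1. The plain-Python sketch below shows that arrangement on toy data, as if each batch held a single sequence; the `consecutive_windows` helper and the toy values are assumptions, not code from the original notebook.

```python
def consecutive_windows(sequence, n_steps):
    """Windows of n_steps + 1 items whose start positions are n_steps apart."""
    window_length = n_steps + 1
    last_start = len(sequence) - window_length
    return [sequence[i:i + window_length]
            for i in range(0, last_start + 1, n_steps)]


stateful_ids = list(range(10))  # stand-in for the encoded character IDs
stateful_pairs = [(w[:-1], w[1:])
                  for w in consecutive_windows(stateful_ids, n_steps=3)]
for X, Y in stateful_pairs:
    print(X, "->", Y)
```

Each input starts right where the previous input ended, so a hidden state carried over from one sequence is a meaningful starting point for the next.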

대소기의 블로구
