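Data preprocessing methods

1) Preprocess the data ahead of time with tools such as numpy and pandas
2) Preprocess the data on the fly while loading it with the Data API
3) Include preprocessing layers directly in the model

This post looks at the third approach: including preprocessing layers directly in the model.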

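import numpy as np
from tensorflow import keras

# Per-feature statistics computed once on the training set
means=np.mean(X_train, axis=0, keepdims=True)
stds=np.std(X_train, axis=0, keepdims=True)
eps=keras.backend.epsilon()  # small constant that avoids division by zero
model=keras.models.Sequential([
    keras.layers.Lambda(lambda inputs: (inputs - means) / (stds + eps)),
    ...
])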

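This code builds a Keras Sequential model whose first layer is a Lambda layer that performs the preprocessing.
The Lambda layer standardizes each input feature: it subtracts the feature's mean and divides by its standard deviation (plus a small epsilon to avoid division by zero).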
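To make the arithmetic concrete, here is a minimal plain-Python sketch of the per-feature standardization described above. It does not use Keras: the tiny dataset and the `standardize` helper are made up for illustration, and `EPS` is assumed to match the default value of keras.backend.epsilon().

```python
import statistics

EPS = 1e-7  # assumption: same as the default keras.backend.epsilon()

# Hypothetical training set: 4 instances, 2 features.
# The second feature is constant, so its standard deviation is 0.
X_train = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [4.0, 10.0]]

# Column-wise mean and population standard deviation,
# the same statistics np.mean and np.std compute with axis=0.
columns = list(zip(*X_train))
means = [statistics.fmean(col) for col in columns]
stds = [statistics.pstdev(col) for col in columns]

def standardize(row):
    """Apply (x - mean) / (std + eps) to every feature of one instance."""
    return [(x - m) / (s + EPS) for x, m, s in zip(row, means, stds)]

X_scaled = [standardize(row) for row in X_train]
for row in X_scaled:
    print(row)
```

The constant second feature shows why the epsilon term is there: without it, the division would be 0 / 0.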


class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        # Compute per-feature statistics once, before the layer is used
        self.means_=np.mean(data_sample, axis=0, keepdims=True)
        self.stds_=np.std(data_sample, axis=0, keepdims=True)
    def call(self, inputs):
        # Standardize the inputs with the statistics stored by adapt()
        return (inputs-self.means_) / (self.stds_ + keras.backend.epsilon())

std_layer=Standardization()
std_layer.adapt(data_sample)  # must be called before the layer is used

model=keras.Sequential()
model.add(std_layer)
...
model.compile([...])
model.fit([...])

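A preprocessing layer can also be written as a custom class.
Before the layer is used, its adapt method must be called to compute the mean and standard deviation of the dataset.
The full dataset does not have to be passed in: a few hundred randomly selected instances are enough.
Keras provides a keras.layers.Normalization layer that works much like this custom layer, which makes it easy to create a standardization layer.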
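The claim that a few hundred instances are enough can be checked with a small experiment. The sketch below is plain Python on synthetic data (the distribution, dataset size, and sample size are arbitrary choices for illustration), comparing the statistics of a random sample with those of the full dataset.

```python
import random
import statistics

random.seed(42)

# Synthetic single feature: 20,000 values drawn from N(mean=50, std=5).
full_data = [random.gauss(50.0, 5.0) for _ in range(20_000)]

# "Adapt" on a random sample of 500 instances instead of the full dataset.
data_sample = random.sample(full_data, 500)

full_mean = statistics.fmean(full_data)
full_std = statistics.pstdev(full_data)
sample_mean = statistics.fmean(data_sample)
sample_std = statistics.pstdev(data_sample)

print(f"full:   mean={full_mean:.2f}, std={full_std:.2f}")
print(f"sample: mean={sample_mean:.2f}, std={sample_std:.2f}")
```

The two sets of statistics should differ only slightly, which is good enough for scaling inputs.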

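13.3.1 Encoding categorical features using one-hot vectors

Categorical features must be converted to numerical values before a model can use them.
One way to do so is to encode each category as a one-hot vector.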


import tensorflow as tf

vocab=["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
indices=tf.range(len(vocab), dtype=tf.int64)
table_init=tf.lookup.KeyValueTensorInitializer(vocab, indices)
num_oov_buckets=2  # extra buckets for categories missing from vocab
table=tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

categories=tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND"])
cat_indices=table.lookup(categories)  # category strings -> integer indices
cat_one_hot=tf.one_hot(cat_indices, depth=len(vocab) + num_oov_buckets)
cat_one_hot

# <tf.Tensor: shape=(4, 7), dtype=float32, numpy=
# array([[0., 0., 0., 1., 0., 0., 0.],
#        [0., 0., 0., 0., 0., 1., 0.],
#        [0., 1., 0., 0., 0., 0., 0.],
#        [0., 1., 0., 0., 0., 0., 0.]], dtype=float32)>

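The oov (out-of-vocabulary) buckets handle categories that are not defined in vocab. They are useful when there are too many categories to list them all in advance.
Among the categories above, "DESERT" is not in the vocabulary. It is hashed to one of the two oov buckets, at index 5 or 6; in this example it lands at index 5.
The Keras API includes a keras.layers.TextVectorization layer that performs the same task; its adapt() method extracts the vocabulary from a data sample. It is covered later in the exercises.
When there are only a few categories (fewer than about 10), one-hot encoding is generally used. When there are more than 50 categories, embeddings are usually preferred. Between 10 and 50, try both and keep whichever works better.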
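The lookup-then-one-hot logic can be sketched without TensorFlow. This is a plain-Python illustration under stated assumptions, not the library's implementation: the `lookup` and `one_hot` helpers are hypothetical, and CRC32 stands in for TensorFlow's own hash function, so the bucket chosen for an unknown string may differ from what the table above returns.

```python
import zlib

vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
num_oov_buckets = 2
index_of = {category: i for i, category in enumerate(vocab)}

def lookup(category):
    """Return the vocabulary index, or hash an unknown category into an oov bucket."""
    if category in index_of:
        return index_of[category]
    # Assumption: CRC32 as a stable stand-in for the table's hash function.
    bucket = zlib.crc32(category.encode("utf-8")) % num_oov_buckets
    return len(vocab) + bucket

def one_hot(index, depth):
    """Return a list of `depth` floats with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(depth)]

depth = len(vocab) + num_oov_buckets
categories = ["NEAR BAY", "DESERT", "INLAND", "INLAND"]
cat_indices = [lookup(c) for c in categories]
cat_one_hot = [one_hot(i, depth) for i in cat_indices]

for category, vector in zip(categories, cat_one_hot):
    print(f"{category:10s} {vector}")
```

Known categories always map to their vocabulary position, while every unknown category falls somewhere in the last `num_oov_buckets` positions, so two different unknown categories can collide in the same bucket.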

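13.3.2 Encoding categorical features using embeddings

An embedding is a trainable dense vector that represents a category.
Embeddings are initialized randomly, and the number of dimensions of each vector is a hyperparameter.
During training, the vectors of categories (or words) with similar meanings move closer together, while those of dissimilar ones move apart.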


vocab=["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
indices=tf.range(len(vocab), dtype=tf.int64)
table_init=tf.lookup.KeyValueTensorInitializer(vocab, indices)
num_oov_buckets=2
table=tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

# Randomly initialized embedding matrix: one 2D row per category index
embedding_dim=2
embed_init=tf.random.uniform([len(vocab) + num_oov_buckets, embedding_dim])
embedding_matrix=tf.Variable(embed_init)

categories=tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND"])
cat_indices=table.lookup(categories)
tf.nn.embedding_lookup(embedding_matrix, cat_indices)

# <tf.Tensor: shape=(4, 2), dtype=float32, numpy=
# array([[0.7309859 , 0.28189003],
#        [0.6422187 , 0.732231  ],
#        [0.05060315, 0.41399097],
#        [0.05060315, 0.41399097]], dtype=float32)>

The tf.nn.embedding_lookup() function returns the rows of the embedding matrix located at the given indices.
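Conceptually, an embedding lookup is just row selection. The following plain-Python sketch illustrates that idea with a list of lists; the `embedding_lookup` helper here is a hypothetical stand-in written for this example, not the TensorFlow function.

```python
import random

random.seed(0)

num_categories = 7  # 5 vocabulary entries + 2 oov buckets, as in the text
embedding_dim = 2

# One randomly initialized row per category index.
embedding_matrix = [[random.random() for _ in range(embedding_dim)]
                    for _ in range(num_categories)]

def embedding_lookup(matrix, indices):
    """Return the rows of `matrix` selected by `indices`, in order."""
    return [matrix[i] for i in indices]

cat_indices = [3, 5, 1, 1]
embedded = embedding_lookup(embedding_matrix, cat_indices)

for index, vector in zip(cat_indices, embedded):
    print(index, vector)
```

The two instances with index 1 receive exactly the same vector, which matches the repeated last two rows in the output shown above.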


embedding=keras.layers.Embedding(input_dim=len(vocab) + num_oov_buckets,
                                 output_dim=embedding_dim)
embedding(cat_indices)

# <tf.Tensor: shape=(4, 2), dtype=float32, numpy=
# array([[ 0.03839845,  0.01719275],
#        [ 0.01920623,  0.03352095],
#        [-0.03839232, -0.04356638],
#        [-0.03839232, -0.04356638]], dtype=float32)>

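Keras provides a keras.layers.Embedding layer that manages such an embedding matrix, which makes an embedding layer easy to implement.
When the layer is created it initializes the embedding matrix randomly, and when it is called with category indices it returns the rows at those indices in the embedding matrix.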
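To show what "trainable" means for such a layer, here is a toy plain-Python sketch. The `TinyEmbedding` class, its `sgd_step` method, and the squared-error objective are all hypothetical and chosen for illustration; a real Keras layer is trained by the optimizer through backpropagation, not by a method like this.

```python
import random

class TinyEmbedding:
    """Hypothetical stand-in for an embedding layer: a table of trainable rows."""

    def __init__(self, input_dim, output_dim, seed=0):
        rng = random.Random(seed)
        # Small random initial values, one row per category index.
        self.matrix = [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
                       for _ in range(input_dim)]

    def __call__(self, indices):
        # Return copies of the selected rows.
        return [list(self.matrix[i]) for i in indices]

    def sgd_step(self, index, target, learning_rate=0.1):
        """One gradient step on 0.5 * ||row - target||^2 for a single row."""
        row = self.matrix[index]
        for k in range(len(row)):
            row[k] -= learning_rate * (row[k] - target[k])

embedding = TinyEmbedding(input_dim=7, output_dim=2)
before = embedding([1, 3])

# Train only the row for category index 1 toward an arbitrary target vector.
for _ in range(100):
    embedding.sgd_step(index=1, target=[1.0, -1.0])

after = embedding([1, 3])
print("index 1:", before[0], "->", after[0])
print("index 3:", before[1], "->", after[1])
```

Only the row that took part in the loss changes; the row for index 3 keeps its initial random values, just as categories that never appear in the training data keep untrained embeddings.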


regular_inputs=keras.layers.Input(shape=[8])
categories=keras.layers.Input(shape=[], dtype=tf.string)
# Map category strings to indices inside the model
cat_indices=keras.layers.Lambda(lambda cats: table.lookup(cats))(categories)
# input_dim must cover every index the table can return (vocab + oov buckets)
cat_embed=keras.layers.Embedding(input_dim=len(vocab) + num_oov_buckets,
                                 output_dim=2)(cat_indices)
encoded_inputs=keras.layers.concatenate([regular_inputs, cat_embed])
outputs=keras.layers.Dense(1)(encoded_inputs)
model=keras.models.Model(inputs=[regular_inputs, categories],
                         outputs=[outputs])

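This code builds a Keras model that uses an embedding layer: the categorical input is mapped to indices, embedded, and concatenated with the regular numerical inputs.

13.3.3 Keras preprocessing layers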

keras.layers.Discretization()

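The Discretization layer splits continuous data into a number of bins and encodes each bin as a one-hot vector. (In recent Keras versions the layer returns integer bin indices by default, and one-hot output is selected with output_mode="one_hot".) Binning discards a lot of information, but it can expose features or patterns that are hard to see in the raw continuous values, which makes it worth trying.
The Discretization layer is not differentiable, but it does not need to be: it is a preprocessing layer, and preprocessing layers are not updated by gradient descent.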
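The binning step itself is simple. The sketch below reproduces the idea in plain Python with the standard bisect module; the bin boundaries and the `discretize` and `one_hot` helpers are hypothetical, and the boundary convention used here (each bin includes its lower edge) is a choice made for this example.

```python
from bisect import bisect_right

bin_boundaries = [0.0, 10.0, 20.0]  # hypothetical: 3 boundaries give 4 bins

def discretize(value):
    """Return the bin index: 0 for value < 0, 1 for [0, 10), 2 for [10, 20), 3 otherwise."""
    return bisect_right(bin_boundaries, value)

def one_hot(index, depth):
    """Return a list of `depth` floats with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(depth)]

values = [-3.5, 0.0, 9.99, 15.0, 42.0]
bin_indices = [discretize(v) for v in values]
encoded = [one_hot(i, len(bin_boundaries) + 1) for i in bin_indices]

print(bin_indices)  # [0, 1, 1, 2, 3]
for value, vector in zip(values, encoded):
    print(value, vector)
```

Note that 0.0 and 9.99 end up with identical encodings: this is the information loss mentioned above, and also the reason the operation has no useful gradient.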

keras.layers.PreprocessingStage()


normalization=keras.layers.Normalization()
discretization=keras.layers.Discretization([...])
pipeline=keras.layers.PreprocessingStage([normalization, discretization])
pipeline.adapt(data_sample)

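PreprocessingStage chains several preprocessing layers into a single pipeline, and calling adapt() on the pipeline adapts the layers it contains. Depending on the Keras version, this class may only be available under an experimental namespace or not at all.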
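The chaining idea can be sketched in plain Python. Everything in this example is hypothetical (the `Standardize`, `Discretize`, and `Pipeline` classes are not Keras classes); it only illustrates one reasonable design, in which each stage is adapted on the output of the stages before it.

```python
import statistics
from bisect import bisect_right

class Standardize:
    def adapt(self, data):
        # Learn the statistics from the data sample.
        self.mean = statistics.fmean(data)
        self.std = statistics.pstdev(data)

    def __call__(self, data):
        return [(x - self.mean) / (self.std + 1e-7) for x in data]

class Discretize:
    def __init__(self, bin_boundaries):
        self.bin_boundaries = bin_boundaries

    def adapt(self, data):
        pass  # fixed boundaries: nothing to learn

    def __call__(self, data):
        return [bisect_right(self.bin_boundaries, x) for x in data]

class Pipeline:
    """Hypothetical chain of preprocessing stages (not the Keras class)."""

    def __init__(self, stages):
        self.stages = stages

    def adapt(self, data):
        # Each stage sees the data as transformed by the previous stages.
        for stage in self.stages:
            stage.adapt(data)
            data = stage(data)

    def __call__(self, data):
        for stage in self.stages:
            data = stage(data)
        return data

pipeline = Pipeline([Standardize(), Discretize([-1.0, 0.0, 1.0])])
data_sample = [2.0, 4.0, 6.0, 8.0, 10.0]
pipeline.adapt(data_sample)

bins = pipeline([2.0, 6.0, 10.0])
print(bins)  # [0, 2, 3]
```

Because the boundaries of the second stage are expressed in standardized units, the order of the stages matters: adapting the stages independently on the raw data would give the discretization step inputs on the wrong scale.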

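Several other preprocessing layers exist as well.

13.4 TF Transform

Data processing

Small datasets: use the cache() method to keep the data in RAM, and read the cached data on later passes.
Large datasets: build a pipeline with a tool such as Apache Beam or Spark, which run on cluster computing engines designed for large-scale data processing. This makes it possible to preprocess all the training data before training starts.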

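Training/serving skew

Training/serving skew is a mismatch between the preprocessing applied during training and the preprocessing applied in the deployment environment, such as an app or a browser.
With ahead-of-time preprocessing, the preprocessing code has to be added again for every platform the model is deployed to. This makes maintenance harder, and because the app or browser must run its own copy of the preprocessing operations, it invites bugs and degraded performance.
One way around this is to add preprocessing layers to the trained model so that preprocessing happens on the fly. The drawback is that the extra preprocessing operations inside the model slow it down.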
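A toy example makes the skew problem tangible. In this plain-Python sketch, all names and numbers are hypothetical: the training pipeline standardizes the input, while the re-implementation shipped with the app applies min-max scaling by mistake, and a trivial threshold "model" flips its prediction as a result.

```python
import statistics

train_values = [10.0, 20.0, 30.0, 40.0]
mean = statistics.fmean(train_values)
std = statistics.pstdev(train_values)

def preprocess_training(x):
    """Preprocessing used to build the training set: standardization."""
    return (x - mean) / std

def preprocess_serving(x):
    """Hypothetical re-implementation in the app: min-max scaling by mistake."""
    low, high = min(train_values), max(train_values)
    return (x - low) / (high - low)

def model(z):
    """Toy 'model' fit on standardized inputs: predicts positive when z > 0."""
    return z > 0.0

x = 20.0  # below the training mean of 25
prediction_training_path = model(preprocess_training(x))
prediction_serving_path = model(preprocess_serving(x))

print("training preprocessing:", prediction_training_path)
print("serving preprocessing: ", prediction_serving_path)
```

The same raw input yields different predictions depending on which copy of the preprocessing code ran, and nothing in the model itself signals that anything went wrong.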

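TF Transform

With TF Transform the preprocessing operations are defined only once, so training/serving skew does not occur, and the drawback of adding preprocessing layers on the fly is avoided as well.

The resulting preprocessing function can be applied to the whole training set using Apache Beam.

13.5 The TensorFlow Datasets project

TensorFlow Datasets (TFDS) provides a wide range of ready-to-use datasets.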

import tensorflow_datasets as tfds

dataset=tfds.load(name='mnist')
mnist_train, mnist_test = dataset['train'], dataset['test']

# Option 1: iterate over the dictionary items directly
mnist_train=mnist_train.shuffle(10000).batch(32).prefetch(1)
for item in mnist_train:
  images=item['image']
  labels=item['label']
  [...]

# Option 2: start again from dataset['train'] and convert each item to a (features, label) tuple
mnist_train=dataset['train'].shuffle(10000).batch(32)
mnist_train=mnist_train.map(lambda items: (items['image'], items['label']))
mnist_train=mnist_train.prefetch(1)

# Option 3: let tfds.load() do the batching and return (features, label) tuples
dataset=tfds.load(name='mnist', batch_size=32, as_supervised=True)
mnist_train=dataset['train'].prefetch(1)
model=keras.models.Sequential([...])
model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd')
model.fit(mnist_train, epochs=5)

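Each item in the dataset is a dictionary containing the features and the label.
Keras expects (features, label) tuples, so after loading the dataset either convert the items with the map() method or pass as_supervised=True to tfds.load().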
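The dictionary-to-tuple conversion is easy to picture with ordinary Python data. The sketch below does not use tensorflow_datasets: the `items` list, the `batches` generator, and the `to_supervised` function are hypothetical stand-ins that mimic the shuffle, batch, and map steps on a small in-memory list.

```python
import random
from itertools import islice

# Hypothetical stand-in for a dataset whose items are feature/label dictionaries.
items = [{"image": [i, i + 1], "label": i % 10} for i in range(100)]

def batches(iterable, batch_size):
    """Yield lists of up to `batch_size` consecutive items."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, batch_size)):
        yield batch

def to_supervised(batch):
    """Convert a batch of dictionaries into an (images, labels) tuple."""
    images = [item["image"] for item in batch]
    labels = [item["label"] for item in batch]
    return images, labels

random.seed(0)
shuffled = random.sample(items, len(items))  # shuffled copy, `items` is untouched
dataset = [to_supervised(batch) for batch in batches(shuffled, 32)]

for images, labels in dataset:
    print(len(images), len(labels))
```

Each element of `dataset` is now a pair that a training loop can unpack directly, which is the shape that model.fit() expects from a dataset of batches.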


대소기의 블로구
