🍳 reference: <모두를 위한 딥러닝 2: pytorch> Lab 9

Vanishing Gradient

기존 sigmoid를 activation funtion으로 사용했었는데,

fig

위 그래프에서 확인가능하듯 양끝단에서의 gradient는 아주 작은 (0에 가까운) 값이다.

backpropagation에서, 계산한 값을 앞으로 보낼때 gradient를 곱하게 되는데

0에 가까운 값을 곱하게 되면, Loss로부터 전달되는 gradient가 소멸되게 된다.

특히 여러 layer가 쌓여있는 Deep neural network의 경우 문제가 된다.

이런 경우를 vanishing gradient라고 한다.

RELU

relu 함수는 아래와 같다.

fig

이 경우, 양수일때 vanishing gradient 문제없이 gradient를 곱할 수 있다.

pytorch에서는 아래와 같은 코드를 사용한다.

relu = torch.nn.ReLU()

# 또는
torch.nn.relu(x)

경우에 따라 leaky relu를 사용하기도 하는데,

fig

위와 같이 음수일때 0 이 아닌 값을 출력한다.

torch.nn.leaky_relu(x, 0.01)

optimizer in pytorch

pytorch에는 아래와 같이 다양한 optimizer가 구현되어 있다.

torch.optim.SGD
torch.optim.Adadelta
torch.optim.Adagrad
torch.optim.Adam
torch.optim.SparseAdam
torch.optim.Adamax
torch.optim.ASGD
torch.optim.LBFGS
torch.optim.RMSprop
torch.optim.Rprop

쨌든 예시 코드를 보면,

# parameters
learning_rate = 0.001
training_epochs = 15
batch_size = 100

# dataset loader
data_loader = torch.utils.data.DataLoader(dataset=mnist_train, batch_size=batch_size, shuffle=True, drop_last=True)

# nn layers
linear1 = torch.nn.Linear(784, 256, bias=True) # 784 = 28*28 (MNIST data image's shape)
linear2 = torch.nn.Linear(256, 256, bias=True)
linear3 = torch.nn.Linear(256, 10, bias=True) # output layer = 10 [0-9]
relu = torch.nn.ReLU()
# Initialization
torch.nn.init.normal_(linear1.weight)
torch.nn.init.normal_(linear2.weight)
torch.nn.init.normal_(linear3.weight)
# model
model = torch.nn.Sequential(linear1, relu, linear2, relu, linear3).to(device)
# define cost/loss & optimizer
criterion = torch.nn.CrossEntropyLoss().to(device)    # Softmax is internally computed.
  ## 3번째에서 relu 적용하지 않은 이유는, CrossEntropyLoss를 구할거기 때문
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate) # ADAM을 optimizer로 사용

# training
total_batch = len(data_loader)
for epoch in range(training_epochs):
    avg_cost = 0

    for X, Y in data_loader:
        # reshape input image into [batch_size by 784]
        # label is not one-hot encoded
        X = X.view(-1, 28 * 28).to(device)
        Y = Y.to(device)

        optimizer.zero_grad()
        hypothesis = model(X)
        cost = criterion(hypothesis, Y)
        cost.backward()
        optimizer.step()

        avg_cost += cost / total_batch

    print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.9f}'.format(avg_cost))
# Test the model using test sets
with torch.no_grad():
    X_test = mnist_test.test_data.view(-1, 28 * 28).float().to(device)
    Y_test = mnist_test.test_labels.to(device)

    prediction = model(X_test)
    correct_prediction = torch.argmax(prediction, 1) == Y_test
    accuracy = correct_prediction.float().mean()
    print('Accuracy:', accuracy.item())

    # Get one and predict
    r = random.randint(0, len(mnist_test) - 1)
    X_single_data = mnist_test.test_data[r:r + 1].view(-1, 28 * 28).float().to(device)
    Y_single_data = mnist_test.test_labels[r:r + 1].to(device)

    print('Label: ', Y_single_data.item())
    single_prediction = model(X_single_data)
    print('Prediction: ', torch.argmax(single_prediction, 1).item())

Weight Initialization

weight Initialization은 성능 향상에 매우 중요하다.

그러나 0으로 초기화하는건 좋지 않고(backpropagation에서 gradient 계산 때문),

RBM을 이용하는 것이 좋다는 결과가 밝혀졌다 (2006)

RBM (Restricted Boltzmann machine)

fig

RBM은 위와 같은 구조를 가지고 있다.

restricted 의 의미는 layer 내부에서는 no connection이라는 뜻. 그러나 다른 layer에서는 fully connected 이다.

입력 x가 들어갔을때 y를 내놓거나, y를 보고 x를 복원할 수 있다. (encoding decoding 느낌)

이러한 RBM을 이용하여 pre-training하여 weight Initialization할 수 있다.

pre-training 방식은 아래와 같다.

fig

(a)에서, 2개의 layer에 대해 RBM으로 학습 - input x와 output y에 대해, y를 넣었을때 x’을 복원하도록 학습
(b)에서, 2개의 layer위의 새로운 layer를 쌓는다. 이때 1에서 학습한 x, h1은 고정시키고, 새로운 layer에 대해 학습
마지막 layer까지 반복 - 한 layer씩 새로 쌓으며 학습
Fine tuning: 학습이 완료되면, 전체를 다 쌓아서 학습 - input x 에 대해 output y를 출력하고, 정답과 비교하여 backpropagation을 진행하는 방식으로 학습

사실 요즘에는 RBM을 잘 사용하지 않고, Xaier, He와 같은 initializer를 사용한다.

이들은 layer의 특성에 따라 다르게 초기화를 한다.

Xavier / He Initialization

fig

굉장히 간단하고 강력한편,,

이때 nin은 input layer의 개수, nout은 output layer의 개수이다.

코드는 아래와 같다.

.
.
# nn layers
linear1 = torch.nn.Linear(784, 256, bias=True)
linear2 = torch.nn.Linear(256, 256, bias=True)
linear3 = torch.nn.Linear(256, 10, bias=True)
relu = torch.nn.ReLU()

# xavier initialization
torch.nn.init.xavier_uniform_(linear1.weight)
torch.nn.init.xavier_uniform_(linear2.weight)
torch.nn.init.xavier_uniform_(linear3.weight)
.
.

Dropout

dropout은 overfitting의 해결책 중 하나인데,

fig

학습을 진행하면서 node들을 무작위로 껐다 켰다 하는 것이다.

즉, 일부 node들만 사용하면서 학습하는 것이다.

dropout = torch.nn.Dropout(p=drop_prob)
.
.
model = torch.nn.Sequential(linear1, relu, dropout,
                            linear2, relu, dropout,
                            linear3, relu, dropout,
                            linear4, relu, dropout,
                            linear5).to(device)

매 학습마다 다른 형태의 network를 가지고 학습하기 때문에, overfitting을 막을 수 있다.

앙상블 효과도 낼 수 있다는 평가도 받는다.

train할때는 model.train(), 평가할 때엔 model.eval()을 사용해야 한다.

Gradient vanishing, Gradient exploding

앞서 언급했던 gradient vanishing과 반대되는 것이 gradient exploding이다.

즉 gradient가 너무 커서 발생하는 문제이다.

이를 해결할 수 있는게 activation funtion 바꾸기 (Relu, leaky relu.. ),

weight initialization(Xavier, He .. ), 작은 learning rate 설정,

그리고 barch Normalization이다.

Internal Covariate Shift

batch normalization을 들어가기 전, gradient문제의 원인인 internal covariate shift를 먼저 살펴보자.

covariate shift는 training set의 분포와 test set의 분포가 차이를 갖는 것이다.

fig

Internal Covariate Shift는 model의 각 layer마다 covariate shift 문제가 발생한다는 것인데,

점점 input과 output의 분포가 달라지는 것이당!

이를 완화하기 위해 input data를 normalization하여 사용했는데, batch normalization은 이와 조금 다르다.

Batch Normalization

Batch Normalization을 통해 gradient 문제들을 근본적으로 해결할 수 있고, 안정적으로 학습할 수 있다.

batch normalization은 각 Layer마다 생기는 internal covariate shift를 해결하기 위함이다.

(당연히 layer가 깊어질수록 점점 shift 가 심해지겠지?)

즉, 각 layer마다 normalization을 진행 - 이때 mini batch마다 normalization을 하는 것이당

fig

training dataset에서 각각 sample mean, sample variance를 구한 후 normalization 하고,

이후 r*x_hat + B를 구한다. (이때 r, B는 backpropagation으로 학습한 parameter)

이때 쓰인 sample mean과 sample variance는 따로 저장해둔다.

학습이 끝나면, test data에 대해 학습된 mean과 variance를 사용해 normalization한 후 (학습한 r,B를 사용한다. )

이를 통해 batch의 구성이 달라도 같은 output을 도출할 수 있다. (batch normalization 문제 해결 가능 )