Midannii's

1. INTRODUCTION

Is having better NLP models as easy as having larger models?
- Large pretrained models have improved performance, but they run into GPU/TPU memory limitations.
ALBERT is proposed to address this.
- Two parameter-reduction techniques
  - factorized embedding parameterization
    
    By decomposing the large vocabulary embedding matrix into two small matrices, we separate the size of the hidden layers from the size of vocabulary embedding. This separation makes it easier to grow the hidden size without significantly increasing the parameter size of the vocabulary embeddings.
  - cross-layer parameter sharing
    
    This technique prevents the number of parameters from growing with the depth of the network.
→ Much smaller than the original BERT, with faster training
- Proposes a self-supervised loss for sentence-order prediction (SOP)
Achieves state-of-the-art results on a range of tasks

ALBERT is built on the BERT architecture (a Transformer encoder).

Factorized embedding parameterization
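- Uses WordPiece embeddings
- From a modeling perspective, WordPiece embeddings learn context-independent representations, while hidden-layer embeddings learn context-dependent representations.
  - NLP generally needs a large vocabulary size V, and in practice V is far larger than the embedding size E. In BERT, E is tied to the hidden size H, so the embedding matrix is V x H and grows with H.
  - This can easily produce models with billions of parameters, most of which are updated only sparsely during training.
  → ALBERT factorizes the embedding parameters into two smaller matrices: V x H -> V x E + E x H
  - A token embedding is produced by multiplying through these two matrices in sequence, hence "factorized embedding".
  - Because V is large and E is much smaller than H, this reduces the number of embedding parameters.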
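
The V x H -> V x E + E x H reduction above can be checked with a little arithmetic. The sketch below is only illustrative: the helper name `embedding_params` is hypothetical, and the sizes V = 30,000, H = 768, E = 128 are assumed values chosen for the example, not figures from these notes.

```python
from typing import Optional


def embedding_params(vocab_size: int, hidden_size: int,
                     embed_size: Optional[int] = None) -> int:
    """Count embedding parameters.

    With embed_size=None the embedding is tied to the hidden size (V x H).
    Otherwise it is factorized into a V x E table plus an E x H projection.
    """
    if embed_size is None:
        return vocab_size * hidden_size
    return vocab_size * embed_size + embed_size * hidden_size


# Assumed sizes for illustration only.
V, H, E = 30_000, 768, 128

tied = embedding_params(V, H)
factorized = embedding_params(V, H, E)

print(f"V x H         = {tied:,}")
print(f"V x E + E x H = {factorized:,}")
print(f"reduction     = {tied / factorized:.2f}x")
```

With these assumed sizes the tied matrix has 23,040,000 parameters and the factorized version 3,938,304. Growing H afterwards only enlarges the small E x H projection, not the V x E table.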
Cross-layer parameter sharing
- The same parameters are shared across all Transformer layers
- This prevents the parameter count from growing with network depth → smaller model
- Parameter sharing also appears to have a stabilizing effect on the network's parameters
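
To see why sharing decouples parameter count from depth, here is a rough count. It is a sketch under stated assumptions: the per-layer estimate counts only the attention and feed-forward weight matrices (no biases or layer norm), assumes a feed-forward width of 4H, and the function names and H = 768 are hypothetical choices for the example.

```python
def layer_params(hidden_size: int) -> int:
    """Approximate weight count of one Transformer encoder layer.

    Attention: four H x H projections (query, key, value, output).
    Feed-forward: H x 4H and 4H x H (assumed width of 4H).
    Biases and layer norm are ignored.
    """
    attention = 4 * hidden_size * hidden_size
    feed_forward = 2 * hidden_size * (4 * hidden_size)
    return attention + feed_forward


def encoder_params(hidden_size: int, num_layers: int, share_layers: bool) -> int:
    """Total encoder weights, with or without cross-layer sharing."""
    per_layer = layer_params(hidden_size)
    # With sharing, every layer reuses the same single set of weights.
    return per_layer if share_layers else per_layer * num_layers


H = 768  # assumed hidden size
for depth in (12, 24, 48):
    unshared = encoder_params(H, depth, share_layers=False)
    shared = encoder_params(H, depth, share_layers=True)
    print(f"{depth:2d} layers: unshared={unshared:,}  shared={shared:,}")
```

The unshared count grows linearly with depth, while the shared count stays fixed at one layer's worth of weights.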
Inter-sentence coherence loss
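- To improve on NSP, ALBERT uses an SOP loss that models inter-sentence coherence.
- The SOP loss takes two consecutive segments from the same document as a positive sample, and the same two segments with their order swapped as a negative sample.
  
  → The model predicts whether the segments are in the correct order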
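
The positive/negative construction described above can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the function name `make_sop_examples`, the 50% swap probability, and the toy document are all assumptions.

```python
import random
from typing import List, Tuple


def make_sop_examples(segments: List[str],
                      rng: random.Random) -> List[Tuple[str, str, int]]:
    """Build (segment_a, segment_b, label) triples from one document.

    label 1: two consecutive segments in their original order (positive).
    label 0: the same two segments with the order swapped (negative).
    """
    examples = []
    for first, second in zip(segments, segments[1:]):
        if rng.random() < 0.5:  # assumed 50/50 split between the two cases
            examples.append((first, second, 1))
        else:
            examples.append((second, first, 0))
    return examples


# Toy document; each string stands in for one segment.
document = [
    "ALBERT shares parameters across layers.",
    "This keeps the model small.",
    "It also factorizes the embedding matrix.",
]

sop_examples = make_sop_examples(document, random.Random(0))
for seg_a, seg_b, label in sop_examples:
    print(label, "|", seg_a, "|", seg_b)
```

Both classes come from the same document, so topic alone cannot separate them; the classifier has to rely on ordering cues.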

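ALBERT-xxlarge outperforms BERT.
Because they require less communication and less computation, ALBERT models have higher data throughput than the corresponding BERT models.
- ALBERT-xlarge can be trained 2.4 times faster than BERT-xlarge.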

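The following three settings are compared:
1. none (XLNet and RoBERTa)
2. NSP (BERT) → no benefit (it models only topic shift)
3. SOP (ALBERT) → improves multi-sentence encoding tasks
- A model trained with NSP performs poorly on the SOP task, but a model trained with SOP also performs well on NSP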

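Longer training generally improves performance, so instead of controlling the number of training steps, this comparison controls the actual training time.
- BERT-large (400k steps) and ALBERT-xxlarge (125k steps) take about the same training time (34h and 32h, respectively).
- After roughly the same training time, ALBERT-xxlarge performs significantly better than BERT-large.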
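
The step counts and hours above also imply very different per-step costs, which is why the comparison fixes wall-clock time rather than steps. A quick calculation using only the figures quoted above (the helper name `steps_per_hour` is hypothetical):

```python
# Training runs as quoted in the notes above.
runs = {
    "BERT-large": {"steps": 400_000, "hours": 34},
    "ALBERT-xxlarge": {"steps": 125_000, "hours": 32},
}


def steps_per_hour(steps: int, hours: float) -> float:
    """Average number of training steps completed per hour."""
    return steps / hours


rates = {name: steps_per_hour(**run) for name, run in runs.items()}
ratio = rates["BERT-large"] / rates["ALBERT-xxlarge"]

for name, rate in rates.items():
    print(f"{name}: {rate:,.0f} steps/hour")
print(f"BERT-large completes about {ratio:.1f}x as many steps per hour")
```

ALBERT-xxlarge therefore gets far fewer updates in the same time budget, yet still ends up ahead.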
