Midannii's

* Summary

SpanBERT is a pre-trained language model (PLM) designed to better represent and predict spans of text.
It is an extension of the original BERT.

(1) The original BERT masks individual tokens, whereas SpanBERT masks contiguous runs of tokens that stretch across several words (contiguous span masking).

(2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it.
SpanBERT outperforms the original BERT on span selection tasks (e.g. QA, coreference resolution).

1. Introduction

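Many NLP tasks involve selecting multiple spans of text.
To improve performance on such span selection tasks, the paper proposes SpanBERT, which changes two things in the original BERT:
- A masking scheme different from BERT's
  
  Instead of masking random individual tokens, it masks random contiguous spans.
- A training objective different from BERT's
  
  span-boundary objective (SBO)
  - The model is trained to predict the tokens inside a span from the tokens at the span's boundary.
  - Because the model stores span-level information in the boundary tokens, that information is easy to access during fine-tuning.
A single-sequence BERT is also added to the baselines.
- This is because pre-training on a single segment performed better than pre-training on two half-length segments with Next Sentence Prediction (NSP).
Performance improves on a variety of QA and coreference resolution tasks.
Whereas prior work sought gains by increasing the amount of data or the model size, this paper improves performance through better pre-training tasks and objectives.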

2. Background: BERT

BERT is a self-supervised approach that pre-trains a transformer encoder.
It uses MLM (masked language modeling) and NSP (next sentence prediction) as pre-training tasks.
BERT optimizes the MLM and the NSP objectives by masking word pieces uniformly at random in data generated by the bi-sequence sampling procedure. In the next section, we will present our modifications to the data pipeline, masking, and pre-training objectives.

3. Model

SpanBERT departs from BERT's bi-text classification framework in the following three ways:

we use a different random process to mask spans of tokens, rather than individual ones. → 3.1 span masking
We also introduce a novel auxiliary objective – the span boundary objective (SBO) – which tries to predict the entire masked span using only the representations of the tokens at the span’s boundary. → 3.2 span boundary objective
SpanBERT samples a single contiguous segment of text for each training example (instead of two), and thus does not use BERT’s next sentence prediction objective. → 3.3. Single-Sequence Training

3.1. Span masking

Given a sequence of tokens X=(x1,x2,…,xn), we select a subset of tokens Y ⊆ X by iteratively sampling spans of text until the masking budget (e.g. 15% of X) has been spent.

how to?
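- At each iteration,
  1. sample a span length (i.e. a number of words) from a geometric distribution
  2. randomly choose the starting point of the span to mask (the starting point must be the beginning of a word)
- Based on experiments, the geometric distribution uses p = 0.2
- Unlike the original BERT's MLM, which decides the replacement for each token individually, the replacement is applied at the span level
- That is, all tokens inside a span are replaced with [MASK] or with sampled tokens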
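
The sampling loop above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the clipping parameter `max_len` (and its default of 10), the retry-on-overlap rule and the fixed seed are all assumptions of this sketch. It works on word indices, so every span automatically starts at a word boundary.

```python
import random


def sample_span_length(rng, p=0.2, max_len=10):
    """Draw a span length (in words) from Geo(p), clipped at max_len."""
    length = 1
    # Keep extending with probability (1 - p): P(length = k) = (1 - p)^(k-1) * p.
    while length < max_len and rng.random() >= p:
        length += 1
    return length


def select_masked_words(num_words, budget=0.15, p=0.2, max_len=10, seed=0):
    """Return sorted word indices chosen for masking (num_words must be >= 1)."""
    rng = random.Random(seed)
    target = max(1, round(num_words * budget))  # masking budget in words
    masked = set()
    while len(masked) < target:
        # Never overshoot the remaining budget (assumption of this sketch).
        length = min(sample_span_length(rng, p, max_len), target - len(masked))
        start = rng.randrange(num_words - length + 1)
        span = range(start, start + length)
        if masked.isdisjoint(span):  # retry if it overlaps an earlier span
            masked.update(span)
    return sorted(masked)


print(select_masked_words(40))
```

The returned indices would then be replaced span by span, so that every token in a given span receives the same treatment ([MASK] or sampled tokens).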

3.2. Span Boundary Objective (SBO)

A typical span selection model builds a fixed-length representation of a span from its boundary tokens (start & end).
- To support this, the paper wants the representations at the ends of the span to summarize as much of the internal span content as possible.
- To that end, each token of the masked span is predicted using only the representations observed at the boundary → span boundary objective!
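In the span boundary objective, given
a masked span of tokens $(x_s, \ldots, x_e) \in Y$ (s is the start, e is the end),

the transformer encoder outputs $\mathbf{x}_1, \ldots, \mathbf{x}_n$,

positional embeddings $\mathbf{p}_1, \mathbf{p}_2, \ldots$ that mark the position of each masked token relative to the left boundary token $x_{s-1}$,

and the positional embedding $\mathbf{p}_{i-s+1}$ of the target token,

the representation of each token $x_i$ in the span is computed from the external boundary tokens (just before the start, just after the end), $\mathbf{x}_{s-1}$ and $\mathbf{x}_{e+1}$: $\mathbf{y}_i = f(\mathbf{x}_{s-1}, \mathbf{x}_{e+1}, \mathbf{p}_{i-s+1})$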
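f is implemented as a 2-layer feed-forward network
- GeLU and layer normalization are applied at each layer (on h_0, then h_1) to obtain y_i: $\mathbf{h}_0 = [\mathbf{x}_{s-1}; \mathbf{x}_{e+1}; \mathbf{p}_{i-s+1}]$, $\mathbf{h}_1 = \mathrm{LayerNorm}(\mathrm{GeLU}(W_1 \mathbf{h}_0))$, $\mathbf{y}_i = \mathrm{LayerNorm}(\mathrm{GeLU}(W_2 \mathbf{h}_1))$
- y_i is then used to predict x_i, trained with a cross-entropy loss (the same way as the MLM objective)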
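
As a rough sketch of the 2-layer network described above, the following pure-Python code computes y_i from the two boundary representations and the relative position embedding. The helper names, the omission of bias terms and of the learnable layer-norm gain/offset, and the toy dimensions are assumptions made to keep the example short. The ordering (GeLU first, then layer normalization) is also assumed here.

```python
import math


def gelu(v):
    """Exact GeLU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def layer_norm(vec, eps=1e-5):
    """Layer normalization without learnable gain/offset (simplification)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def linear(weights, vec):
    """Matrix-vector product; weights is a list of rows, no bias (simplification)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def sbo_representation(x_left, x_right, pos_emb, w1, w2):
    """y_i = f(x_{s-1}, x_{e+1}, p_{i-s+1}) as a 2-layer feed-forward network."""
    h0 = x_left + x_right + pos_emb  # list concatenation = [x_{s-1}; x_{e+1}; p]
    h1 = layer_norm([gelu(v) for v in linear(w1, h0)])
    return layer_norm([gelu(v) for v in linear(w2, h1)])


# Toy example: hidden size 4, position-embedding size 2 (hypothetical values).
x_left = [0.1, -0.2, 0.3, 0.4]
x_right = [0.0, 0.5, -0.1, 0.2]
pos_emb = [0.5, -0.5]
w1 = [[0.01 * (r + c) for c in range(10)] for r in range(4)]
w2 = [[0.1 * (r - c) for c in range(4)] for r in range(4)]
print(sbo_representation(x_left, x_right, pos_emb, w1, w2))
```

Note that nothing from inside the span enters `sbo_representation`; only the two boundary vectors and the position of the target token do, which is the point of the objective.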
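In SpanBERT,
- the input embeddings of the target tokens are reused for both MLM and SBO
- the losses of the span boundary objective and the regular masked language model objective are summed
  
  $\mathcal{L}(x_i) = \mathcal{L}_{\text{MLM}}(x_i) + \mathcal{L}_{\text{SBO}}(x_i) = -\log P(x_i \mid \mathbf{x}_i) - \log P(x_i \mid \mathbf{y}_i)$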
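
To make the summed loss concrete, here is a small sketch that computes it for one masked token from two sets of vocabulary logits. The function names and the toy 4-word vocabulary are hypothetical. The only claim is that each term is a standard softmax cross-entropy and the two are added.

```python
import math


def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed with the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def spanbert_token_loss(mlm_logits, sbo_logits, target):
    """L(x_i) = L_MLM(x_i) + L_SBO(x_i) for a single masked token."""
    return cross_entropy(mlm_logits, target) + cross_entropy(sbo_logits, target)


# Toy vocabulary of 4 word pieces; the correct token has index 2.
mlm_logits = [0.2, -1.0, 3.1, 0.0]   # scores derived from the encoder output x_i
sbo_logits = [0.0, 0.4, 2.2, -0.3]   # scores derived from the SBO output y_i
print(spanbert_token_loss(mlm_logits, sbo_logits, target=2))
```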

3.3. Single-Sequence Training

Training on a single sequence performs better than training on two sequences with NSP.

Possible reasons:

(a) the model benefits from longer full-length contexts

or

(b) conditioning on, often unrelated, context from another document adds noise to the masked language model.

→ Therefore, the paper simply samples a single contiguous segment of up to 512 tokens (rather than combining two half-length segments to reach n tokens).

how to?
1. Divide the corpus into single contiguous blocks of up to 512 tokens.
2. At each step of pre-training:
(a) Sample a batch of blocks uniformly at random.

(b) Mask 15% of word pieces in each block in the batch using the span masking scheme (Section 3.1).

(c) For each masked token $x_i$, optimize $\mathcal{L}(x_i) = \mathcal{L}_{\text{MLM}}(x_i) + \mathcal{L}_{\text{SBO}}(x_i)$
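
Steps 1 and 2(a) of the procedure above might look like the sketch below. It assumes each document is already a list of word pieces and that blocks never cross a document boundary. The function names and sampling without replacement within a batch are choices of this sketch, not details stated in these notes.

```python
import random


def split_into_blocks(doc_tokens, max_len=512):
    """Cut one document into contiguous blocks of at most max_len tokens."""
    return [doc_tokens[i:i + max_len] for i in range(0, len(doc_tokens), max_len)]


def build_blocks(documents, max_len=512):
    """Step 1: blocks never span two documents, so the last block may be short."""
    blocks = []
    for doc in documents:
        blocks.extend(split_into_blocks(doc, max_len))
    return blocks


def sample_batch(blocks, batch_size, rng):
    """Step 2(a): sample blocks uniformly at random (without replacement here)."""
    return rng.sample(blocks, min(batch_size, len(blocks)))


# Two toy "documents" of 1200 and 300 token ids.
documents = [list(range(1200)), list(range(300))]
blocks = build_blocks(documents)
batch = sample_batch(blocks, batch_size=2, rng=random.Random(0))
print([len(b) for b in blocks], [len(b) for b in batch])
```

Steps 2(b) and 2(c) would then apply span masking to each block in `batch` and sum the MLM and SBO losses over the masked tokens.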

4. Experimental setup

4.1. Tasks

extractive question answering
- Given a short passage of text and a question as input, the task of extractive question answering is to select a contiguous span of text in the passage as the answer.
- SQuAD 1.1 & 2.0, MRQA, NewsQA, TriviaQA, HotpotQA, Natural Questions
coreference resolution
- Coreference resolution is the task of clustering mentions in text which refer to the same real-world entities.
- CoNLL-2012
relation extraction
- Given one sentence and two spans within it – subject and object – the task is to predict the relation between the spans from pre-defined relation types.
- TACRED
General Language Understanding Evaluation (GLUE) benchmark
- sentence-level classification (CoLA, SST-2), sentence-pair similarity (MRPC, STS-B, QQP), natural language inference (MNLI, QNLI, RTE, WNLI)
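
To illustrate what "selecting a contiguous span as the answer" means for the extractive QA task above, here is a common decoding sketch: given a start score and an end score per passage token, pick the pair (start, end) with the highest summed score. The scores, the function name and the `max_answer_len` cap are hypothetical. These notes do not state which decoding rule the paper's fine-tuning setup uses.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return ((start, end), score) maximizing start_scores[s] + end_scores[e]
    subject to s <= e < s + max_answer_len."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        last = min(s + max_answer_len, len(end_scores))
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score


# Toy passage of 4 tokens with made-up scores.
start_scores = [0.1, 2.0, 0.3, 0.0]
end_scores = [0.0, 0.5, 1.5, 3.0]
print(best_span(start_scores, end_scores))
```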

4.2. Implementation

The authors reimplemented the original BERT.

Based on the BERT-large model
Trained on BooksCorpus and English Wikipedia
Uses a cased WordPiece tokenizer

Two differences from the original BERT:

We use different masks at each epoch while BERT samples 10 different masks for each sequence during data processing.
We remove all the short-sequence strategies used before (they sampled shorter sequences with a small probability 0.1; they also first pre-trained with smaller sequence length of 128 for 90% of the steps). Instead, we always take sequences of up to 512 tokens until it reaches a document boundary.

In addition:

For the SBO, we use 200 dimension position embeddings
batch size of 256 sequences with a maximum of 512 tokens
AdamW, GeLU, dropout 0.1, etc.

4.3. Baselines

GoogleBERT
Our BERT: the BERT reimplementation described in 4.2
Our BERT-1seq: BERT trained on single full-length sequences without NSP (as described in 3.3)

5. Result

SpanBERT outperforms BERT on almost every task, especially on extractive question answering.
- On SQuAD 1.1 and 2.0, both EM and F1 improve by 3 to 5 points over Google BERT
- Scores also improve on all the other QA datasets
- Overall: Google BERT < Our BERT < Our BERT-1seq < SpanBERT
single-sequence training works considerably better than bi-sequence training with NSP.

6. Ablation Studies

6.1. Masking Schemes

That is, the paper's span masking vs. linguistically-informed masking

→ Five masking schemes are compared, differing in how tokens are masked

subword tokens - the original BERT scheme
whole words - for randomly chosen words, all of their subword tokens are masked. About 15% of subtokens are masked in total
named entities - 50% of the time a named entity in the text is masked, 50% of the time a random word
noun phrases - 50% of the time a noun phrase in the text is masked, 50% of the time a random word
geometric spans - the SpanBERT scheme

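→ Results

Geometric spans perform best in most cases, followed by noun phrases
For coreference, subword tokens perform best
However, the differences (F1) are small, roughly 0.3 to 1.6
The tasks that benefit most from geometric spans are, in order, QA > MNLI, QNLI; on GLUE and coreference the scores are actually lower
- The paper notes that although geometric spans are sometimes worse, combining them with SBO yields larger gains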

6.2. Auxiliary Objectives

That is, the paper's span boundary objective (SBO) vs. BERT's Next Sentence Prediction

→ Compares span masking (2seq) + NSP, span masking (1seq), and span masking (1seq) + SBO

→ Results

Overall, span masking (1seq) + SBO performs best (although the gain is small, roughly 0.6 to 1.5)
The paper notes: "Unlike the NSP objective, SBO does not appear to have any adverse effects."