- Attention mechanisms are widely used in NLP models
- They contribute to improved predictive performance and are often claimed to provide transparency
- However, the relationship between attention weights and model outputs has not actually been established
In this paper, extensive experiments are conducted across a variety of NLP tasks
→ The finding is that attention largely does not provide meaningful "explanations" for predictions
- e.g., learned attention weights are largely uncorrelated with gradient-based measures of feature importance, and very different attention distributions can be identified that nonetheless yield the same prediction
→ In other words, standard attention modules do not provide meaningful explanations !!
1. Introduction and motivation
- What are attention mechanisms?
- They induce a conditional distribution over input units and use it to compose a weighted context vector for downstream modules
- Attention is also taken to offer insight into a model's inner workings, i.e., it has been treated as an explanation of how a neural model works
- Attention is now a near-ubiquitous module in NLP
→ This belief presupposes that inputs with high attention weights are responsible for the model output
→ However, this premise has not been verified
→ If the premise were true, the following properties should hold
(i) Attention weights should correlate with feature importance measures (e.g., gradient-based measures)
(ii) Alternative (or counterfactual) attention weight configurations ought to yield corresponding changes in prediction (and if they do not, they are equally plausible as explanations).
→ The paper experiments with a standard attention mechanism on top of a BiLSTM encoder on tasks including QA and NLI
→ Empirically, neither (i) nor (ii) holds consistently
- Example in Figure 1
- Attention falls on different words, yet the prediction is the same
- The paper checks whether this holds across tasks by exploring the following empirical questions
- To what extent do induced attention weights correlate with measures of feature importance – specifically, those resulting from gradients and leave-one-out (LOO) methods? → Only weakly and inconsistently
- Would alternative attention weights (and hence distinct heatmaps/"explanations") necessarily yield different predictions? → No
→ That is,
- Adversarial attention distributions can be generated that attend to entirely different input features yet yield the same prediction as the originally induced attention weights
- Randomly permuting the attention weights also leaves the output nearly unchanged
- Attention weights from a simple feedforward (weighted-average) encoder fare somewhat better, i.e., they align more closely with feature importance measures, presumably because such an encoder does not entangle the inputs
2. Preliminaries and Assumptions
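The paper's preliminaries formalize the setup summarized above: a distribution over input units is induced from encoder hidden states and composed into a context vector. A minimal NumPy sketch, assuming additive (tanh) scoring; the function and variable names are illustrative, not the paper's:

```python
import numpy as np

def additive_attention(H, W, v):
    """Additive attention over encoder hidden states.

    H: (T, d) hidden states for T input tokens
    W: (d, d) projection matrix, v: (d,) scoring vector
    Returns the attention distribution alpha (T,) and the context vector (d,).
    """
    scores = np.tanh(H @ W) @ v             # one unnormalized score per token
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                    # softmax -> distribution over input units
    context = alpha @ H                     # weighted context vector fed to the downstream module
    return alpha, context

# Toy usage with random hidden states
rng = np.random.default_rng(0)
T, d = 5, 8
alpha, ctx = additive_attention(rng.normal(size=(T, d)),
                                rng.normal(size=(d, d)),
                                rng.normal(size=d))
print(alpha.round(3), ctx.shape)
```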
3. Datasets and Tasks
- binary text classification
- Stanford Sentiment Treebank (SST), IMDB Large Movie Reviews Corpus, Twitter Adverse Drug Reaction dataset, 20 Newsgroups, AG News Corpus (Business vs World), MIMIC ICD9 (Diabetes), MIMIC ICD9
- QA
- CNN News Articles, bAbI
- NLI
- SNLI dataset (with the labels neutral, contradiction, and entailment)
4. Experiments
- key questions
- Do learned attention weights agree with alternative, natural measures of feature importance?
- Had we attended to different features, would the prediction have been different?
4.1 Correlation Between Attention and Feature Importance Measures
- Metrics used
- Total Variation Distance (TVD) to measure the change between output distributions
- Jensen-Shannon Divergence (JSD) to quantify the difference between two attention distributions
- That is, individual feature importance is assessed by computing (1) the correlation between attention and gradient-based measures of feature importance ($\tau_g$), and (2) the correlation between attention and the differences in model output induced by leaving features out ($\tau_{loo}$); see the sketch below
→ Result: the correlations are generally weak and inconsistent, particularly with the BiLSTM encoder
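A hedged sketch of these quantities using SciPy; the toy numbers below are purely illustrative, not results from the paper:

```python
import numpy as np
from scipy.stats import kendalltau
from scipy.spatial.distance import jensenshannon

def tvd(p, q):
    """Total Variation Distance between two output distributions."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def jsd(p, q):
    """Jensen-Shannon Divergence between two attention distributions.
    SciPy's jensenshannon returns the JS *distance* (the square root), so square it."""
    return jensenshannon(p, q, base=2) ** 2

# Toy example: Kendall tau between attention and a feature-importance score
attention = np.array([0.05, 0.10, 0.60, 0.20, 0.05])  # learned attention weights
grad_imp  = np.array([0.30, 0.25, 0.10, 0.20, 0.15])  # e.g. gradient-based importance (or LOO output differences)
tau_g, _ = kendalltau(attention, grad_imp)
print(f"tau_g = {tau_g:.2f}, "
      f"TVD = {tvd([0.7, 0.3], [0.6, 0.4]):.2f}, "
      f"JSD = {jsd(attention, np.ones(5) / 5):.3f}")
```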
A. Model details (Appendix)
- BiLSTM, CNN, Average
4.2. Counterfactual Attention Weights
→ Would the prediction change if the model had attended to different input features?
4.2.1 Attention Permutation
→ we first consider randomly permuting observed attention weights and recording associated changes in model outputs !
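A minimal sketch of the permutation experiment, assuming a generic `output_fn` that maps the attention-weighted context vector to an output distribution (a stand-in for the trained decoder, not the paper's code); the change in output per permutation is summarized here as a median TVD:

```python
import numpy as np

def permutation_tvd(alpha, H, output_fn, n_perm=100, seed=0):
    """Randomly permute observed attention weights and record the change in model output.

    alpha: (T,) observed attention weights
    H: (T, d) encoder hidden states
    output_fn: maps a context vector (d,) to an output distribution (np.ndarray)
    Returns the median TVD between the original and permuted outputs.
    """
    rng = np.random.default_rng(seed)
    base = output_fn(alpha @ H)                       # output under the original attention
    tvds = []
    for _ in range(n_perm):
        permuted = rng.permutation(alpha)             # scramble which tokens receive weight
        tvds.append(0.5 * np.abs(output_fn(permuted @ H) - base).sum())
    return float(np.median(tvds))
```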
4.2.2 Adversarial Attention
→ We propose explicitly searching for “adversarial” attention weights that maximally differ from the observed attention weights (which one might show in a heatmap and use to explain a model prediction), and yet yield an effectively equivalent prediction !
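A sketch of the adversarial-attention idea: find weights that are maximally different (by JSD) from the observed ones while leaving the output essentially unchanged (TVD ≤ ε). The paper solves this with an explicit optimization; the brute-force random search below is only an illustrative stand-in, with `output_fn` again a hypothetical decoder:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def adversarial_attention(alpha, H, output_fn, eps=0.01, n_trials=5000, seed=0):
    """Search for an attention distribution far from `alpha` (in JSD) whose
    induced output stays within `eps` TVD of the original output."""
    rng = np.random.default_rng(seed)
    base = output_fn(alpha @ H)
    best, best_jsd = np.asarray(alpha, dtype=float), 0.0
    for _ in range(n_trials):
        cand = rng.dirichlet(np.ones(len(alpha)))                  # random point on the simplex
        if 0.5 * np.abs(output_fn(cand @ H) - base).sum() <= eps:  # output effectively unchanged
            div = jensenshannon(cand, alpha, base=2) ** 2          # divergence from observed weights
            if div > best_jsd:
                best, best_jsd = cand, div
    return best, best_jsd
```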
5. Related Work [omitted]
6. Discussion and Conclusions
- Findings
- Section 4.1 shows that the correlation between learned attention weights and intuitive feature importance measures (e.g., gradients, feature erasure approaches) is weak
- Section 4.2 shows that counterfactual attention distributions, which would imply a different rationale for the prediction, often leave the model output effectively unchanged
→ That is, attention modules do help improve performance on NLP tasks, but they do not provide meaningful explanations for model predictions
- This is especially true when a complex encoder is used, since inputs can become entangled in the hidden states (the encoder emits representations that may encode arbitrary interactions among the inputs)
- Moreover, attention weights do not always have a clear relationship with the particular disposition the model arrives at
- Limitations
- We do not intend to imply that such alternative measures (e.g., gradients) are necessarily ideal or that they should be considered 'ground truth'.
- One cannot conclude that the model made a particular prediction because it processed the input in a particular way
- we have only considered a handful of attention variants, selected to reflect common module architectures for the respective tasks included in our analysis.
- irrelevant features may be contributing noise to the Kendall τ measure, thus depressing this metric artificially.
- we have limited our evaluation to tasks with unstructured output spaces, i.e., we have not considered seq2seq tasks