Midannii's

* 간단하게 요약

새로운 English reading comprehension benchmark인 , DROP 을 제안
- which requires Discrete Reasoning Over the content of Paragraphs.
- DROP dataset은
  - SQuAD보다 긴 context를 사용하고, 더 open-domain임
  - paragraph에 대한 깊은 understanding, high-quality information extraction이 필요함
- numerical reasoning을 위해 만들어진 데이터
  - subtraction, comparison, selection, addition, count, coreferences and resolution, set of span 등등 다양한 reasoning
NAQANet 모델 제안
- 기존 QANet에서, output layer를 4가지 방법으로 변형한 모델임
  - 모델이 생성할 수 있는 4가지 다른 종류의 답변(passage span, question span, count, arithmetic expression, answer type prediction)을 위한 4가지 종류의 output layer

1. Introduction

We introduce a substantially more challenging English reading comprehension dataset aimed at pushing the field towards more comprehensive analysis of paragraphs of text.
- DROP data에서, system에는 문단과 질문이 주어지고, 정답을 얻기 위해 문단의 텍스트에 대해 Discrete Reasoning을 수행해야함
We focus on this type of questions because they force a structured analysis of the content of the paragraph that is detailed enough to permit reasoning.
- Our goal is to further paragraph understanding; 복잡한 질문을 통해 paragraph’s semantics에 대한 system의 이해를 테스트 가능.
DROP is also designed to further research on methods that combine distributed representations with symbolic, discrete reasoning.
We used three types of systems to judge baseline performance on DROP:

(1) heuristic baselines, to check for biases in the data;

(2) SQuAD-style reading comprehension methods;

(3) semantic parsers operating on a pipelined analysis of the passage.
Finally, we contribute a new model for this task that combines limited numerical reasoning with standard reading comprehension methods, allowing the model to answer questions involving counting, addition and subtraction.

fig

Question Answering Dataset
- SQuAD(Stanford Question Answering Dataset)
- …
- DROP dataset은 SQuAD보다 긴 context를 사용하고, 더 open-domain이며, paragraph에 대한 깊은 understanding이 필요함
Semantic parsing
Adversarial dataset contruction
Numerical Reasoning
- DROP is designed to encourage research on methods that combine neural methods with discrete, symbolic reasoning. We present one such model in Section 6. Other related work along these lines has been done by Reed and de Freitas (2016), Neelakantan et al. (2016), and Liang et al. (2017).

3. DROP Data Collection

complex question을 잘 따르는 Wikipedia passage를 자동으로 추출
- We searched Wikipedia for passages that had a narrative sequence of events, particularly with a high proportion of numbers, as our initial pilots indicated that these passages were the easiest to ask complex questions about. … we additionally sampled from any Wikipedia passage that contained at least twenty numbers. This process yielded a collection of about 7,000 passages.
1의 passage로 question-answer pair를 crowdsourcing
- semantic parsing literature(addition/subtraction, minimum/maximum, counting, selection, comparison)의 예시 질문을 worker에게 보여주며, linguistic understanding과 discrete reasoning이 필요한 질문을 하도록 함
- DROP 데이터의 질문의 난이도를 더욱 높이기 위해 실시간 QA 모델 BiDAF가 해결할 수 없는 질문만 worker가 제출가능 (novel adverserial annotation setting를 사용)
- 각 worker는 세가지 답변 유형 중 하나로 자신의 질문에 대답함
  - 세가지 유형;
    - spans of text from either question or passage
    - a date (which was com- mon in history and open-domain text)
    - numbers, allowed only for questions which explicitly stated a specific unit of measurement (e.g., “How many yards did Brady run?”), in an attempt to simplify the evaluation process.
→ 96,567개의 question-answer pair얻음
DROP의 validate와 test data를 나눔; ensure quality, report inter-annotator agreement

4. DROP Data Analysis

Question analysis

training, dev로부터 350개의 sample question 추출

→ 과연 이게 전체 데이터를 잘 표현할 수 있을까?
category distribution, trigram pattern

→ 분석 결과

huge variety of linguistic constructs
most frequent pattern는 4%만 나타남 (비교적 균등하게 분포되어 있음)
number type questions에서, 5 most frequent question patterns은 모두 How many 로 시작 & 산술 연산 해야함

Answer analysis

DROP의 질문에 대답하는데 필요한 passage understanding level을 식별하기 위해서, 350개의 질문의 대답에 필요한 set of spans in the passage에 annotation함
- 평균적으로 2.18 spans이 필요
  - average distance between these spans is 26 words
  - 20% of samples needing at least 3 spans
spacy의 품사 태깅, named entity recognizer로 answer distribution를 평가함

→ 대부분의 답변이 숫자 값과 고유 명사임을 확인함

5. Baseline system

DROP dataset을 evaluation하기 위한 baseline
2개의 evaluation metrics 사용
- Exact-Match
  - SQuAD에서 사용한 방식과 동일
    - article을 제거하고, simple normalization 사용
- numeracy-focused(macro-averaged) F1 score
  - representation와 predicted answer의 bag-of-words overlap을 측정함
  - SQuAD에서 사용한 것을 base로 함
word overlap과 무관하게, 숫자가 다르면 F1을 0으로 함
답변에 multiple span이 있는 경우, set of spans의 bag-of-word overlap을 바탕으로 하나씩 one-to-one alignment수행한 후, 각 span의 평균 F1 계산
multiple annotated answer의 경우, 모든 gold answer보다 큰 exact-match, F1 score 가짐

5.1. Semantic Parsing

natural language를 formal executable languages(e.g. SQL)로 바꾸는 데에 사용됨

sentence representation schemes
- Stanford dependencies
- Open Information Extraction
- Semantic Role Labeling
logical form language
training and inference

5.2. SQuAD-style Reading Comprehension

SQuAD-style reading comprehension model을 DROP에 test함

BiDAF
QANET
- SQuAD 1.1에서 제일 성능이 좋은 모델임
QANET + ELMO
- QANet의 embedding과 ELMO representation을 concat
BERT

5.3. Heuristic Baselines

QANet model을 활용하여 question-only, paragraph-only에 대해 artifacts를 estimate함

6. NAQANet

DROP은 MRC with symbolic reasoning를 위해 만들어졌지만, 이 기능을 수행할 수 있는 모델은 baseline 중 아무것도 없음

→ numerically-aware QANet model인 NAQANet을 제안함
NAQANet은 아래와 같은 3가지 종류의 answer type을 생성할 수 있음
- spans from the question
- counts
- addition or subtraction
숫자를 예측하기 위해, model은 먼저 답이 count인지 arithmetic expression인지 예측

→ 표현식과 관련된 특정 숫자를 예측
- 이것은 부분적으로 실행된 logical form을 생성하는 neural model이며, final arithmetic은 symbolic system에 맡겨짐
The model is trained by marginalizing over all execution paths that lead to the correct answer.

6.1. Model Description

기존 RC model에서의 typical architecture를 사용
QANet을 보완하여 만들어졌음
- output layer까지 QANet 사용
  - QANet
    - model architecture
- 모델이 생성할 수 있는 4가지 다른 종류의 답변(passage span, question span, count, arithmetic expression, answer type prediction)을 위한 4가지 종류의 output layer 가짐

6.2. Weakly-Supervised Training

semantic parsing literature에서 많이 사용되는 Weakly-Supervised Training을 사용함
We find all executions that evaluate to the correct answer, including matching passage spans and question spans, correct count numbers, as well as sign assignments for numbers. Our training objective is then to maximize the marginal likelihood of these executions.

7. Results and Discussion

Difficulties of building semantic parsers
- DROP에서 semantic parsing baselines의 성능이 낮음
  - 기존 semantic parsing baseline의 pipeline이 training data의 subset에서만 logical form을 생성할 수 있기 때문임
    - 기존 semantic parsing baseline의 pipeline은 paragraphs에서 표 형식의 정보를 추출함
    - 60개 질문의 샘플과 SRL 방식으로 추출한 정보(3개 중 가장 우수한 성능)를 자세히 조사한 결과, 표의 25%만이 질문에 답하는 데 필요한 정보가 포함되어 있음
  → 즉, DROP을 위해서는 high-quality information extraction이 필요함
Error Analysis
- The most common errors were on questions which required complex types of reasoning, such as arithmetic operations (evident in 51% of the errors), counting (30%), domain knowledge and common sense (23%), co-reference (6%), or a combination of different types of reasoning (40%).