
eladsegal / strategyqa

License: MIT
The official code of TACL 2021, "Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies".

Programming Languages

Python
139335 projects - #7 most used programming language
Jsonnet
166 projects

Projects that are alternatives of or similar to strategyqa

XORQA
This is the official repository for NAACL 2021, "XOR QA: Cross-lingual Open-Retrieval Question Answering".
Stars: ✭ 61 (+125.93%)
Mutual labels:  question-answering, open-domain-qa
GAR
Code and resources for papers "Generation-Augmented Retrieval for Open-Domain Question Answering" and "Reader-Guided Passage Reranking for Open-Domain Question Answering", ACL 2021
Stars: ✭ 38 (+40.74%)
Mutual labels:  question-answering, open-domain-qa
CONVEX
As far as we know, CONVEX is the first unsupervised method for conversational question answering over knowledge graphs. A demo and our benchmark (and more) can be found at
Stars: ✭ 24 (-11.11%)
Mutual labels:  question-answering
KitanaQA
KitanaQA: Adversarial training and data augmentation for neural question-answering models
Stars: ✭ 58 (+114.81%)
Mutual labels:  question-answering
FlowQA
Implementation of conversational QA model: FlowQA (with slight improvement)
Stars: ✭ 197 (+629.63%)
Mutual labels:  question-answering
ProQA
Progressively Pretrained Dense Corpus Index for Open-Domain QA and Information Retrieval
Stars: ✭ 44 (+62.96%)
Mutual labels:  question-answering
hf-experiments
Experiments with Hugging Face 🔬 🤗
Stars: ✭ 37 (+37.04%)
Mutual labels:  question-answering
lets-quiz
A quiz website for organizing online quizzes and tests. It's built using the Python/Django and Bootstrap 4 frameworks. 🤖
Stars: ✭ 165 (+511.11%)
Mutual labels:  question-answering
Instahelp
Instahelp is a Q&A portal website similar to Quora
Stars: ✭ 21 (-22.22%)
Mutual labels:  question-answering
golang-interview-questions
A collection of Golang interview questions
Stars: ✭ 42 (+55.56%)
Mutual labels:  question-answering
GrailQA
No description or website provided.
Stars: ✭ 72 (+166.67%)
Mutual labels:  question-answering
NS-CQA
NS-CQA: the model of the JWS paper 'Less is More: Data-Efficient Complex Question Answering over Knowledge Bases.' This work has been accepted by JWS 2020.
Stars: ✭ 19 (-29.63%)
Mutual labels:  question-answering
semanticilp
Question Answering as Global Reasoning over Semantic Abstractions (AAAI-18)
Stars: ✭ 33 (+22.22%)
Mutual labels:  question-answering
django-simple-forum
full featured forum, easy to integrate and use.
Stars: ✭ 65 (+140.74%)
Mutual labels:  question-answering
COVID19-IRQA
No description or website provided.
Stars: ✭ 32 (+18.52%)
Mutual labels:  question-answering
pair2vec
pair2vec: Compositional Word-Pair Embeddings for Cross-Sentence Inference
Stars: ✭ 62 (+129.63%)
Mutual labels:  question-answering
Question-Answering-System
A factoid based question answering system | Python | Flask | NLP
Stars: ✭ 0 (-100%)
Mutual labels:  question-answering
exams-qa
A Multi-subject High School Examinations Dataset for Cross-lingual and Multilingual Question Answering
Stars: ✭ 25 (-7.41%)
Mutual labels:  question-answering
HAR
Code for WWW2019 paper "A Hierarchical Attention Retrieval Model for Healthcare Question Answering"
Stars: ✭ 22 (-18.52%)
Mutual labels:  question-answering
Dynamic-Coattention-Network-for-SQuAD
Tensorflow implementation of DCN for question answering on the Stanford Question Answering Dataset (SQuAD)
Stars: ✭ 14 (-48.15%)
Mutual labels:  question-answering

Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

This repository contains the official code of the paper: "Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies", accepted for publication in Transactions of the Association for Computational Linguistics (TACL), 2021.

Citation

@article{geva2021strategyqa,
  title = {{Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies}},
  author = {Geva, Mor and Khashabi, Daniel and Segal, Elad and Khot, Tushar and Roth, Dan and Berant, Jonathan},
  journal = {Transactions of the Association for Computational Linguistics (TACL)},
  year = {2021},
}

The following are instructions to reproduce the experiments reported in the paper on the StrategyQA dataset.


Quick Links

  1. Setup
  2. Training
  3. Prediction and Evaluation
  4. Download Links to Our Trained Models

Setup

Requirements

Our experiments were conducted in a Python 3.7 environment. To clone the repository and set up the environment, please run the following commands:

git clone https://github.com/eladsegal/strategyqa.git
cd strategyqa
pip install -r requirements.txt

StrategyQA dataset files

The official StrategyQA dataset files with a detailed description of their format can be found on the dataset page.
To train our baseline models, we created a 90%/10% random split of the official train set to get an unofficial train/dev split: data/strategyqa/[train/dev].json.
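
For reference, a split of this kind takes only a few lines of Python. The snippet below is a minimal sketch and not necessarily how the released split was generated; the random seed and the input file name are assumptions.

import json
import random

# Minimal sketch of a 90%/10% random train/dev split. The seed and the
# input file name are assumptions, not the official preprocessing script.
with open("data/strategyqa/strategyqa_train.json") as f:
    examples = json.load(f)  # assumed: the official train set is a JSON list of records

random.seed(0)
random.shuffle(examples)

cut = int(0.9 * len(examples))
for name, subset in [("train", examples[:cut]), ("dev", examples[cut:])]:
    with open(f"data/strategyqa/{name}.json", "w") as f:
        json.dump(subset, f)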

(Optional) Creating an Elasticsearch index of our corpus

A download link to our full corpus of Wikipedia paragraphs is available on the dataset page. A script for indexing the paragraphs into Elasticsearch is available here.
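
For a rough idea of what the indexing step involves, below is an illustrative sketch using the official elasticsearch Python client. The index name, corpus file path, and corpus layout are assumptions; the linked script is the authoritative version.

import json
from elasticsearch import Elasticsearch, helpers

# Illustrative sketch only: index name, corpus path, and corpus layout are
# assumptions; see the linked indexing script for the actual details.
es = Elasticsearch("http://localhost:9200")

def actions(corpus_path, index_name="strategyqa_corpus"):
    with open(corpus_path) as f:
        corpus = json.load(f)  # assumed layout: paragraph ID -> paragraph record
    for para_id, para in corpus.items():
        yield {"_index": index_name, "_id": para_id, "_source": para}

helpers.bulk(es, actions("data/corpus.json"))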


Training

  • In the commands below, replace GPU with a single GPU ID, a comma-separated list of GPU IDs, or -1 to run on CPU.

  • Download links to our trained models are provided in Download Links to Our Trained Models.

RoBERTa*

RoBERTa* is a RoBERTa model, fine-tuned on auxiliary datasets, which we used as our base model when fine-tuning on StrategyQA. We trained RoBERTa* as follows:

  1. Download the twentyquestions dataset and extract it to data/, so that you have data/twentyquestions/twentyquestions-[train/dev].jsonl.

  2. Download the BoolQ dataset and extract it to data/, so that you have data/boolq/[train/dev].jsonl.

  3.  python run_scripts/train_RoBERTa_STAR.py -s OUTPUT_DIR -g "GPU"
    

    A trained RoBERTa* model can be found here.

Question Answering Models

The directory configs/strategy_qa contains configuration files for the question answering models described in the paper. To train a question answering model with a specific configuration, run the train.py script as follows:

python run_scripts/train.py --config-file configs/strategy_qa/CONFIG_NAME.jsonnet -s OUTPUT_DIR -g "GPU" -w [path to a RoBERTa* model (.tar.gz file)]

A trained model for each configuration can be found at https://storage.googleapis.com/ai2i/strategyqa/models/CONFIG_NAME.tar.gz,
and its evaluation scores on the dev set we used (see Setup) at https://storage.googleapis.com/ai2i/strategyqa/models/CONFIG_NAME.json.
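
For example, the following snippet fetches a trained model and prints its dev scores via the URL patterns above (5_STAR_IR-ORA-D is one of the configurations mentioned in the notes below):

import json
import urllib.request

# Fetch a trained model and its dev-set scores using the URL patterns above.
config_name = "5_STAR_IR-ORA-D"  # any configuration name from configs/strategy_qa
base = "https://storage.googleapis.com/ai2i/strategyqa/models"

urllib.request.urlretrieve(f"{base}/{config_name}.tar.gz", f"{config_name}.tar.gz")
with urllib.request.urlopen(f"{base}/{config_name}.json") as response:
    print(json.load(response))  # evaluation scores on the dev set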

Figures depicting the resource dependency of the training procedures can be found here.

Notes:

  • Configurations with "base" in their name are not runnable on their own.

  • Models that query the Elasticsearch server cannot get results for queries that are not already in data/queries_cache.json, unless an Elasticsearch server is set up and referred to in src/data/dataset_readers/utils/elasticsearch_utils.py (a sketch of this caching pattern appears after these notes). See Setup for details on creating an Elasticsearch index.

  • The config 4_STAR_IR-D.jsonnet is not trainable; it is used only for evaluating 5_STAR_IR-ORA-D.jsonnet with decompositions generated by BART-Decomp.
    It requires data/strategyqa/generated/bart_decomp_dev_predictions.jsonl; see Question Decomposition Model (BART-Decomp) to learn how to generate it. A dependency graph can be found here.
    To create an AllenNLP model archive for it, run the following:

    python tools/tar_to_tar.py [path to a 5_STAR_IR-ORA-D model (.tar.gz file)] configs/4_STAR_IR-D.jsonnet 4_STAR_IR-D.tar.gz
    
  • The config 8_STAR_ORA-P-D-last-step.jsonnet requires data/strategyqa/transformer_qa_ORA-P_[train/dev]_no_placeholders.json; see Iterative Answering of Decompositions to learn how to generate them. A dependency graph can be found here.
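
To make the caching note above concrete, here is a sketch of the pattern it describes. The cache layout (query string → retrieval results) and the index/field names are assumptions made for illustration; the repository's actual logic lives in src/data/dataset_readers/utils/elasticsearch_utils.py.

import json
from elasticsearch import Elasticsearch

CACHE_PATH = "data/queries_cache.json"

def cached_search(query, es=None):
    # Cache layout (query string -> list of hits) is an assumption.
    with open(CACHE_PATH) as f:
        cache = json.load(f)
    if query in cache:
        return cache[query]
    if es is None:
        raise RuntimeError("query is not cached and no Elasticsearch server is configured")
    # Index and field names below are assumptions for illustration.
    results = es.search(index="strategyqa_corpus", query={"match": {"content": query}})
    cache[query] = results["hits"]["hits"]
    with open(CACHE_PATH, "w") as f:
        json.dump(cache, f)
    return cache[query]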

Question Decomposition Model (BART-Decomp)

  1. Train the model:

    python run_scripts/train.py --config-file configs/decomposition/bart_decomp_strategy_qa.jsonnet -s OUTPUT_DIR -g "GPU"
    

    A trained model can be found here.

  2. Output predictions:

    python run_scripts/predict.py --model [path to a BART-Decomp model (.tar.gz file)] --data data/strategyqa/dev.json -g "GPU" --output-file data/strategyqa/generated/bart_decomp_dev_predictions.jsonl
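
The resulting bart_decomp_dev_predictions.jsonl is the file expected by 4_STAR_IR-D.jsonnet (see the notes above). A quick way to inspect it is sketched below; the field name used is an assumption about the prediction format, so check the actual JSONL keys first.

import json

# Peek at the first few generated decompositions. The field name
# "predicted_decomposition" is an assumption; inspect the JSONL keys first.
with open("data/strategyqa/generated/bart_decomp_dev_predictions.jsonl") as f:
    for line in list(f)[:5]:
        prediction = json.loads(line)
        print(prediction.get("predicted_decomposition", prediction))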
    

Iterative Answering of Decompositions

  1. Download the BoolQ dataset and extract it to data/, so that you have data/boolq/[train/dev].jsonl.

  2. Download the SQuAD 2.0 dataset and extract it to data/, so that you have data/squad_v2/[train/dev]-v2.0.json.

  3. Append BoolQ to SQuAD (see the sketch after this list for what --append-to effectively does):

    python -m tools.squadify_boolq data/boolq/train.jsonl data/squad_v2/squad_v2_boolq_dataset_train.json --append-to data/squad_v2/train-v2.0.json
    
    python -m tools.squadify_boolq data/boolq/dev.jsonl data/squad_v2/squad_v2_boolq_dataset_dev.json --append-to data/squad_v2/dev-v2.0.json
    
  4. Train a RoBERTa Extractive QA model on SQuAD and BoolQ:

    python run_scripts/train.py --config-file configs/squad/transformer_qa_large.jsonnet -s OUTPUT_DIR -g "GPU"
    

    A trained model can be found here.

  5. Replace the placeholders in the gold decompositions:

    python -m src.models.iterative.run_model -g [GPU (single only)] --qa-model-path [path to a transformer_qa_large model (.tar.gz file)] --paragraphs-source ORA-P --data data/strategyqa/train.json --output-predictions-file data/strategyqa/generated/transformer_qa_ORA-P_train_no_placeholders.json
    
    python -m src.models.iterative.run_model -g [GPU (single only)] --qa-model-path [path to a transformer_qa_large model (.tar.gz file)] --paragraphs-source ORA-P --data data/strategyqa/dev.json --output-predictions-file data/strategyqa/generated/transformer_qa_ORA-P_dev_no_placeholders.json
    

    This script supports different paragraph sources (IR-Q/ORA-P/IR-ORA-D/IR-D) and can also work on generated decompositions instead of the gold ones (use --generated-decompositions-paths).
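
As for what the --append-to flag in step 3 effectively does: both files are SQuAD-format JSON, so appending amounts to extending the "data" list of the target file. Below is a rough sketch under that assumption; the actual behavior is defined by tools/squadify_boolq.

import json

# Rough sketch of appending one SQuAD-format file to another; the actual
# behavior of --append-to is defined by tools/squadify_boolq.
def append_squad(addition_path, target_path):
    with open(addition_path) as f:
        addition = json.load(f)
    with open(target_path) as f:
        target = json.load(f)
    target["data"].extend(addition["data"])  # SQuAD files keep articles under "data"
    with open(target_path, "w") as f:
        json.dump(target, f)

append_squad("data/squad_v2/squad_v2_boolq_dataset_train.json",
             "data/squad_v2/train-v2.0.json")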


Prediction and Evaluation

The StrategyQA leaderboard is available here.

The official evaluation script can be found here.

Question Answering

  • Evaluate accuracy:
    python run_scripts/evaluate.py --model [path to a QA model (.tar.gz file)] --data DATA_PATH -g "GPU"
    
  • Output predictions:
    python run_scripts/predict.py --model [path to a QA model (.tar.gz file)] --data DATA_PATH -g "GPU" --output-file OUTPUT_PATH.jsonl
    

Notes:

  • The model created with the config 8_STAR_ORA-P-D-last-step.jsonnet should be run with data/strategyqa/transformer_qa_ORA-P_dev_no_placeholders.json as DATA_PATH, and not with data/strategyqa/dev.json like the other models, because it depends on using the last decomposition step without placeholders.

Recall@10

  1. Output the retrieved paragraphs for the configuration.
    The output format is a dictionary mapping each "qid" to a list of paragraph IDs.

    python ir_evaluation/get_paragraphs_by_config.py --config-file configs/CONFIG_NAME.jsonnet --output-file OUTPUT_PATH --data DATA_PATH
    
  2. python ir_evaluation/recall@10.py --data DATA_PATH --retrieved-paragraphs [OUTPUT_PATH from the previous step]
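
Given the retrieved-paragraphs file from step 1 (a dictionary mapping each "qid" to a list of paragraph IDs), the metric itself is straightforward. Below is a sketch that assumes the gold evidence has already been flattened into a qid → set-of-paragraph-IDs mapping; the script above handles the real evidence format.

import json

def recall_at_10(retrieved_path, gold):
    # `gold` is assumed to map qid -> set of gold paragraph IDs; flattening
    # StrategyQA's evidence annotations into this form is not shown here.
    with open(retrieved_path) as f:
        retrieved = json.load(f)  # qid -> list of paragraph IDs (step 1 output)
    scores = [
        len(set(paragraphs[:10]) & gold[qid]) / len(gold[qid])
        for qid, paragraphs in retrieved.items()
        if gold.get(qid)
    ]
    return sum(scores) / len(scores)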
    

Download Links to Our Trained Models

Trained models and their evaluation scores are available per configuration at https://storage.googleapis.com/ai2i/strategyqa/models/CONFIG_NAME.tar.gz and https://storage.googleapis.com/ai2i/strategyqa/models/CONFIG_NAME.json, respectively (see Training). Trained RoBERTa*, BART-Decomp, and RoBERTa Extractive QA models are linked in their respective sections above.