
Multilingual Extractive Reading Comprehension by Runtime Machine Translation

We introduce the first extractive RC systems for non-English languages, without using language-specific RC training data, but instead by using an English RC model and an attention-based Neural Machine Translation (NMT) model [1].
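At a high level, the runtime-MT pipeline translates the question and context into English, runs the English RC model, and projects the predicted answer span back to the source language. The sketch below illustrates that flow only; `translate`, `run_english_rc`, and `project_span` are hypothetical placeholders for the NMT model, the English RC model, and the attention-based span projection described in [1], not the repository's actual API.

```python
# Conceptual sketch of runtime-MT extractive RC.
# translate(), run_english_rc(), and project_span() are hypothetical
# placeholders, NOT the real models from this repository.

def translate(text, src_lang):
    """Placeholder NMT step: source-language text -> English.
    Here we just tag the text so the pipeline runs end to end;
    the real system uses an attention-based NMT model."""
    return f"[en]{text}"

def run_english_rc(question_en, context_en):
    """Placeholder English RC model: return an answer span
    (start, end) over the translated context."""
    return (0, min(4, len(context_en)))

def project_span(span_en, context_src):
    """Placeholder back-projection of the English answer span onto the
    original-language context (the paper uses NMT attention weights)."""
    start, end = span_en
    return context_src[start:end]

def answer(question_src, context_src, src_lang="fr"):
    question_en = translate(question_src, src_lang)
    context_en = translate(context_src, src_lang)
    span_en = run_english_rc(question_en, context_en)
    return project_span(span_en, context_src)
```

The point of the sketch is the data flow: no language-specific RC training data is needed, only an NMT model for the source language and an English RC model.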

Overview

Contents

  1. Code
  2. Datasets
  3. Benchmarks
  4. Reference
  5. Contact

Code

We implemented our NMT and extractive RC models (BiDAF, BiDAF + Self Attention + ELMo) in PyTorch.

Installation

The installation steps are as follows:

git clone https://github.com/AkariAsai/extractive_rc_by_runtime_mt.git
cd extractive_rc_by_runtime_mt
pip install -r requirements.txt

For preprocessing, we used Stanford CoreNLP for English, mosesdecoder's scripts for French, and MeCab for Japanese.
In our implementation, we call Stanford CoreNLP through py-corenlp, the Moses tokenizer through its original perl scripts, and MeCab through mecab-python3.

For English

For py-corenlp, first make sure the Stanford CoreNLP server is running; see its setup instructions.

For Japanese

For mecab-python3, we use the NEologd (neologism) dictionary for MeCab instead of the default dictionary.
Please install the dictionary beforehand by following the instructions on its official page.

For French

You can install mosesdecoder's tokenizer by following the commands listed below.

wget https://github.com/moses-smt/mosesdecoder/archive/master.zip
unzip master.zip # files are to be extracted under extractive_rc_by_runtime_mt/mosesdecoder-master/
rm master.zip

For easier setup, we plan to replace the tokenization code with mosestokenizer, a Python wrapper for the Moses tokenizer, sometime soon.
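To illustrate what Moses-style tokenization does, here is a deliberately simplified pure-Python sketch: it only pads punctuation with spaces and splits on whitespace, whereas the real tokenizer.perl handles abbreviations, URLs, and language-specific rules. It is an illustration, not a substitute for the Moses scripts.

```python
import re

def moses_like_tokenize(text):
    """Very simplified stand-in for Moses tokenization: pad common
    punctuation with spaces, then split on whitespace. The real
    tokenizer.perl covers many more cases (abbreviations, URLs, etc.)."""
    text = re.sub(r"([.,!?;:()\"])", r" \1 ", text)
    return text.split()
```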

Train

For details on training the RC and NMT models, see the README.md under the rc directory and the README.md under the nmt directory.

Evaluate

To run the evaluation on the multilingual SQuAD datasets, you first need to train your own RC and NMT models, or use pre-trained models.
You can download pretrained models from Google Drive.

For example, you can evaluate on French SQuAD with the pre-trained models by following the instructions below.

  1. Create a params directory right under the home directory.
  2. Download the zipped files fren_nmt_params.tar.gz and rc_params.tar.gz into the extractive_rc_by_runtime_mt/params directory, and extract the necessary files.
tar -xzf params/fren_nmt_params.tar.gz && rm -f params/fren_nmt_params.tar.gz
tar -xzf params/rc_params.tar.gz && rm -f params/rc_params.tar.gz
  3. Run the evaluation with the command below.
cd rc
python main.py evaluate_mlqa ../params/rc_params \
--evaluation-data-file https://s3-us-west-2.amazonaws.com/allennlp/datasets/squad/squad-dev-v1.1.json \
--unziped_archive_directory ../params/rc_params --elmo \
--trans_embedding_model ../params/fren_nmt_params/embedding.bin  \
--trans_encdec_model  ../params/fren_nmt_params/encdec.bin \
--trans_train_source ../params/fren_nmt_params/train_fr_lower_1.txt \
--trans_train_target ../params/fren_nmt_params/train_en_lower_1.txt \
-l Fr -v5 --beam

For the details of the command line options, see rc/evaluate_mlqa.py.

Datasets

We provide two datasets: (1) multilingual SQuAD datasets (Japanese and French) and (2) {Japanese, French}-to-English bilingual corpora used to train the NMT models for our extractive RC system.

Multilingual SQuAD Datasets

The Japanese and French datasets were created by manually translating the original SQuAD (v1.1) development set into Japanese and French.
Each dataset contains 327 questions.
More details can be found in Section 3.3 (SQuAD Test Data for Japanese and French) of [1].

Multilingual SQuAD Datasets
Japanese japanese_squad.json
French french_squad.json
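Since these files follow the SQuAD v1.1 JSON layout, a minimal loader can be written with the standard library alone. This sketch assumes the standard data -> paragraphs -> qas -> answers nesting of SQuAD-format files; `load_squad_questions` is an illustrative helper, not part of this repository.

```python
import json

def load_squad_questions(path):
    """Load a SQuAD-v1.1-format file (e.g. french_squad.json) and
    return (question, first-answer-text) pairs. Assumes the standard
    data -> paragraphs -> qas -> answers nesting."""
    with open(path, encoding="utf-8") as f:
        squad = json.load(f)
    pairs = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                pairs.append((qa["question"], qa["answers"][0]["text"]))
    return pairs
```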

{Japanese, French}-to-English bilingual Corpora

Wikipedia-based {Japanese, French}-to-English bilingual corpora

To train the NMT model for specific language directions, we take advantage of constantly growing web resources to automatically construct parallel corpora, rather than assuming the availability of high quality parallel corpora of the target domain.
We constructed bilingual corpora from Wikipedia articles, using its inter-language links and hunalign, a sentence-level aligner.
More details can be found in Supplementary Material Section A (Details of Wikipedia-based Bilingual Corpus Creation) of [1].
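Sentence-level aligners such as hunalign score candidate sentence pairs, classically starting from length statistics. The sketch below shows only that length-based idea in a much cruder form than hunalign (which also uses dictionaries and allows 1-2/2-1 merges); both functions are illustrative and not part of this repository.

```python
def length_ratio_score(src_sent, tgt_sent):
    """Crude length-based alignment score: sentence pairs of similar
    character length score close to 1.0, dissimilar pairs near 0.0."""
    a, b = len(src_sent), len(tgt_sent)
    return min(a, b) / max(a, b) if max(a, b) else 0.0

def align_one_to_one(src_sents, tgt_sents, threshold=0.5):
    """Keep i-th source/target pairs whose lengths roughly match.
    Real aligners like hunalign also use bilingual dictionaries and
    consider merged (1-2, 2-1) alignments."""
    return [
        (s, t)
        for s, t in zip(src_sents, tgt_sents)
        if length_ratio_score(s, t) >= threshold
    ]
```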

To download preprocessed Wikipedia-based {Japanese, French}-to-English bilingual corpora, please run the commands below.

cd datasets
wget http://www.hal.t.u-tokyo.ac.jp/~asai/datasets/extractive_rc_by_runtime_mt/wiki_corpus.tar.gz
tar -xvf wiki_corpus.tar.gz && rm -f wiki_corpus.tar.gz
cd ..

The downloaded Wikipedia-based corpora will be placed under the datasets/wiki_corpus folder.
The training data contains 1,000,000 sentence pairs, and the development data contains 2,000 pairs.

Manually translated question sentences

In our experiment, we also found that adding a small number of manually translated question sentences could further improve the extractive RC performance.
Here, we also provide the translated question sentences we actually used to train our NMT models.
The details of the creation of these small parallel question datasets can be found in Supplementary Material Section C (Details of Manually Translated SQuAD Dataset Questions Creation) of [1].

question sentences
Japanese questions.ja, questions.en
French questions.fr, questions.en

Benchmarks

We provide the results of our proposed method on multilingual SQuAD datasets.

  • Japanese
    Method                      F1      EM
    Our Method                  52.19   37.00
    Back-translation baseline   42.60   24.77
  • French
    Method                      F1      EM
    Our Method                  61.88   40.67
    Back-translation baseline   44.02   23.54
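F1 and EM above are the standard SQuAD metrics. They can be computed per question as follows; this is a simplified version of the official SQuAD evaluation logic, without the official script's normalization of articles and punctuation.

```python
from collections import Counter

def exact_match(prediction, ground_truth):
    """EM: 1.0 if the strings match after stripping and lowercasing."""
    return float(prediction.strip().lower() == ground_truth.strip().lower())

def token_f1(prediction, ground_truth):
    """Token-level F1 over whitespace tokens, as in SQuAD evaluation
    (the official script additionally strips articles/punctuation)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Corpus-level F1 and EM are the averages of these per-question scores over the dataset (taking the max over gold answers when a question has several).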

Reference

Please cite [1] if you find the resources in this repository useful.

[1] Akari Asai, Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2018. "Multilingual Extractive Reading Comprehension by Runtime Machine Translation".

Contact

Please direct any questions to Akari Asai at [email protected].
