
ymcui / Cross-Lingual-MRC

License: Apache-2.0
Cross-Lingual Machine Reading Comprehension (EMNLP 2019)

Programming Languages

Python

Projects that are alternatives of or similar to Cross-Lingual-MRC

cdQA-ui
⛔ [NOT MAINTAINED] A web interface for cdQA and other question answering systems.
Stars: ✭ 19 (-71.21%)
Mutual labels:  reading-comprehension, bert
text2text
Text2Text: Cross-lingual natural language processing and generation toolkit
Stars: ✭ 188 (+184.85%)
Mutual labels:  bert, cross-lingual
les-military-mrc-rank7
LES Cup (莱斯杯): Rank 7 solution for the 2nd national "Military Intelligence Machine Reading" challenge
Stars: ✭ 37 (-43.94%)
Mutual labels:  reading-comprehension, bert
Awesome Sentence Embedding
A curated list of pretrained sentence and word embedding models
Stars: ✭ 1,973 (+2889.39%)
Mutual labels:  bert, cross-lingual
AiSpace
AiSpace: Better practices for deep learning model development and deployment For Tensorflow 2.0
Stars: ✭ 28 (-57.58%)
Mutual labels:  bert, cmrc2018
cmrc2019
A Sentence Cloze Dataset for Chinese Machine Reading Comprehension (CMRC 2019)
Stars: ✭ 118 (+78.79%)
Mutual labels:  reading-comprehension, bert
exams-qa
A Multi-subject High School Examinations Dataset for Cross-lingual and Multilingual Question Answering
Stars: ✭ 25 (-62.12%)
Mutual labels:  reading-comprehension, cross-lingual
backprop
Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
Stars: ✭ 229 (+246.97%)
Mutual labels:  bert
KitanaQA
KitanaQA: Adversarial training and data augmentation for neural question-answering models
Stars: ✭ 58 (-12.12%)
Mutual labels:  bert
Sohu2019
2019 Sohu Campus Algorithm Competition
Stars: ✭ 26 (-60.61%)
Mutual labels:  bert
oreilly-bert-nlp
This repository contains code for the O'Reilly Live Online Training for BERT
Stars: ✭ 19 (-71.21%)
Mutual labels:  bert
FasterTransformer
Transformer related optimization, including BERT, GPT
Stars: ✭ 1,571 (+2280.3%)
Mutual labels:  bert
SA-BERT
CIKM 2020: Speaker-Aware BERT for Multi-Turn Response Selection in Retrieval-Based Chatbots
Stars: ✭ 71 (+7.58%)
Mutual labels:  bert
ganbert
Enhancing the BERT training with Semi-supervised Generative Adversarial Networks
Stars: ✭ 205 (+210.61%)
Mutual labels:  bert
PromptPapers
Must-read papers on prompt-based tuning for pre-trained language models.
Stars: ✭ 2,317 (+3410.61%)
Mutual labels:  bert
DiscEval
Discourse Based Evaluation of Language Understanding
Stars: ✭ 18 (-72.73%)
Mutual labels:  bert
mixed-language-training
Attention-Informed Mixed-Language Training for Zero-shot Cross-lingual Task-oriented Dialogue Systems (AAAI-2020)
Stars: ✭ 29 (-56.06%)
Mutual labels:  cross-lingual
gender-unbiased BERT-based pronoun resolution
Source code for the ACL workshop paper and Kaggle competition by Google AI team
Stars: ✭ 42 (-36.36%)
Mutual labels:  bert
JointIDSF
BERT-based joint intent detection and slot filling with intent-slot attention mechanism (INTERSPEECH 2021)
Stars: ✭ 55 (-16.67%)
Mutual labels:  bert
banglabert
This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla" accepted in Findings of the Annual Conference of the North American Chap…
Stars: ✭ 186 (+181.82%)
Mutual labels:  bert

Chinese (中文说明) | English

Cross-Lingual Machine Reading Comprehension

This repository contains the resources for the following EMNLP-IJCNLP 2019 paper.

Title: Cross-Lingual Machine Reading Comprehension
Authors: Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, Guoping Hu
Link: https://www.aclweb.org/anthology/D19-1169/


Directory Guide

root_directory
    |- src    # contains the source code for Dual BERT
    |- data   # Chinese-English (translated) bilingual datasets

Requirements

Python 2.7  
TensorFlow 1.12  

Performance

CMRC 2018

Model                  | Dev (EM / F1) | Test (EM / F1) | Challenge (EM / F1)
Dual BERT (w/o SQuAD)  | 65.8 / 86.3   | 70.4 / 88.1    | 23.8 / 47.9
Dual BERT (w/ SQuAD)   | 68.0 / 88.1   | 73.6 / 90.2    | 27.8 / 55.2

DRCD

Model                  | Dev (EM / F1) | Test (EM / F1)
Dual BERT (w/o SQuAD)  | 84.5 / 90.8   | 83.7 / 90.3
Dual BERT (w/ SQuAD)   | 86.0 / 92.1   | 85.4 / 91.6

Data

We provide machine-translated (English) versions of CMRC 2018 and DRCD for future studies. The translation was done with the Google Neural Machine Translation (GNMT) engine, which achieves an average BLEU of 43.24 on MT02~MT08, compared with the previous best result of 43.20 (Cheng et al., 2018). Note that GNMT is an evolving system and its translation quality keeps improving, so if you need better translations, you may have to translate the CMRC 2018 / DRCD datasets yourself using GNMT or another MT system.

Please check the data folder for these files.

System                          | MT02  | MT03  | MT04  | MT05  | MT06  | MT08  | Average
ASTfeature (Cheng et al., 2018) | 46.10 | 44.07 | 45.61 | 44.06 | 44.44 | 34.94 | 43.20
GNMT (March 25, 2019)           | 46.26 | 43.40 | 44.17 | 44.14 | 43.86 | 37.61 | 43.24
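
As a quick sanity check, the Average column above is consistent with the plain arithmetic mean of the six listed test sets (MT07 is not shown in the table). A minimal Python verification:

scores = {
    "ASTfeature (Cheng et al., 2018)": [46.10, 44.07, 45.61, 44.06, 44.44, 34.94],
    "GNMT (March 25, 2019)":           [46.26, 43.40, 44.17, 44.14, 43.86, 37.61],
}
for system, bleu in scores.items():
    # Mean over MT02-MT06 and MT08, rounded to two decimals.
    print(system, round(sum(bleu) / len(bleu), 2))  # 43.2 and 43.24 respectively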

Example:

{
  "version": "v1.0", 
  "data": [
    {
      "paragraphs": [
        {
          "id": "DEV_0", 
       	  "context": "《战国无双3》()是由光荣和ω-force开发的战国无双系列的正统第三续作。本作以三大故事为主轴,分别是以武田信玄等人为主的《关东三国志》,织田信长等人为主的《战国三杰》,石田三成等人为主的《关原的年轻武者》,丰富游戏内的剧情。此部份专门介绍角色,欲知武器情报、奥义字或擅长攻击类型等,请至战国无双系列1.由于乡里大辅先生因故去世,不得不寻找其他声优接手。从猛将传 and Z开始。2.战国无双 编年史的原创男女主角亦有专属声优。此模式是任天堂游戏谜之村雨城改编的新增模式。本作中共有20张战场地图(不含村雨城),后来发行的猛将传再新增3张战场地图。但游戏内战役数量繁多,部分地图会有兼用的状况,战役虚实则是以光荣发行的2本「战国无双3 人物真书」内容为主,以下是相关介绍。(注:前方加☆者为猛将传新增关卡及地图。)合并本篇和猛将传的内容,村雨城模式剔除,战国史模式可直接游玩。主打两大模式「战史演武」&「争霸演武」。系列作品外传作品", 
          "trans_context": "\"Sengoku Musou 3\" () is the third sequel of the Warring States unparalleled series developed by Glory and ω-force. This book is based on the three major stories, namely, \"Kanto Three Kingdoms\", which is dominated by Takeda Shingen, and others, \"Sengoku Sanjie\", which is dominated by Oda Nobunaga, and \"The Young Warrior of Guanyuan\", which is dominated by Ishida Sansei. , enrich the story inside the game. This section is devoted to the role, for weapons information, esoteric words or good at attack types, please go to the Warring States unparalleled series 1. Because Mr. Daisuke of the town died of death, he had to find other seiyuu to take over. Starting from the fierce will pass and Z. 2. The original male and female protagonists of the Warring States Muscular Chronicle also have exclusive seiyuu. This mode is a new model adapted from the Nintendo game puzzle village Yucheng. There are a total of 20 battlefield maps in this work (excluding the village Yucheng), and the later release of the military will add three more battlefield maps. However, there are a large number of in-game campaigns, and some maps will be used in combination. The battles are based on the contents of two \"Sengoku Musou 3 Characters\" published in honor. The following are related introductions. (Note: Adding ☆ to the front will add new levels and maps.) Combine this and the content of the slamming, the village rain city mode is removed, the Warring States history mode can be played directly. The main two modes are \"war history\" and \"battle play\". Series of works",
          "qas": [
            {
              "question": "《战国无双3》是由哪两个公司合作开发的?", 
              "trans_question": "Which two companies are jointly developed by \"Warring States Warriors 3\"?",
              "id": "DEV_0_QUERY_0", 
              "answers": [
                {
                  "text": "光荣和ω-force", 
                  "trans_text": "Glorious and ω-force", 
                  "answer_start": 11, 
                  "trans_aligned_text": "and ω-force", 
                  "trans_aligned_start": 102
                }, 
                {
                  "text": "光荣和ω-force", 
                  "trans_text": "Glorious and ω-force", 
                  "answer_start": 11, 
                  "trans_aligned_text": "and ω-force", 
                  "trans_aligned_start": 102
                }, 
                {
                  "text": "光荣和ω-force", 
                  "trans_text": "Glorious and ω-force", 
                  "answer_start": 11, 
                  "trans_aligned_text": "and ω-force", 
                  "trans_aligned_start": 102
                }
              ]
            }]
        }]
    }]
}
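
For reference, here is a minimal loading sketch (written for Python 3, while the training code itself targets Python 2.7 / TF 1.12). It reads one of the aligned files referenced in the Usage section below and checks the answer alignment, assuming trans_aligned_start is a character offset into trans_context, analogous to answer_start in context:

import json

# cmrc2018_dev_aligned.json is one of the aligned files shipped in the data folder.
with open("data/cmrc2018_dev_aligned.json", encoding="utf-8") as f:
    dataset = json.load(f)

for article in dataset["data"]:
    for paragraph in article["paragraphs"]:
        trans_context = paragraph["trans_context"]
        for qa in paragraph["qas"]:
            print(qa["id"], qa["question"], "->", qa["trans_question"])
            for ans in qa["answers"]:
                # Assumption: trans_aligned_start indexes into trans_context
                # the same way answer_start indexes into context.
                start, text = ans["trans_aligned_start"], ans["trans_aligned_text"]
                if trans_context[start:start + len(text)] != text:
                    print("  alignment mismatch in", qa["id"])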

Usage

You may use a different hyper-parameter set to fit your computing device, but further tuning may be required, especially of learning_rate and num_train_epochs.

Pre-training with SQuAD

For all experiments marked with w/ SQuAD, we first train multilingual BERT on the SQuAD 1.1 training set and then further fine-tune it on the downstream tasks.
SQuAD data: https://worksheets.codalab.org/worksheets/0x62eefc3e64e04430a1a24785a9293fff
Multilingual BERT: https://github.com/google-research/bert/blob/master/multilingual.md
Training script (please use the official BERT run_squad.py script for training):

VOCAB_FILE=YOUR_PATH_TO_BERT/vocab.txt
BERT_CONFIG_FILE=YOUR_PATH_TO_BERT/bert_config.json
INIT_CKPT=YOUR_PATH_TO_BERT/bert_model.ckpt
DATA_DIR=YOUR_PATH_TO_DATA
MODEL_DIR=YOUR_PATH_TO_MODEL_DIR
TPU_NAME=tpu-v2-0
TPU_ZONE=us-central1-f

python run_squad.py \
  --vocab_file=${VOCAB_FILE} \
  --bert_config_file=${BERT_CONFIG_FILE} \
  --init_checkpoint=${INIT_CKPT} \
  --do_train=True \
  --train_file=${DATA_DIR}/train-v1.1.json \
  --do_predict=True \
  --predict_file=${DATA_DIR}/dev-v1.1.json \
  --train_batch_size=64 \
  --predict_batch_size=64 \
  --num_train_epochs=3.0 \
  --max_seq_length=512 \
  --doc_stride=128 \
  --learning_rate=3e-5 \
  --version_2_with_negative=False \
  --output_dir=${MODEL_DIR} \
  --do_lower_case=False \
  --use_tpu=True \
  --tpu_name=${TPU_NAME} \
  --tpu_zone=${TPU_ZONE}

After training, the performance on the development set should be in the range of 81.5~82.1 (EM) and 88.8~89.2 (F1).
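
For the w/ SQuAD settings below, the downstream fine-tuning presumably starts from the SQuAD-trained checkpoint written to MODEL_DIR above, rather than from the original multilingual BERT weights. A small sketch (TF 1.x API, per the Requirements) for locating the checkpoint prefix to pass as --init_checkpoint:

import tensorflow as tf

# Placeholder path; replace with the MODEL_DIR used in the SQuAD run above.
squad_model_dir = "YOUR_PATH_TO_MODEL_DIR"
ckpt = tf.train.latest_checkpoint(squad_model_dir)
print(ckpt)  # e.g. ".../model.ckpt-<step>"; use this prefix as --init_checkpoint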

CMRC 2018

Note that you have to submit your model to the CMRC 2018 organizers to obtain scores on the test and challenge sets.

VOCAB_FILE=YOUR_PATH_TO_BERT/vocab.txt
BERT_CONFIG_FILE=YOUR_PATH_TO_BERT/bert_config.json
INIT_CKPT=YOUR_PATH_TO_BERT/bert_model.ckpt
DATA_DIR=YOUR_PATH_TO_DATA
MODEL_DIR=YOUR_PATH_TO_MODEL_DIR
TPU_NAME=tpu-v2-1
TPU_ZONE=us-central1-f
rnd=12345   # random seed used by --rand_seed below (example value; any integer works)

python run_clmrc.py \
  --vocab_file=${VOCAB_FILE} \
  --bert_config_file=${BERT_CONFIG_FILE} \
  --init_checkpoint=${INIT_CKPT} \
  --do_train=True \
  --train_file=${DATA_DIR}/cmrc2018_train_aligned.json \
  --do_predict=True \
  --predict_file=${DATA_DIR}/cmrc2018_dev_aligned.json \
  --train_batch_size=64 \
  --predict_batch_size=64 \
  --num_train_epochs=2 \
  --max_seq_length=512 \
  --max_answer_length=40 \
  --doc_stride=128 \
  --learning_rate=2e-5 \
  --save_checkpoints_steps=1000 \
  --rand_seed=$rnd \
  --do_lower_case=False \
  --output_dir=${MODEL_DIR} \
  --use_tpu=True \
  --tpu_name=${TPU_NAME} \
  --tpu_zone=${TPU_ZONE}

DRCD

VOCAB_FILE=YOUR_PATH_TO_BERT/vocab.txt
BERT_CONFIG_FILE=YOUR_PATH_TO_BERT/bert_config.json
INIT_CKPT=YOUR_PATH_TO_BERT/bert_model.ckpt
DATA_DIR=YOUR_PATH_TO_DATA
MODEL_DIR=YOUR_PATH_TO_MODEL_DIR
TPU_NAME=tpu-v2-2
TPU_ZONE=us-central1-f
rnd=12345   # random seed used by --rand_seed below (example value; any integer works)

python run_clmrc.py \
  --vocab_file=${VOCAB_FILE} \
  --bert_config_file=${BERT_CONFIG_FILE} \
  --init_checkpoint=${INIT_CKPT} \
  --do_train=True \
  --train_file=${DATA_DIR}/DRCD_training_aligned.json \
  --do_predict=True \
  --predict_file=${DATA_DIR}/DRCD_dev_aligned.json \
  --train_batch_size=64 \
  --predict_batch_size=64 \
  --num_train_epochs=3 \
  --max_seq_length=512 \
  --max_answer_length=30 \
  --doc_stride=128 \
  --learning_rate=2e-5 \
  --save_checkpoints_steps=1000 \
  --rand_seed=$rnd \
  --do_lower_case=False \
  --output_dir=${MODEL_DIR} \
  --use_tpu=True \
  --tpu_name=${TPU_NAME} \
  --tpu_zone=${TPU_ZONE}

Citation

If you use the data or codes in this repository, please cite our paper.

@inproceedings{cui-emnlp2019-clmrc,
    title = "Cross-Lingual Machine Reading Comprehension",
    author = "Cui, Yiming  and
      Che, Wanxiang  and
      Liu, Ting  and
      Qin, Bing  and
      Wang, Shijin  and
      Hu, Guoping",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1169",
    doi = "10.18653/v1/D19-1169",
    pages = "1586--1595",
}

Acknowledgement

We would like to thank Google TensorFlow Research Cloud (TFRC) Program for partially supporting this research.

Issues

If there is any problem, please submit a GitHub Issue.
