ymcui / Chinese Rc Datasets

Licence: cc-by-sa-4.0
Collections of Chinese reading comprehension datasets

Chinese Machine Reading Comprehension Datasets

Note that this repository is updated irregularly.

If you find this repository helpful, please press the star button. If you would like to use or repost the content of this repository, please credit the original author and include a link to the source.

Content

| Section | Description |
| --- | --- |
| Chinese Reading Comprehension Datasets | Describes public Chinese RC datasets |
| State-of-the-art Systems | State-of-the-art systems and results |
| Chinese Reading Comprehension Evaluations and Competitions | Introductions to Chinese RC competitions |

Chinese Reading Comprehension Datasets

Here I list several Chinese reading comprehension datasets that are PUBLICLY available (each with an accompanying technical report or paper). If I have missed something, feel free to let me know. Unless otherwise indicated, the datasets are in simplified Chinese.

| Dataset | Genre | Query Type | Answer Type | Document # | Query # | Download |
| --- | --- | --- | --- | --- | --- | --- |
| People Daily & Children's Fairy Tale [1] | news & tale | Cloze | word | 28K | 100K | link |
| WebQA [2] | Web | User log | entity | - | 42K | link |
| CMRC 2017 [3] | news | Cloze & Query | word | - | 364K | link |
| DuReader [4] | Web | User log | free form | 1M | 200K | link |
| CMRC 2018 [5] | Wiki | Query | Span | - | 18K | link |
| DRCD [6] (traditional Chinese) | Wiki | Query | Span | - | 34K | link |
| C^3 [7] | mixed | Query | choice | 14K | 24K | link |
| CMRC 2019 [8] | Story | cloze | Sentence | 1K | 100K | link |
| ChID [9] | varies | cloze | idiom | 580K | 729K | link |

[1] (Cui et al., 2016) Consensus Attention-based Neural Networks for Chinese Reading Comprehension. In COLING 2016. https://aclanthology.info/papers/C16-1167/c16-1167

[2] (Li et al., 2016) Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering. In arXiv. https://arxiv.org/abs/1607.06275

[3] (Cui et al., 2018) Dataset for the First Evaluation on Chinese Machine Reading Comprehension. In LREC 2018. http://www.lrec-conf.org/proceedings/lrec2018/summaries/32.html

[4] (He et al., 2018) DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications. In ACL 2018 MRQA Workshop. https://aclanthology.info/papers/W18-2605/w18-2605

[5] (Cui et al., 2018) A Span-Extraction Dataset for Chinese Machine Reading Comprehension. In arXiv. https://arxiv.org/abs/1810.07366

[6] (Shao et al., 2018) DRCD: a Chinese Machine Reading Comprehension Dataset. In arXiv. https://arxiv.org/abs/1806.00920

[7] (Sun et al., 2019) Probing Prior Knowledge Needed in Challenging Chinese Machine Reading Comprehension. https://arxiv.org/abs/1904.09679

[8] (Cui et al., 2019) https://github.com/ymcui/cmrc2019

[9] (Zheng et al., 2019) ChID: A Large-scale Chinese IDiom Dataset for Cloze Test. https://aclweb.org/anthology/papers/P/P19/P19-1075/
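
As a rough illustration of how the dataset table above could be used programmatically, the following sketch stores its rows as records and filters them by answer type. The `Dataset` class, `parse_count`, and `by_answer_type` are hypothetical names introduced here for illustration; the values are copied from the table, and "-" is treated as "not reported".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Dataset:
    name: str
    genre: str
    query_type: str
    answer_type: str
    documents: Optional[int]  # None where the table shows "-"
    queries: Optional[int]


def parse_count(text: str) -> Optional[int]:
    """Convert table counts such as '28K' or '1M' to integers; '-' becomes None."""
    text = text.strip()
    if text == "-":
        return None
    multipliers = {"K": 1_000, "M": 1_000_000}
    suffix = text[-1].upper()
    if suffix in multipliers:
        return int(float(text[:-1]) * multipliers[suffix])
    return int(text)


# Rows copied from the table above (the Download column is omitted).
ROWS = [
    ("People Daily & Children's Fairy Tale", "news & tale", "Cloze", "word", "28K", "100K"),
    ("WebQA", "Web", "User log", "entity", "-", "42K"),
    ("CMRC 2017", "news", "Cloze & Query", "word", "-", "364K"),
    ("DuReader", "Web", "User log", "free form", "1M", "200K"),
    ("CMRC 2018", "Wiki", "Query", "Span", "-", "18K"),
    ("DRCD", "Wiki", "Query", "Span", "-", "34K"),  # traditional Chinese
    ("C^3", "mixed", "Query", "choice", "14K", "24K"),
    ("CMRC 2019", "Story", "cloze", "Sentence", "1K", "100K"),
    ("ChID", "varies", "cloze", "idiom", "580K", "729K"),
]

DATASETS = [
    Dataset(name, genre, query, answer, parse_count(docs), parse_count(queries))
    for name, genre, query, answer, docs, queries in ROWS
]


def by_answer_type(answer_type: str) -> List[str]:
    """Return the names of datasets whose answer type matches, ignoring case."""
    wanted = answer_type.lower()
    return [d.name for d in DATASETS if d.answer_type.lower() == wanted]


if __name__ == "__main__":
    print(by_answer_type("span"))
```

Calling `by_answer_type("span")` selects the two span-extraction datasets, CMRC 2018 and DRCD.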

State-of-the-art Systems

Here I list several state-of-the-art systems (published and unpublished) for these datasets. I have very likely missed some, so feel free to report new entries in the Issues tab.

People Daily & Children's Fairy Tale

| System | PD-DEV | PD-TEST | CFT-TEST-AUTO | CFT-TEST-HUMAN | Note |
| --- | --- | --- | --- | --- | --- |
| SAW Reader (Zhang et al., 2018) | 72.8 | 75.1 | - | 43.8 | - |
| CAW Reader (Zhang et al., 2018) | 69.4 | 70.5 | - | 39.7 | - |
| CAS Reader (Cui et al., 2016) | 65.2 | 68.1 | 41.3 | 35.0 | - |
| AS Reader (Cui et al., 2016) | 64.1 | 67.2 | 40.9 | 33.1 | - |

CMRC 2017

Leaderboard: https://hfl-rc.github.io/cmrc2017/leaderboard/

Cloze Track

| System | DEV | TEST | Note |
| --- | --- | --- | --- |
| 6ESTATES PTE LTD (ensemble) | 81.85 | 81.90 | - |
| SJTU BCMI-NLP (ensemble) | 78.35 | 80.67 | - |
| YunSiChuangZhi (ensemble) | 79.20 | 80.27 | - |
| SAW Reader (Zhang et al., 2018) | 78.95 | 78.80 | - |
| CAW Reader (Zhang et al., 2018) | 77.95 | 78.50 | - |
| Word + Char + BPE-FRQ (Zhang et al., 2018) | 79.05 | 78.83 | - |

User Query Track

| System | DEV | TEST | Note |
| --- | --- | --- | --- |
| ECNU (ensemble) | 90.45 | 69.53 | - |
| SXU-3 (single model) | 47.80 | 49.07 | - |
| ZZU (single model) | 31.10 | 32.53 | - |

DuReader

Leaderboard: http://ai.baidu.com/broad/leaderboard?dataset=dureader

| System | ROUGE-L | BLEU-4 | Note |
| --- | --- | --- | --- |
| AliReader | 63.48 | 61.54 | - |
| NI-Reader (ensemble) | 63.38 | 59.23 | - |
| mrc_try_mingyan (single model) | 62.20 | 59.72 | - |
| (Yan et al., 2018) | 50.71 | 49.39 | - |
| (Li et al., 2018) | 44.95 | 42.68 | - |
| (Wang et al., 2018) | 44.18 | 40.97 | - |
| (Xu et al., 2018) | 39.60 | 34.76 | - |
| Match-LSTM (He et al., 2018) | 39.2 | 31.9 | - |
| BiDAF (He et al., 2018) | 39.0 | 31.8 | - |
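
For readers unfamiliar with the ROUGE-L column, the sketch below shows how a ROUGE-L F-score can be computed from the longest common subsequence (LCS) of a candidate answer and a reference answer. It is a minimal character-level illustration, not the official DuReader scorer: comparing individual characters instead of segmented words, the single-reference setup, and the default `beta=1.0` are all assumptions made here. Multiply the result by 100 to compare it with the scale used in the table.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """Character-level ROUGE-L F-score in [0, 1]; beta weights recall against precision."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


if __name__ == "__main__":
    # Illustrative strings only; they are not taken from any dataset.
    print(round(100 * rouge_l("机器阅读理解", "阅读理解数据"), 2))
```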

CMRC 2018

Leaderboard: https://hfl-rc.github.io/cmrc2018/open_challenge/

| System | DEV-EM | DEV-F1 | TEST-EM | TEST-F1 | CHALLENGE-EM | CHALLENGE-F1 | Note |
| --- | --- | --- | --- | --- | --- | --- | --- |
| P-Reader (single model) | 59.894 | 81.499 | 65.189 | 84.386 | 15.079 | 39.583 | - |
| GM-Reader (ensemble) | 58.931 | 80.069 | 64.045 | 83.046 | 15.675 | 37.315 | - |
| MCA-Reader (ensemble) | 66.698 | 85.538 | 71.175 | 88.090 | 15.476 | 37.104 | - |
| Z-Reader (single model) | 79.776 | 92.696 | 74.178 | 88.145 | 13.889 | 37.422 | - |
| SRC->DS(±) (Yang et al., 2019) | 49.2 | 65.4 | - | - | - | - | - |

More detailed results can be found in the CMRC 2018 Overview. Note that some of the submissions also use the development set for training.
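
The EM and F1 columns above can be read as exact-match and partial-overlap scores between a predicted answer span and the reference answers. The sketch below is a simplified illustration only: it uses a bag-of-characters F1 and takes the best score over several acceptable references, which is an assumption made here and may differ from the official evaluation script. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the two answers are identical after trimming whitespace, else 0.0."""
    return float(prediction.strip() == reference.strip())


def char_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of character-level precision and recall (bag-of-characters overlap)."""
    pred_chars = list(prediction.strip())
    ref_chars = list(reference.strip())
    if not pred_chars or not ref_chars:
        return float(pred_chars == ref_chars)
    overlap = sum((Counter(pred_chars) & Counter(ref_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)


def score(predictions: List[str], references: List[List[str]]) -> Dict[str, float]:
    """Average EM and F1 on a 0-100 scale, using the best-matching reference per question."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    em_total = 0.0
    f1_total = 0.0
    for pred, refs in zip(predictions, references):
        em_total += max(exact_match(pred, ref) for ref in refs)
        f1_total += max(char_f1(pred, ref) for ref in refs)
    n = len(predictions)
    return {"EM": 100 * em_total / n, "F1": 100 * f1_total / n}


if __name__ == "__main__":
    # Illustrative answers only; they are not taken from any dataset.
    print(score(["北京", "上海市"], [["北京"], ["上海"]]))
```

In this toy input the first prediction matches exactly and the second only partially, so EM counts one of the two answers while F1 also gives partial credit to the second.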

DRCD

| System | DEV-EM | DEV-F1 | TEST-EM | TEST-F1 | Note |
| --- | --- | --- | --- | --- | --- |
| SRC + DS(±) (Yang et al., 2019) | 55.4 | 67.7 | - | - | - |
| r-net (single model) | - | - | 29.1 | 44.4 | - |

C^3

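| System | DEV-1A | TEST-1A | DEV-1B | TEST-1B | DEV-2A | TEST-2A | DEV-2B | TEST-2B | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT_CN (Sun et al., 2019) | 63.0 | 62.6 | 62.3 | 62.1 | 36.7 | 26.2 | 34.7 | 31.3 | - |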

Chinese Reading Comprehension Evaluations and Competitions

Alongside the release of these datasets, several Chinese reading comprehension evaluation workshops and competitions have further accelerated research on this topic.

  1. The First Evaluation Workshop on Chinese Machine Reading Comprehension (CMRC 2017)
    Host: CIPS-CL, Joint Laboratory of HIT and iFLYTEK Research (HFL), iFLYTEK Co. Ltd
    Competition Type: Cloze-style RC, User Query RC
  2. The Second Evaluation Workshop on Chinese Machine Reading Comprehension (CMRC 2018)
    Host: CIPS-CL, Joint Laboratory of HIT and iFLYTEK Research (HFL), iFLYTEK Co. Ltd
    Competition Type: Span-Extraction RC
  3. 2018 NLP Challenge on Machine Reading Comprehension
    Host: CCF, CIPSC, Baidu Inc.
    Competition Type: Open-Domain RC
  4. CIPS-SOGOU QA Competition
    Host: CIPSC, SOGOU
    Competition Type: Factoid QA, Non-Factoid QA
  5. The Third Evaluation Workshop on Chinese Machine Reading Comprehension (CMRC 2019)
    Host: CIPS-CL, Joint Laboratory of HIT and iFLYTEK Research (HFL), iFLYTEK Co. Ltd
    Competition Type: Sentence Cloze
  6. 2019 NLP Language and Intelligence Challenge
    Host: CCF, CIPSC, Baidu Inc.
    Competition Type: Open-Domain RC
  7. Chinese Idiom Understanding Contest
    Host: CCF, Tsinghua University
    Competition Type: Cloze Test

Contact

If you encounter any problems, please leave a message in GitHub Issues.

Disclaimer

Any subjective comments in this repository represent only the views of the owner (ymcui) and do not represent the claims of any organization.
