nlpdata / C3

Licence: other
Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to C3

Dstc7 End To End Conversation Modeling
Grounded conversational dataset for end-to-end conversational AI (official DSTC7 data)
Stars: ✭ 141 (+39.6%)
Mutual labels:  dialogue, dataset
Dstc8 Schema Guided Dialogue
The Schema-Guided Dialogue Dataset
Stars: ✭ 277 (+174.26%)
Mutual labels:  dialogue, dataset
Dream
DREAM: A Challenge Dataset and Models for Dialogue-Based Reading Comprehension
Stars: ✭ 60 (-40.59%)
Mutual labels:  dialogue, dataset
Eval Vislam
Toolkit for VI-SLAM evaluation.
Stars: ✭ 89 (-11.88%)
Mutual labels:  dataset
Msr Nlp Projects
This is a list of open-source projects at Microsoft Research NLP Group
Stars: ✭ 92 (-8.91%)
Mutual labels:  dialogue
Deepweeds
A Multiclass Weed Species Image Dataset for Deep Learning
Stars: ✭ 96 (-4.95%)
Mutual labels:  dataset
Universal Data Tool
Collaborate & label any type of data, images, text, or documents, in an easy web interface or desktop app.
Stars: ✭ 1,356 (+1242.57%)
Mutual labels:  dataset
Cesi
WWW 2018: CESI: Canonicalizing Open Knowledge Bases using Embeddings and Side Information
Stars: ✭ 85 (-15.84%)
Mutual labels:  dataset
Botml
Powerful markup language for modern chatbots.
Stars: ✭ 98 (-2.97%)
Mutual labels:  dialogue
Bond
BOND: BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision
Stars: ✭ 96 (-4.95%)
Mutual labels:  dataset
Body reconstruction references
Paper, dataset and code collection on human body reconstruction
Stars: ✭ 96 (-4.95%)
Mutual labels:  dataset
Face landmark dnn
Face Landmark Detector based on Mobilenet V1
Stars: ✭ 92 (-8.91%)
Mutual labels:  dataset
Exposure correction
Reference code for the paper "Learning Multi-Scale Photo Exposure Correction", CVPR 2021.
Stars: ✭ 98 (-2.97%)
Mutual labels:  dataset
Core50
CORe50: a new Dataset and Benchmark for Continual Learning
Stars: ✭ 91 (-9.9%)
Mutual labels:  dataset
Dataset
Crop/Weed Field Image Dataset
Stars: ✭ 98 (-2.97%)
Mutual labels:  dataset
Hands Detection
Hands video tracker using the Tensorflow Object Detection API and a Faster RCNN model. The data used is the Hand Dataset from the University of Oxford.
Stars: ✭ 87 (-13.86%)
Mutual labels:  dataset
Self dialogue corpus
The Self-dialogue Corpus - a collection of self-dialogues across music, movies and sports
Stars: ✭ 98 (-2.97%)
Mutual labels:  dialogue
Persian Swear Words
Dataset of inappropriate and profane Persian words for filtering text
Stars: ✭ 95 (-5.94%)
Mutual labels:  dataset
Pytreebank
😡😇 Stanford Sentiment Treebank loader in Python
Stars: ✭ 93 (-7.92%)
Mutual labels:  dataset
Dataloaders
Pytorch and TensorFlow data loaders for several audio datasets
Stars: ✭ 97 (-3.96%)
Mutual labels:  dataset

C3

Overview

This repository maintains C3, the first free-form multiple-Choice Chinese machine reading Comprehension dataset.

@article{sun2019investigating,
  title={Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension},
  author={Sun, Kai and Yu, Dian and Yu, Dong and Cardie, Claire},
  journal={Transactions of the Association for Computational Linguistics},
  year={2020},
  url={https://arxiv.org/abs/1904.09679v3}
}

Files in this repository:

  • license.txt: the license of C3.
  • data/c3-{m,d}-{train,dev,test}.json: the dataset files, where m and d represent "mixed-genre" and "dialogue", respectively. The data format is as follows; a minimal Python loading sketch appears after this file list.
[
  [
    [
      document 1
    ],
    [
      {
        "question": document 1 / question 1,
        "choice": [
          document 1 / question 1 / answer option 1,
          document 1 / question 1 / answer option 2,
          ...
        ],
        "answer": document 1 / question 1 / correct answer option
      },
      {
        "question": document 1 / question 2,
        "choice": [
          document 1 / question 2 / answer option 1,
          document 1 / question 2 / answer option 2,
          ...
        ],
        "answer": document 1 / question 2 / correct answer option
      },
      ...
    ],
    document 1 / id
  ],
  [
    [
      document 2
    ],
    [
      {
        "question": document 2 / question 1,
        "choice": [
          document 2 / question 1 / answer option 1,
          document 2 / question 1 / answer option 2,
          ...
        ],
        "answer": document 2 / question 1 / correct answer option
      },
      {
        "question": document 2 / question 2,
        "choice": [
          document 2 / question 2 / answer option 1,
          document 2 / question 2 / answer option 2,
          ...
        ],
        "answer": document 2 / question 2 / correct answer option
      },
      ...
    ],
    document 2 / id
  ],
  ...
]
  • annotation/c3-{m,d}-{dev,test}.txt: question type annotations. Each file contains 150 annotated instances. We adopt the following abbreviations:
Category               Abbreviation   Question Type
Matching               m              Matching
Prior knowledge        l              Linguistic
                       s              Domain-specific
                       c-a            Arithmetic
                       c-o            Connotation
                       c-e            Cause-effect
                       c-i            Implication
                       c-p            Part-whole
                       c-d            Precondition
                       c-h            Scenario
                       c-n            Other
Supporting Sentences   0              Single sentence
                       1              Multiple sentences
                       2              Independent
  • bert folder: code of the Chinese BERT, BERT-wwm, and BERT-wwm-ext baselines. The code is derived from this repository. Below are detailed instructions for fine-tuning Chinese BERT on C3.
    1. Download and unzip the pre-trained Chinese BERT from here, and set the environment variable for BERT:
       export BERT_BASE_DIR=/PATH/TO/BERT/DIR
    2. Copy the dataset folder data to bert/.
    3. In bert, convert the TensorFlow checkpoint to PyTorch:
       python convert_tf_checkpoint_to_pytorch.py --tf_checkpoint_path=$BERT_BASE_DIR/bert_model.ckpt --bert_config_file=$BERT_BASE_DIR/bert_config.json --pytorch_dump_path=$BERT_BASE_DIR/pytorch_model.bin
    4. Fine-tune on C3:
       python run_classifier.py --task_name c3 --do_train --do_eval --data_dir . --vocab_file $BERT_BASE_DIR/vocab.txt --bert_config_file $BERT_BASE_DIR/bert_config.json --init_checkpoint $BERT_BASE_DIR/pytorch_model.bin --max_seq_length 512 --train_batch_size 24 --learning_rate 2e-5 --num_train_epochs 8.0 --output_dir c3_finetuned --gradient_accumulation_steps 3
    5. The resulting fine-tuned model, predictions, and evaluation results are stored in bert/c3_finetuned.
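
Given the nested-list format spelled out above, the dataset files can be read with nothing but the standard library. The following is a minimal sketch, not code from this repository: the file path follows the data bullet above, while the flattened record layout and variable names are illustrative.

import json

# Load the dialogue-genre training split (see the data bullet above).
with open("data/c3-d-train.json", encoding="utf-8") as f:
    dataset = json.load(f)

# Each top-level entry is [document, question list, document id],
# where document is itself a list of strings (utterances or passage text).
records = []
for document, questions, doc_id in dataset:
    text = "\n".join(document)  # keep utterance boundaries for the dialogue genre
    for q in questions:
        records.append({
            "id": doc_id,
            "document": text,
            "question": q["question"],
            "choices": q["choice"],
            "answer": q["answer"],  # the correct option, given as a string
        })

print(f"{len(records)} questions loaded from {len(dataset)} documents")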

Note:

  1. Fine-tuning Chinese BERT-wwm or BERT-wwm-ext follows the same steps; only the pre-trained language model you download differs.
  2. There is randomness in model training, so you may want to run the fine-tuning multiple times and choose the best model based on development set performance. You can set a different seed for each run by specifying --seed when executing run_classifier.py (a seed-sweep sketch follows these notes).
  3. Depending on your hardware, you may need to change gradient_accumulation_steps: each optimizer step processes train_batch_size examples split into gradient_accumulation_steps forward/backward passes (24 / 3 = 8 examples per pass in the command above), so increasing it reduces GPU memory use without changing the effective batch size.
  4. The code has been tested with Python 3.6 and PyTorch 1.0.
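
For note 2, one convenient way to compare seeds is to wrap the step-4 command in a small driver script. This is a sketch under stated assumptions, not part of the repository: it reuses only the flags shown in step 4 plus --seed from note 2, and the per-seed output directory naming is hypothetical.

import os
import subprocess

BERT_DIR = os.environ["BERT_BASE_DIR"]

# Run the step-4 fine-tuning command once per seed; each run writes to its
# own output directory so development-set results can be compared afterwards.
for seed in (1, 2, 3):
    subprocess.run([
        "python", "run_classifier.py",
        "--task_name", "c3", "--do_train", "--do_eval",
        "--data_dir", ".",
        "--vocab_file", f"{BERT_DIR}/vocab.txt",
        "--bert_config_file", f"{BERT_DIR}/bert_config.json",
        "--init_checkpoint", f"{BERT_DIR}/pytorch_model.bin",
        "--max_seq_length", "512",
        "--train_batch_size", "24",
        "--learning_rate", "2e-5",
        "--num_train_epochs", "8.0",
        "--gradient_accumulation_steps", "3",
        "--seed", str(seed),
        "--output_dir", f"c3_finetuned_seed{seed}",  # hypothetical naming
    ], check=True)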