
DREAM

Overview

This repository maintains DREAM, a multiple-choice Dialogue-based REAding comprehension exaMination dataset.

@article{sundream2018,
  title={{DREAM}: A Challenge Dataset and Models for Dialogue-Based Reading Comprehension},
  author={Sun, Kai and Yu, Dian and Chen, Jianshu and Yu, Dong and Choi, Yejin and Cardie, Claire},
  journal={Transactions of the Association for Computational Linguistics},
  year={2019},
  url={https://arxiv.org/abs/1902.00164v1}
}

Files in this repository:

  • data folder: the dataset.
  • annotation folder: question type annotations.
  • dsw++ folder: code of DSW++.
  • ftlm++ folder: code of FTLM++.
  • license.txt: the license of DREAM.
  • websites.txt: list of websites used for the data collection of DREAM.

Dataset

data/train.json, data/dev.json, and data/test.json are the training, development, and test sets, respectively. Their format is as follows:

[
  [
    [
      dialogue 1 / turn 1,
      dialogue 1 / turn 2,
      ...
    ],
    [
      {
        "question": dialogue 1 / question 1,
        "choice": [
          dialogue 1 / question 1 / answer option 1,
          dialogue 1 / question 1 / answer option 2,
          dialogue 1 / question 1 / answer option 3
        ],
        "answer": dialogue 1 / question 1 / correct answer option
      },
      {
        "question": dialogue 1 / question 2,
        "choice": [
          dialogue 1 / question 2 / answer option 1,
          dialogue 1 / question 2 / answer option 2,
          dialogue 1 / question 2 / answer option 3
        ],
        "answer": dialogue 1 / question 2 / correct answer option
      },
      ...
    ],
    dialogue 1 / id
  ],
  [
    [
      dialogue 2 / turn 1,
      dialogue 2 / turn 2,
      ...
    ],
    [
      {
        "question": dialogue 2 / question 1,
        "choice": [
          dialogue 2 / question 1 / answer option 1,
          dialogue 2 / question 1 / answer option 2,
          dialogue 2 / question 1 / answer option 3
        ],
        "answer": dialogue 2 / question 1 / correct answer option
      },
      {
        "question": dialogue 2 / question 2,
        "choice": [
          dialogue 2 / question 2 / answer option 1,
          dialogue 2 / question 2 / answer option 2,
          dialogue 2 / question 2 / answer option 3
        ],
        "answer": dialogue 2 / question 2 / correct answer option
      },
      ...
    ],
    dialogue 2 / id
  ],
  ...
]
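Since the files are plain JSON, the structure above can be consumed directly with Python's json module. A minimal sketch (the dialogue content and the id "5-510" below are invented for illustration, not taken from the dataset):

```python
import json

# A tiny example following the schema above (illustrative, not real DREAM data).
sample = json.loads("""
[
  [
    ["M: How was your weekend?", "W: Great. I went hiking."],
    [
      {
        "question": "What did the woman do over the weekend?",
        "choice": ["She went hiking.", "She stayed home.", "She went shopping."],
        "answer": "She went hiking."
      }
    ],
    "5-510"
  ]
]
""")

# Each top-level entry is [turns, questions, dialogue id].
for turns, questions, dialogue_id in sample:
    print(f"Dialogue {dialogue_id}: {len(turns)} turns, {len(questions)} questions")
    for q in questions:
        # The correct answer is stored as the option string itself,
        # so its index can be recovered from the choice list.
        answer_idx = q["choice"].index(q["answer"])
        print(f"  Q: {q['question']} -> option {answer_idx}")
```

Note that "answer" repeats one of the strings in "choice" verbatim rather than giving an index.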

Question Type Annotations

annotation/{annotator1,annotator2}_{dev,test}.json contain the question type annotations from two annotators for 25% of the questions in the development and test sets.

Following the format described above, question indices start from 1.

We adopt the following abbreviations:

Abbreviation  Question Type
m             matching
s             summary
l             logic
a             arithmetic
c             commonsense
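When processing the annotation files, the abbreviations can be expanded with a small lookup table. A minimal sketch (the dictionary and helper below simply mirror the table above; they are not part of the repository's code):

```python
# Mapping from annotation abbreviations to full question-type names,
# mirroring the table above.
QUESTION_TYPES = {
    "m": "matching",
    "s": "summary",
    "l": "logic",
    "a": "arithmetic",
    "c": "commonsense",
}

def expand(abbreviation):
    """Return the full question-type name for an annotation label."""
    return QUESTION_TYPES[abbreviation]

print(expand("c"))  # commonsense
```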

Code

  • DSW++

    1. Copy the data folder data to dsw++/.
    2. Download numberbatch-en-17.06.txt.gz from https://github.com/commonsense/conceptnet-numberbatch, and put it into dsw++/data/.
    3. In dsw++, execute python run.py.
    4. Execute python evaluate.py to get the accuracy on the test set.
  • FTLM++

    1. Download the pre-trained language model from https://github.com/openai/finetune-transformer-lm, and copy the model folder model to ftlm++/.
    2. Copy the data folder data to ftlm++/.
    3. In ftlm++, execute python train.py --submit. You may also want to specify --n_gpu (e.g., 4) and --n_batch (e.g., 2) depending on your environment.
    4. Execute python evaluate.py to get the accuracy on the test set.
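Both evaluate.py scripts report accuracy on the test set. As a rough sketch of what such an evaluation amounts to over the JSON format described above (the function and the prediction format here are illustrative, not the actual evaluate.py interface):

```python
def accuracy(dataset, predictions):
    """Fraction of questions answered correctly.

    dataset: list in the DREAM JSON format described above.
    predictions: dict mapping (dialogue_id, question_index) to a predicted
        option string. Question indices start from 1, matching the
        annotation files.
    """
    correct = total = 0
    for turns, questions, dialogue_id in dataset:
        for i, q in enumerate(questions, start=1):
            total += 1
            if predictions.get((dialogue_id, i)) == q["answer"]:
                correct += 1
    return correct / total if total else 0.0

# Illustrative data, not taken from the dataset.
dataset = [
    [
        ["W: The movie starts at eight.", "M: Then we should leave by seven."],
        [{"question": "When does the movie start?",
          "choice": ["At seven.", "At eight.", "At nine."],
          "answer": "At eight."}],
        "example-1",
    ]
]
predictions = {("example-1", 1): "At eight."}
print(accuracy(dataset, predictions))  # 1.0
```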

Note: The results you get may differ slightly from those reported in the paper. For example, the dev and test accuracy for DSW++ in this repository is 51.2 and 50.2, respectively, while the paper reports 51.4 and 50.1. This is because (1) we refactored the code with different dependencies to make it portable, and (2) some of the code is non-deterministic due to GPU non-determinism.

Environment: The code has been tested with Python 3.6/3.7 and TensorFlow 1.4.

Other Useful Code

You can refer to this repository for a finetuned transformer baseline based on BERT.
