jfainberg / Self_dialogue_corpus

Licence: bsd-3-clause
The Self-dialogue Corpus - a collection of self-dialogues across music, movies and sports

The Self-dialogue Corpus

This is an early release of the Self-dialogue Corpus containing 24,165 conversations, or 3,653,313 words, across 23 topics. For more information on the data, please see our corpus paper or our submission to the Alexa Prize.

Statistics

Category                  Count
Topics                    23
Conversations             24,165
Words                     3,653,313
Turns                     141,945
Unique users              2,717
Conversations per user    ~9
Unique tokens             117,068
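
As a quick sanity check, the averages implied by the table can be recomputed from the raw counts. The sketch below is illustrative only and is not part of the repository; the names STATS and derived_averages are hypothetical, and the figures are copied from the table above.

```python
# Figures copied from the statistics table above.
STATS = {
    "conversations": 24_165,
    "words": 3_653_313,
    "turns": 141_945,
    "unique_users": 2_717,
}


def derived_averages(stats):
    """Return per-user, per-conversation and per-turn averages as floats."""
    return {
        "conversations_per_user": stats["conversations"] / stats["unique_users"],
        "words_per_conversation": stats["words"] / stats["conversations"],
        "turns_per_conversation": stats["turns"] / stats["conversations"],
        "words_per_turn": stats["words"] / stats["turns"],
    }


if __name__ == "__main__":
    for name, value in derived_averages(STATS).items():
        print(f"{name}: {value:.1f}")
```

The conversations-per-user figure rounds to the ~9 reported in the table; the other averages are derived here and are not stated in the original release notes.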

Topics include movies, music, sports, and subtopics within these.

Using the data

  • corpus contains the raw CSVs from Amazon Mechanical Turk, sorted by individual tasks (topics);
  • blocked_workers.txt lists workers who did not comply with the task requirements (their data is omitted by default);
  • get_data.py is a preprocessing script which will format the CSVs into text (by default saved to dialogues), along with various options (see below).
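
To see which topics are available before preprocessing, it may help to list the subdirectories of corpus. The following is a minimal sketch, not part of the repository: list_topics is a hypothetical helper, and it assumes the raw files carry a .csv extension and sit directly inside each topic subdirectory.

```python
from pathlib import Path


def list_topics(corpus_dir="corpus"):
    """Map each topic subdirectory to the number of CSV files it contains.

    Assumption: CSVs end in ".csv" and live directly in the topic directory.
    Returns an empty dict if corpus_dir does not exist.
    """
    root = Path(corpus_dir)
    if not root.is_dir():
        return {}
    return {
        topic.name: sum(1 for _ in topic.glob("*.csv"))
        for topic in sorted(root.iterdir())
        if topic.is_dir()
    }


if __name__ == "__main__":
    for topic, n_files in list_topics().items():
        print(f"{topic}: {n_files} CSV file(s)")
```

The topic names returned this way are the values that would be passed to the topic-filtering options of get_data.py.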

get_data.py

Example usage: python get_data.py. By default, this reads from corpus and writes to dialogues.

Optional arguments:

  • --inDir directory to read the corpus from (default: corpus);
  • --outDir directory to write the processed files to (default: dialogues);
  • --output-naming whether to name output files with integers (integer) or by assignment_id (assignment_id);
  • --remove-punctuation removes punctuation from the output;
  • --set-case sets case of output to original, upper or lower;
  • --exclude-topic excludes any of the topics (or subdirectories of corpus), e.g. --exclude-topic music;
  • --include-only includes only the given topics, e.g. --include-only music.
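
The options above can be combined in a single call. The sketch below assembles such a command from Python; build_command is a hypothetical helper, not part of the repository. It assumes that --remove-punctuation is a switch that takes no value and that each topic option is given a single topic, neither of which is confirmed here, so check python get_data.py --help before relying on it.

```python
import shlex
import sys

# Allowed values as listed in the option descriptions above.
CASES = ("original", "upper", "lower")
NAMINGS = ("integer", "assignment_id")


def build_command(in_dir=None, out_dir=None, output_naming=None,
                  remove_punctuation=False, set_case=None,
                  exclude_topic=None, include_only=None):
    """Assemble an argument list for get_data.py from the documented options."""
    if set_case is not None and set_case not in CASES:
        raise ValueError(f"set_case must be one of {CASES}")
    if output_naming is not None and output_naming not in NAMINGS:
        raise ValueError(f"output_naming must be one of {NAMINGS}")

    cmd = [sys.executable, "get_data.py"]
    if in_dir is not None:
        cmd += ["--inDir", in_dir]
    if out_dir is not None:
        cmd += ["--outDir", out_dir]
    if output_naming is not None:
        cmd += ["--output-naming", output_naming]
    if remove_punctuation:
        # Assumption: this option is a bare switch with no argument.
        cmd.append("--remove-punctuation")
    if set_case is not None:
        cmd += ["--set-case", set_case]
    if exclude_topic is not None:
        cmd += ["--exclude-topic", exclude_topic]
    if include_only is not None:
        cmd += ["--include-only", include_only]
    return cmd


if __name__ == "__main__":
    # Lower-cased, punctuation-free dialogues for the music topic only.
    cmd = build_command(set_case="lower", remove_punctuation=True,
                        include_only="music")
    print(shlex.join(cmd))
```

To actually run the script, the returned list can be passed to subprocess.run(cmd, check=True) from the repository root.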

Citation

For research using this data, please cite:

@article{fainberg2018talking,
  title={Talking to myself: self-dialogues as data for conversational agents},
  author={Fainberg, Joachim and Krause, Ben and Dobre, Mihai and Damonte, Marco and Kahembwe, Emmanuel and Duma, Daniel and Webber, Bonnie and Fancellu, Federico},
  journal={arXiv preprint arXiv:1809.06641},
  year={2018}
}
@article{krause2017edina,
  title={Edina: Building an Open Domain Socialbot with Self-dialogues},
  author={Krause, Ben and Damonte, Marco and Dobre, Mihai and Duma, Daniel and Fainberg, Joachim and Fancellu, Federico and Kahembwe, Emmanuel and Cheng, Jianpeng and Webber, Bonnie},
  journal={Alexa Prize Proceedings},
  year={2017}
}