sz128 / Nlu_datasets_with_task_oriented_dialogue
datasets of natural language understanding and dialogue state tracking
NLU datasets with task-oriented dialogue
Datasets for natural language understanding and dialogue state tracking in task-oriented dialogue, which can be used in research. Other surveys of dialogue-system datasets exist, such as AtmaHou's Task-Oriented-Dialogue-Dataset-Survey (I am one of the contributors), but this list focuses on how to build a semantic parser for a spoken dialogue system.
If you want to know more about NLU of task-oriented dialogue, please see recommended papers.
There is an implementation of joint training of slot filling and intent detection for SLU, which is evaluated on ATIS and SNIPS datasets.
Introduction
Item | Description | Example |
---|---|---|
NLU | Natural Language Understanding, which covers text classification, sequence labelling and semantic parsing tasks. | |
DST | Dialogue State Tracking | DSTC 2 |
domain | The dialogue domain | movie, music, flight, restaurant, ... |
intent | An abstract meaning that refers to a whole sentence or a sub-sentence. | The intent of "show me a movie named Titanic" is "find_movie" |
slot | An attribute or key that takes a value. | "show me a movie named Titanic" has a slot-value pair "movie_name = Titanic" |
act type | A general speech act | inform, deny, confirm, request, hello, bye, ... |
dialogue act | act_type(slot=value,...), see https://github.com/matthen/dstc/blob/master/handbook.pdf | inform(movie_name = Titanic), request(price), ... |
- Intent detection (intent classification): a sentence classification task.
- Slot tagging: a sequence labelling task.
- Slot filling: equivalent to slot tagging when every slot value can be aligned to a span of the input sentence. Otherwise, the slot value has to be predicted by classification or generation.
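As an illustration of how slot tagging yields slot-value pairs, here is a minimal sketch that assumes token-level BIO tags (`B-slot`, `I-slot`, `O`), a common encoding for sequence labelling. The function name `bio_to_slots` and the tag strings are illustrative assumptions, not part of any dataset's official tooling.

```python
from typing import List, Optional, Tuple


def bio_to_slots(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (slot, value) pairs from token-level BIO tags (assumed encoding)."""
    slots: List[Tuple[str, str]] = []
    current_slot: Optional[str] = None
    current_words: List[str] = []

    def flush() -> None:
        # Store the span that is currently open, if any, and reset the buffer.
        nonlocal current_slot, current_words
        if current_slot is not None:
            slots.append((current_slot, " ".join(current_words)))
        current_slot, current_words = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_slot, current_words = tag[2:], [token]
        elif tag.startswith("I-") and current_slot == tag[2:]:
            current_words.append(token)
        else:
            # "O", or an "I-" tag that does not continue the open span:
            # close the open span and ignore this token.
            flush()
    flush()
    return slots


tokens = "show me a movie named Titanic".split()
tags = ["O", "O", "O", "O", "O", "B-movie_name"]
example = {"intent": "find_movie", "slots": bio_to_slots(tokens, tags)}
print(example)
```

This only works when every value appears verbatim in the sentence, which is the case where slot filling reduces to slot tagging.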
Datasets with single turn (not a dialogue)
dataset | domain | semantic annotation | tasks | url |
---|---|---|---|---|
ATIS | book flight | intent, slot | Intent classification, slot tagging | https://github.com/yvchen/JointSLU |
MIT corpus | Restaurant & Movie | slot | slot tagging | https://groups.csail.mit.edu/sls/downloads/ |
SNIPS | Playlist, Restaurant, Weather, Music, RateBook, etc. | intent, slot | Intent classification, slot tagging | https://github.com/snipsco/nlu-benchmark/tree/master/2017-06-custom-intent-engines |
Facebook TOP semantic parsing | navigation and event | hierarchical intent, slot | constituency parsing | http://fb.me/semanticparsingdialog, https://arxiv.org/abs/1810.07942 |
Facebook Multilingual Task Oriented Dataset | ALARM, REMINDER, and WEATHER | intent, slot | Intent classification, slot tagging | https://download.pytorch.org/data/multilingual_task_oriented_dialog_slotfilling.zip |
snips_slu_data_v1.0 | SmartLights, SmartSpeaker | intent, slot | Intent classification, slot tagging | https://github.com/snipsco/spoken-language-understanding-research-datasets |
SMP2017-ECDT (in Chinese) | flight, hotel, Chit-chat | intent | Intent classification | http://ir.hit.edu.cn/SMP2017-ECDT, https://github.com/HITlilingzhi/SMP2017ECDT-DATA |
E-commerce Shopping Assistant (ECSA) (in Chinese) | E-commerce Shopping | slot | slot tagging | https://github.com/pangolulu/DCMTL |
Clinc Intent Detection | Banking, Work, Meta, Auto, Travel, Home, Utility, Kitchen, Small Talk, Credit Cards | intent | Intent classification and out-of-scope detection | https://www.aclweb.org/anthology/attachments/D19-1131.Attachment.zip |
FewJoint (in Chinese) | Many domains for few-shot learning | intent, slot | Intent classification, slot tagging | Dataset; Baseline |
Datasets with multiple turns (dialogue with context)
dataset | #domains | cross_domains | semantic annotation | NLU/DST tasks | url |
---|---|---|---|---|---|
cam DSTC 2&3 | 2 | No | dialogue act | NLU (slot filling), DST (slot-value pairs) | https://github.com/matthen/dstc |
DSTC 4 | ~5 | Yes | speech action, slot | NLU (slot tagging), DST (slot-value pairs) | (challenge participants only) http://www.colips.org/workshop/dstc4/ |
google Sim-R/Sim-M/Sim-gen | 3 | No | act type, slot | NLU (slot tagging), DST (slot-value pairs) | https://github.com/google-research-datasets/simulated-dialogue |
cam MultiWOZ 2.0/2.1 | 5 | Yes | multi-domains, slot-value pairs | DST (slot-value pairs) | http://dialogue.mi.eng.cam.ac.uk/index.php/corpus/ |
maluuba Frames | 1 | No | intent, dialogue act | NLU (intent classification, slot tagging), DST (slot-value pairs) | https://datasets.maluuba.com/Frames/dl |
Microsoft Dialogue Challenge | 3 | No | dialogue act | NLU (slot tagging) | https://github.com/xiul-msr/e2e_dialog_challenge |
dstc8-schema-guided-dialogue | 17 | Yes | multi-domains, slot-value pairs, request-slots | DST | https://github.com/google-research-datasets/dstc8-schema-guided-dialogue |
MultiDoGo | 6 | Yes | over 81K dialogues harvested across six domains | NLU, DST | https://github.com/awslabs/multi-domain-goal-oriented-dialogues-dataset |
Taskmaster-1/2 | 6+7 | No | 13,215 + 17,289 task-based dialogs comprising multiple domains | NLU/DST | https://github.com/google-research-datasets/Taskmaster |
CrossWOZ (in Chinese) | 5 | Yes | 5,012 task-based dialogs comprising five domains | NLU/DST | https://github.com/thu-coai/CrossWOZ |
Details
More information about each dataset.
ATIS
- single turn;
- input sentences: natural language;
- data size (single domain of "flight information searching"):
  - training set: 4978 utterances;
  - test set: 893 utterances;
- semantic annotation: intent (sentence class), slot (sequence labelling)
  - intent number: 18
  - slot number: 83
- Download: https://github.com/yvchen/JointSLU
MIT corpus
- single turn;
- input sentences: natural language;
- data size:
  - MIT_Restaurant domain:
    - training set: 7660 utterances;
    - test set: 1521 utterances;
  - MIT_Movie domain (simple query):
    - training set: 9775 utterances;
    - test set: 2443 utterances;
  - MIT_Movie domain (complex query):
    - training set: 7816 utterances;
    - test set: 1953 utterances;
- semantic annotation: slot (sequence labelling)
- Download: https://groups.csail.mit.edu/sls/downloads
SNIPS
- single turn;
- input sentences: natural language;
- data size:
  - 7 intents, each with more than 2000 queries.
- semantic annotation: intent (sentence class), slot (sequence labelling)
- Download: https://github.com/snipsco/nlu-benchmark/tree/master/2017-06-custom-intent-engines
TOP semantic parsing
- single turn;
- input sentences: natural language;
- data size:
  - training set: 35741 queries
  - test set: 9042 queries
- semantic annotation: hierarchical intents and slots (each annotation is a tree)
  - intent number: 25
  - slot number: 36
- Download: http://fb.me/semanticparsingdialog
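Because each TOP annotation is a tree of intents and slots, it helps to parse it into a nested structure before training or evaluation. The sketch below assumes a space-separated bracketed notation in which `[IN:...` opens an intent, `[SL:...` opens a slot, and a standalone `]` closes the current node. The label names in the example are illustrative and should be checked against the downloaded files.

```python
from typing import List, Optional, Tuple


def parse_top(annotation: str) -> dict:
    """Parse a bracketed annotation into nested {"label", "children"} dicts."""
    stack: List[dict] = []
    root: Optional[dict] = None
    for piece in annotation.split():
        if piece.startswith("["):
            if not stack and root is not None:
                raise ValueError("more than one top-level node")
            node = {"label": piece[1:], "children": []}
            if stack:
                stack[-1]["children"].append(node)
            stack.append(node)
        elif piece == "]":
            if not stack:
                raise ValueError("unbalanced ']'")
            root = stack.pop()
        else:
            if not stack:
                raise ValueError("word outside any bracket")
            stack[-1]["children"].append(piece)  # plain word
    if stack or root is None:
        raise ValueError("unbalanced annotation")
    return root


def labels(node: dict, depth: int = 0) -> List[Tuple[int, str]]:
    """List (depth, label) for every intent or slot node, in pre-order."""
    found = [(depth, node["label"])]
    for child in node["children"]:
        if isinstance(child, dict):
            found.extend(labels(child, depth + 1))
    return found


# Hypothetical example in the assumed notation.
example = ("[IN:GET_DIRECTIONS directions to [SL:DESTINATION "
           "[IN:GET_EVENT the [SL:NAME_EVENT Eagles ] game ] ] ]")
for depth, label in labels(parse_top(example)):
    print("  " * depth + label)
```

The nesting of an intent inside a slot is what distinguishes this task from flat intent classification plus slot tagging.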
SMP2017-ECDT (in Chinese)
- single turn;
- input sentences: natural language;
- data size:
  - http://ir.hit.edu.cn/SMP2017-ECDT
  - training set: 2299 queries
  - development set: 770 queries
  - test set: 666 queries
- semantic annotation: intent
  - intent number: 31
- Download: https://github.com/HITlilingzhi/SMP2017ECDT-DATA
DSTC 2&3
- multiple turns: human-machine dialogues;
- input sentences:
  - transcription by human;
  - ASR output: n-best, word confusion network;
- data size:
  - DSTC 2 (Restaurant Information Domain): source domain
    - training set: about 2k dialogues;
    - test set: about 1k dialogues;
  - DSTC 3 (Tourist Information Domain): extended domain
    - seed data: about 10 dialogues;
    - test set: about 2k dialogues;
- semantic annotation: dialogue act
  - DSTC 2: 8 slots;
  - DSTC 3: 13 slots;
- Download: https://github.com/matthen/dstc
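The dialogue act notation used in this document, `act_type(slot=value,...)`, can be split into its parts with a few lines of code. This is a minimal sketch of that string notation only; it does not claim to read the on-disk format of the DSTC files, and the function name `parse_dialogue_act` is an assumption made for illustration.

```python
import re
from typing import List, Optional, Tuple

# act_type followed by a parenthesised, comma-separated list of slot[=value] items
ACT_PATTERN = re.compile(r"^\s*(\w+)\s*\((.*)\)\s*$")


def parse_dialogue_act(text: str) -> Tuple[str, List[Tuple[str, Optional[str]]]]:
    """Split 'act_type(slot=value,...)' into the act type and its slot-value pairs."""
    match = ACT_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a dialogue act: {text!r}")
    act_type, body = match.group(1), match.group(2)
    pairs: List[Tuple[str, Optional[str]]] = []
    # Simplification: values are assumed not to contain commas.
    for item in body.split(","):
        item = item.strip()
        if not item:
            continue
        slot, sep, value = item.partition("=")
        # A slot without "=" (as in request(price)) has no value.
        pairs.append((slot.strip(), value.strip() if sep else None))
    return act_type, pairs


for act in ["inform(movie_name = Titanic)", "request(price)", "hello()"]:
    print(parse_dialogue_act(act))
```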
DSTC 4
- multiple turns: human-human dialogues;
- input sentences: natural language, transcription by human;
- data size:
  - tourist-information dialogues about Singapore, collected from Skype calls;
  - 35 dialogues, totalling 31,034 utterances and 273,580 words.
- semantic annotation: speech action, slot, dialogue state (slot-value pairs) in sub-dialogue level
- Download: challenge participants only, http://www.colips.org/workshop/dstc4/
google Sim-R/Sim-M/Sim-gen
- multiple turns: conversations between an agent and a simulated user;
- input sentences: natural language;
- data size:
Dataset | Slots | Train | Dev | Test |
---|---|---|---|---|
Sim-R (Restaurant) | price_range, location, restaurant_name, category, num_people, date, time | 1116 | 349 | 775 |
Sim-M (Movie) | theatre_name, movie, date, time, num_people | 384 | 120 | 264 |
Sim-GEN (Movie) | theatre_name, movie, date, time, num_people | 100K | 10K | 10K |
- semantic annotation: slot
- Download: https://github.com/google-research-datasets/simulated-dialogue
cam MultiWOZ 2.0/2.1
- multiple turns: human-human dialogues collected with the Wizard-of-Oz (WOZ) setup;
- input sentences: natural language;
- data size: 3,406 single-domain dialogues (including booking where the domain allows it) and 7,032 multi-domain dialogues spanning 2 to 5 domains.
- semantic annotation: dialogue state (slot-value pairs)
- Download: http://dialogue.mi.eng.cam.ac.uk/index.php/corpus/
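A dialogue state made of slot-value pairs is usually accumulated turn by turn: values mentioned in the current turn overwrite earlier ones, and everything else carries over. The sketch below shows that bookkeeping under assumed conventions. The `domain-slot` key names and the `"none"` retraction marker are hypothetical and are not taken from the MultiWOZ files.

```python
from typing import Dict, List, Tuple

State = Dict[str, str]  # "domain-slot" -> value (hypothetical key format)


def update_state(state: State, turn_pairs: List[Tuple[str, str]]) -> State:
    """Return the state after one turn without modifying the previous state."""
    new_state = dict(state)
    for slot, value in turn_pairs:
        if value == "none":
            # Assumed marker meaning "the user withdrew this constraint".
            new_state.pop(slot, None)
        else:
            new_state[slot] = value
    return new_state


def track(turns: List[List[Tuple[str, str]]]) -> List[State]:
    """Produce the accumulated state after every turn of a dialogue."""
    states: List[State] = []
    state: State = {}
    for turn_pairs in turns:
        state = update_state(state, turn_pairs)
        states.append(state)
    return states


dialogue = [
    [("restaurant-food", "italian"), ("restaurant-area", "centre")],
    [("restaurant-food", "chinese")],  # the user changes their mind
    [("hotel-stars", "4")],            # a second domain enters the dialogue
]
for turn_index, state in enumerate(track(dialogue), start=1):
    print(turn_index, state)
```

A tracker trained on such data predicts the per-turn pairs; the accumulation step itself stays the same.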
maluuba Frames
- multiple turns: human-human dialogues collected with the Wizard-of-Oz (WOZ) setup;
- input sentences: natural language;
- data size:
  - domain: travel;
  - 1369 dialogues, 19986 turns;
  - http://www.aclweb.org/anthology/W17-5526
- semantic annotation: intent, dialogue act
- tasks: NLU (intent classification, slot tagging), DST (slot-value pairs)
- Download: https://datasets.maluuba.com/Frames/dl
Microsoft Dialogue Challenge
- multiple turns:
  - human-human dialogues collected via Amazon Mechanical Turk;
  - built-in user simulators are provided;
- input sentences: natural language;
- data size:
Task | Intents | Slots | Dialogues |
---|---|---|---|
Movie-Ticket Booking | 11 | 29 | 2890 |
Restaurant Reservation | 11 | 30 | 4103 |
Taxi Ordering | 11 | 29 | 3094 |
- semantic annotation: dialogue act
- tasks: NLU (slot tagging)
- Download: https://github.com/xiul-msr/e2e_dialog_challenge