zll17 / TV4Dialog

Licence: other
No description or website provided.

Projects that are alternatives of or similar to TV4Dialog

Cluecorpus2020
Large-scale Pre-training Corpus for Chinese 100G 中文预训练语料
Stars: ✭ 278 (+742.42%)
Mutual labels:  corpus, chinese
Cluedatasetsearch
Search all Chinese NLP datasets, with commonly used English NLP datasets included
Stars: ✭ 2,112 (+6300%)
Mutual labels:  corpus, chinese
Cluepretrainedmodels
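A collection of high-quality Chinese pre-trained models: state-of-the-art large models, fastest small models, and specialized similarity models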
Stars: ✭ 493 (+1393.94%)
Mutual labels:  corpus, chinese
CBLUE
中文医疗信息处理基准CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
Stars: ✭ 379 (+1048.48%)
Mutual labels:  corpus, chinese
Dialogue-Corpus
No description or website provided.
Stars: ✭ 27 (-18.18%)
Mutual labels:  dialogue, corpus
OpenDialog
An Open-Source Package for Chinese Open-domain Conversational Chatbot (中文闲聊对话系统,一键部署微信闲聊机器人)
Stars: ✭ 94 (+184.85%)
Mutual labels:  corpus, chinese
Datasets
Poetry-related datasets developed by THUAIPoet (Jiuge) group.
Stars: ✭ 111 (+236.36%)
Mutual labels:  corpus, chinese
CLUEmotionAnalysis2020
CLUE Emotion Analysis Dataset 细粒度情感分析数据集
Stars: ✭ 3 (-90.91%)
Mutual labels:  corpus, chinese
dialogue-datasets
collect the open dialog corpus and some useful data processing utils.
Stars: ✭ 24 (-27.27%)
Mutual labels:  dialogue, corpus
Weibo terminater
Final Weibo Crawler Scrap Anything From Weibo, comments, weibo contents, followers, anything. The Terminator
Stars: ✭ 2,295 (+6854.55%)
Mutual labels:  corpus, chinese
Nlp chinese corpus
大规模中文自然语言处理语料 Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+20069.7%)
Mutual labels:  corpus, chinese
Chatbot-Training-Corpus
A collection of text corpora that can be used for chatbot training practice, covering both Chinese and English
Stars: ✭ 117 (+254.55%)
Mutual labels:  dialogue, corpus
Clue
中文语言理解测评基准 Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+7248.48%)
Mutual labels:  corpus, chinese
Probabilistic-RNN-DA-Classifier
Probabilistic Dialogue Act Classification for the Switchboard Corpus using an LSTM model
Stars: ✭ 22 (-33.33%)
Mutual labels:  dialogue, corpus
open2ch-dialogue-corpus
A dialogue corpus built by crawling Open2ch (おーぷん2ちゃんねる)
Stars: ✭ 65 (+96.97%)
Mutual labels:  dialogue, corpus
djinni
Chinese documentation for djinni, plus an iOS demo written with djinni that fixes the build failure on macOS Sierra 10.12
Stars: ✭ 52 (+57.58%)
Mutual labels:  chinese
unihandecode
unihandecode is a transliteration library to convert all characters/words in Unicode into ASCII alphabet that aware with Language preference priorities
Stars: ✭ 71 (+115.15%)
Mutual labels:  chinese
thaigov-corpus
A project that collects news from the Thai government website
Stars: ✭ 19 (-42.42%)
Mutual labels:  corpus
ADEM
TOWARDS AN AUTOMATIC TURING TEST: LEARNING TO EVALUATE DIALOGUE RESPONSES
Stars: ✭ 25 (-24.24%)
Mutual labels:  dialogue
chinese-novel
📙 Chinese novel database 最全的中国古典小说数据库。
Stars: ✭ 131 (+296.97%)
Mutual labels:  chinese

TV4Dialog Corpus

By Leilan Zhang

TV4Dialog is a multi-turn Chinese-English dialogue corpus built from the scripts and subtitles of four TV series: Castle, Friends, House, and TBBT (The Big Bang Theory). The corpus is suited to dialogue generation, dialogue analysis, and machine translation.

License

TV4Dialog is one of the contributions of our paper Automatically Annotate TV Series Subtitles for Dialogue Corpus Construction.

The data in this repository is provided under the CC BY 2.0 license. Please cite the following paper if you use the data:

@inproceedings{zhang2019,
  title={Automatically Annotate TV Series Subtitles for Dialogue Corpus Construction},
  author={Leilan Zhang and Qiang Zhou},
  year={2019},
  publisher = {{APSIPA} Press},
  address={Lanzhou, Gansu, China}
}

How TV4Dialog was made

We first collected the English scripts and the Chinese-English subtitles of the four TV series from the Internet. The scripts were then parsed into XML, with elements such as scenes, speakers, and utterances extracted.
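The parsing step can be illustrated with a minimal sketch. It assumes a hypothetical raw-script convention ("[Scene: ...]" lines open a scene, "Speaker: text" lines are utterances) and hypothetical element names (script, scene, utterance); the actual raw formats and the XML schema used in xmlScript may differ.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical raw-script conventions (assumptions, not taken from the repository):
#   "[Scene: ...]" opens a new scene, "Speaker: text" is one utterance.
SCENE_RE = re.compile(r"^\[Scene:\s*(.+?)\]$")
LINE_RE = re.compile(r"^([^:]+):\s*(.+)$")


def script_to_xml(raw_text):
    """Convert a raw script into <script><scene><utterance speaker="..."> elements."""
    root = ET.Element("script")
    scene = None
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        scene_match = SCENE_RE.match(line)
        if scene_match:
            scene = ET.SubElement(root, "scene", desc=scene_match.group(1))
            continue
        line_match = LINE_RE.match(line)
        if line_match:
            if scene is None:
                # An utterance appeared before any scene marker.
                scene = ET.SubElement(root, "scene", desc="")
            utterance = ET.SubElement(
                scene, "utterance", speaker=line_match.group(1).strip()
            )
            utterance.text = line_match.group(2)
    return root


# Made-up sample input, for illustration only.
sample = "[Scene: Lab]\nAlice: Hi.\nBob: Hello there.\n"
print(ET.tostring(script_to_xml(sample), encoding="unicode"))
```

Lines that match neither pattern (stage directions, for example) are skipped in this sketch; a real parser would likely need per-series rules.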

Using the methods proposed in our paper, we aligned the utterances in the scripts to the subtitle lines and annotated the subtitles with speaker tags. Based on those tags, we merged consecutive subtitle lines belonging to the same speaker into a single utterance.
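The two steps above can be sketched roughly as follows. The alignment here is a simple stand-in based on longest-common-substring coverage from Python's difflib, not the method proposed in the paper; the function names, the threshold, and the sample data are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import groupby


def containment(line, text):
    """Fraction of `line` covered by its longest common substring with `text`."""
    if not line:
        return 0.0
    matcher = SequenceMatcher(None, line, text, autojunk=False)
    block = matcher.find_longest_match(0, len(line), 0, len(text))
    return block.size / len(line)


def tag_speakers(subtitles, script_utterances, threshold=0.6):
    """Tag each subtitle line with the speaker of the best-matching script utterance.

    subtitles: list of str; script_utterances: list of (speaker, text) pairs.
    Lines whose best score is below `threshold` are tagged with None.
    """
    tagged = []
    for line in subtitles:
        best_speaker, best_score = None, 0.0
        for speaker, text in script_utterances:
            score = containment(line.lower(), text.lower())
            if score > best_score:
                best_speaker, best_score = speaker, score
        tagged.append((best_speaker if best_score >= threshold else None, line))
    return tagged


def merge_by_speaker(tagged, sep=" "):
    """Merge consecutive lines that share a speaker tag into one utterance.

    Runs of untagged (None) lines are grouped together as well in this sketch.
    """
    return [
        (speaker, sep.join(text for _, text in group))
        for speaker, group in groupby(tagged, key=lambda item: item[0])
    ]


# Made-up sample data, for illustration only.
script = [
    ("Alice", "I have the results. We should run the test again."),
    ("Bob", "Fine, book the lab."),
]
subtitles = ["I have the results.", "We should run the test again.", "Fine, book the lab."]
for speaker, utterance in merge_by_speaker(tag_speakers(subtitles, script)):
    print(speaker, "|", utterance)
```

The `sep` parameter is there because Chinese subtitle lines would normally be joined without a space (`sep=""`). Comparing every subtitle line with every script utterance is quadratic, so a practical version would restrict the search to a window around the previous match.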

Details of TV4Dialog

This corpus contains about 260,000 utterances in both Chinese and English. It provides both the scripts (en) and the subtitles (en & zh). Basic statistics are listed below:

Directory xmlScript stores the parsed scripts of the 4 TV series in XML format (in English).

Directory withSpkr stores the subtitles annotated with speaker tags and uid tags (en & zh parallel).

Directory extracted stores the merged utterances extracted from the annotated subtitles (in Chinese).

Directory rawScript stores the raw scripts of the TV series.
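As a quick way to check a local copy against the layout above, the following sketch counts the files under each of the four directories. The clone path "TV4Dialog" and the helper name are assumptions; the file formats inside each directory are not inspected.

```python
from pathlib import Path

# Directory names and roles as described above.
CORPUS_DIRS = {
    "rawScript": "raw scripts",
    "xmlScript": "parsed scripts in XML (en)",
    "withSpkr": "subtitles with speaker and uid tags (en & zh)",
    "extracted": "merged utterances (zh)",
}


def summarize_corpus(root):
    """Return {directory name: number of files}, with None for a missing directory."""
    root = Path(root)
    summary = {}
    for name in CORPUS_DIRS:
        directory = root / name
        if directory.is_dir():
            # Count regular files at any depth below the directory.
            summary[name] = sum(1 for path in directory.rglob("*") if path.is_file())
        else:
            summary[name] = None
    return summary


if __name__ == "__main__":
    # "TV4Dialog" is an assumed path to a local clone of the repository.
    for name, count in summarize_corpus("TV4Dialog").items():
        print(f"{name:10s} {CORPUS_DIRS[name]:45s} {count}")
```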

Acknowledgments

We would like to thank the website assrt.net for providing the raw subtitle files.

We also thank the Castle, Friends, House and TBBT script websites for collecting and sharing the raw scripts.