longyuewangdcu / tvsub

License: other
TVsub: DCU-Tencent Chinese-English Dialogue Corpus

Projects that are alternatives of or similar to tvsub

Cluedatasetsearch
Search engine for all Chinese NLP datasets, with commonly used English NLP datasets included
Stars: ✭ 2,112 (+5180%)
Mutual labels:  machine-translation, corpus
BSD
The Business Scene Dialogue corpus
Stars: ✭ 51 (+27.5%)
Mutual labels:  machine-translation, corpus
Bleualign
Machine-Translation-based sentence alignment tool for parallel text
Stars: ✭ 199 (+397.5%)
Mutual labels:  machine-translation
DANeS
DANeS is an open-source E-newspaper dataset by collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)
Stars: ✭ 64 (+60%)
Mutual labels:  corpus
megs
A merged version of multiple open-source German speech datasets.
Stars: ✭ 21 (-47.5%)
Mutual labels:  corpus
Hardware Aware Transformers
[ACL 2020] HAT: Hardware-Aware Transformers for Efficient Natural Language Processing
Stars: ✭ 206 (+415%)
Mutual labels:  machine-translation
sb-nmt
Code for Synchronous Bidirectional Neural Machine Translation (SB-NMT)
Stars: ✭ 66 (+65%)
Mutual labels:  machine-translation
Texar
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/
Stars: ✭ 2,236 (+5490%)
Mutual labels:  machine-translation
Speech-Corpus-Collection
A Collection of Speech Corpus for ASR and TTS
Stars: ✭ 113 (+182.5%)
Mutual labels:  corpus
transformer
Build English-Vietnamese machine translation with ProtonX Transformer. :D
Stars: ✭ 41 (+2.5%)
Mutual labels:  machine-translation
Probabilistic-RNN-DA-Classifier
Probabilistic Dialogue Act Classification for the Switchboard Corpus using an LSTM model
Stars: ✭ 22 (-45%)
Mutual labels:  corpus
Dialogue-Corpus
No description or website provided.
Stars: ✭ 27 (-32.5%)
Mutual labels:  corpus
Opennmt
Open Source Neural Machine Translation in Torch (deprecated)
Stars: ✭ 2,339 (+5747.5%)
Mutual labels:  machine-translation
german-nouns
A list of ~100,000 German nouns and their grammatical properties compiled from WiktionaryDE as CSV file. Plus a module to look up the data and parse compound words.
Stars: ✭ 101 (+152.5%)
Mutual labels:  corpus
Attention Mechanisms
Implementations for a family of attention mechanisms, suitable for all kinds of natural language processing tasks and compatible with TensorFlow 2.0 and Keras.
Stars: ✭ 203 (+407.5%)
Mutual labels:  machine-translation
osdg-tool
OSDG is an open-source tool that maps and connects activities to the UN Sustainable Development Goals (SDGs) by identifying SDG-relevant content in any text. The tool is available online at www.osdg.ai. API access available for research purposes.
Stars: ✭ 22 (-45%)
Mutual labels:  machine-translation
Lingvo
Lingvo
Stars: ✭ 2,361 (+5802.5%)
Mutual labels:  machine-translation
ibleu
A visual and interactive scoring environment for machine translation systems.
Stars: ✭ 27 (-32.5%)
Mutual labels:  machine-translation
apertium-apy
📦 Apertium HTTP Server in Python
Stars: ✭ 29 (-27.5%)
Mutual labels:  machine-translation
bergamot-translator
Cross platform C++ library focusing on optimized machine translation on the consumer-grade device.
Stars: ✭ 181 (+352.5%)
Mutual labels:  machine-translation

TVsub: DCU-Tencent Chinese-English Dialogue Corpus

The data are used in our AAAI-18 paper Translating Pro-Drop Languages with Reconstruction Models.

The corpus is designed as dialogue-domain parallel data with larger-context information for research purposes. More than two million sentence pairs were extracted from the subtitles of television episodes.

Within the corpus, sentences are generally short, and the Chinese side contains many examples of dropped pronouns (DPs). The corpus was therefore initially designed for the pro-drop language translation task, and the related paper (Translating Pro-Drop Languages with Reconstruction Models) was accepted at the AAAI 2018 conference.

The corpus can also be used for various other translation tasks, such as larger-context MT (Exploiting Cross-Sentence Context for Neural Machine Translation; Learning to Remember Translation History with a Continuous Cache).

Novelty

The differences from other existing bilingual subtitle corpora are as follows:

  • We only extract subtitles from television episodes rather than movies. The vocabulary in movies is sparser than that in TV series, so to avoid long-tail problems we use TV series data for MT tasks.

  • We pre-processed the extracted data with a number of in-house scripts, including sentence boundary detection and bilingual sentence alignment. This yields a cleaner, better-aligned, high-quality corpus.

  • We keep the larger-context information instead of shuffling sentences, so you can mine useful discourse information from the preceding or following sentences for MT.

  • We randomly selected two complete television episodes as the tuning set and another two episodes as the test set, and manually created multiple references for them.

  • To allow re-implementation of our AAAI-18 paper (Translating Pro-Drop Languages with Reconstruction Models), we also released the +DP corpus, in which the Chinese sentences are automatically labelled with DPs using alignment information.
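Because sentence order is preserved, consecutive subtitle lines can be paired with their preceding sentences as discourse context. A minimal sketch of this idea (the inline sentences are toy stand-ins; this assumes nothing about the release's actual file format):

```python
# Sketch: attach up to `window` preceding sentences to each sentence,
# exploiting the fact that the corpus keeps sentences in original order.
def with_context(sentences, window=1):
    pairs = []
    for i, sent in enumerate(sentences):
        context = sentences[max(0, i - window):i]  # previous sentences, in order
        pairs.append((context, sent))
    return pairs

# Toy dialogue (stand-in for real corpus lines):
dialogue = ["你 去 哪里 ?", "回 家 。", "等等 我 !"]
for ctx, sent in with_context(dialogue):
    print(ctx, "=>", sent)
```

A larger `window` yields longer context histories, as used in larger-context MT models.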

Getting Started

Please clone the repo, as we may release updated versions of the data in the future.

git clone https://github.com/longyuewangdcu/tvsub.git

The folder structure is as follows:

++ tvsub (root)
++++ data
++++++ original corpus
++++++++ train
++++++++ dev
++++++++ test
++++++ preprocessed corpus
++++++++ train
++++++++ dev
++++++++ test
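Parallel MT corpora of this kind are commonly stored as line-aligned source/target text files. A minimal loading sketch, where the file names are assumptions for illustration rather than the actual release layout:

```python
# Sketch: read a line-aligned parallel corpus into (source, target) pairs.
# The file names used below are hypothetical; adjust them to the actual
# files found under the data/ directory after cloning.
def load_parallel(src_path, tgt_path):
    with open(src_path, encoding="utf-8") as fs, \
         open(tgt_path, encoding="utf-8") as ft:
        # zip stops at the shorter file; aligned corpora have equal lengths
        return [(s.strip(), t.strip()) for s, t in zip(fs, ft)]

# Example (hypothetical paths):
# pairs = load_parallel("data/preprocessed corpus/train/train.zh",
#                       "data/preprocessed corpus/train/train.en")
```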

Data Details

The following table lists the statistics of the corpus.

(data_details: corpus statistics image, available in the repository)

Authors

Publications

If you use the data, please cite the following paper:

Longyue Wang, Zhaopeng Tu, Shuming Shi, Tong Zhang, Yvette Graham, Qun Liu. (2018). "Translating Pro-Drop Languages with Reconstruction Models", Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI 2018).

@inproceedings{wang2018aaai,
  title={Translating Pro-Drop Languages with Reconstruction Models},
  author={Wang, Longyue and Tu, Zhaopeng and Shi, Shuming and Zhang, Tong and Graham, Yvette and Liu, Qun},
  year={2018},
  publisher = {{AAAI} Press},
  booktitle={Proceedings of the Thirty-Second {AAAI} Conference on Artificial Intelligence},
  address={New Orleans, Louisiana, USA},
  pages={1--9}
}

The data were crawled from the subtitle websites http://assrt.net and http://www.zimuzu.tv. If you use the TVsub corpus, please add these links (http://www.zimuzu.tv and http://assrt.net) to your website and publications.

License

The data may only be used for research purposes.

Please read the License Agreement before using the data.

Acknowledgments

The released data is part of the contribution of our AAAI-18 paper.

The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. Work was done when Longyue Wang was interning at Tencent AI Lab.
