longyuewangdcu / tvsub

License: other
TVsub: DCU-Tencent Chinese-English Dialogue Corpus

Projects that are alternatives of or similar to tvsub

Cluedatasetsearch
Search engine for all Chinese NLP datasets, with commonly used English NLP datasets included
Stars: ✭ 2,112 (+5180%)
Mutual labels:  machine-translation, corpus
BSD
The Business Scene Dialogue corpus
Stars: ✭ 51 (+27.5%)
Mutual labels:  machine-translation, corpus
Bleualign
Machine-Translation-based sentence alignment tool for parallel text
Stars: ✭ 199 (+397.5%)
Mutual labels:  machine-translation
DANeS
DANeS is an open-source E-newspaper dataset by collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)
Stars: ✭ 64 (+60%)
Mutual labels:  corpus
megs
A merged version of multiple open-source German speech datasets.
Stars: ✭ 21 (-47.5%)
Mutual labels:  corpus
Hardware Aware Transformers
[ACL 2020] HAT: Hardware-Aware Transformers for Efficient Natural Language Processing
Stars: ✭ 206 (+415%)
Mutual labels:  machine-translation
sb-nmt
Code for Synchronous Bidirectional Neural Machine Translation (SB-NMT)
Stars: ✭ 66 (+65%)
Mutual labels:  machine-translation
Texar
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/
Stars: ✭ 2,236 (+5490%)
Mutual labels:  machine-translation
Speech-Corpus-Collection
A Collection of Speech Corpus for ASR and TTS
Stars: ✭ 113 (+182.5%)
Mutual labels:  corpus
transformer
Build English-Vietnamese machine translation with ProtonX Transformer. :D
Stars: ✭ 41 (+2.5%)
Mutual labels:  machine-translation
Probabilistic-RNN-DA-Classifier
Probabilistic Dialogue Act Classification for the Switchboard Corpus using an LSTM model
Stars: ✭ 22 (-45%)
Mutual labels:  corpus
Dialogue-Corpus
No description or website provided.
Stars: ✭ 27 (-32.5%)
Mutual labels:  corpus
Opennmt
Open Source Neural Machine Translation in Torch (deprecated)
Stars: ✭ 2,339 (+5747.5%)
Mutual labels:  machine-translation
german-nouns
A list of ~100,000 German nouns and their grammatical properties compiled from WiktionaryDE as CSV file. Plus a module to look up the data and parse compound words.
Stars: ✭ 101 (+152.5%)
Mutual labels:  corpus
Attention Mechanisms
Implementations for a family of attention mechanisms, suitable for all kinds of natural language processing tasks and compatible with TensorFlow 2.0 and Keras.
Stars: ✭ 203 (+407.5%)
Mutual labels:  machine-translation
osdg-tool
OSDG is an open-source tool that maps and connects activities to the UN Sustainable Development Goals (SDGs) by identifying SDG-relevant content in any text. The tool is available online at www.osdg.ai. API access available for research purposes.
Stars: ✭ 22 (-45%)
Mutual labels:  machine-translation
Lingvo
Lingvo
Stars: ✭ 2,361 (+5802.5%)
Mutual labels:  machine-translation
ibleu
A visual and interactive scoring environment for machine translation systems.
Stars: ✭ 27 (-32.5%)
Mutual labels:  machine-translation
apertium-apy
📦 Apertium HTTP Server in Python
Stars: ✭ 29 (-27.5%)
Mutual labels:  machine-translation
bergamot-translator
Cross platform C++ library focusing on optimized machine translation on the consumer-grade device.
Stars: ✭ 181 (+352.5%)
Mutual labels:  machine-translation

TVsub: DCU-Tencent Chinese-English Dialogue Corpus

The data are used in our AAAI-18 paper Translating Pro-Drop Languages with Reconstruction Models.

The corpus is designed as dialogue-domain parallel data with larger-context information for research purposes. More than two million sentence pairs were extracted from the subtitles of television episodes.

Within the corpus, sentences are generally short, and the Chinese side contains many examples of dropped pronouns (DPs). The corpus was therefore initially designed for the pro-drop language translation task, and the related paper (Translating Pro-Drop Languages with Reconstruction Models) was accepted at the AAAI 2018 conference.

The corpus can also be used for various other translation tasks, such as larger-context MT (Exploiting Cross-Sentence Context for Neural Machine Translation; Learning to Remember Translation History with a Continuous Cache).

Novelty

The differences from other existing bilingual subtitle corpora are as follows:

  • We only extract subtitles from television episodes rather than movies. The vocabulary in movies is sparser than that in TV series, so to avoid long-tail problems we use TV series data for MT tasks.

  • We pre-processed the extracted data with a number of in-house scripts, including sentence boundary detection and bilingual sentence alignment. This yields a cleaner, better-aligned, high-quality corpus.

  • We keep the larger-context information instead of shuffling sentences, so you can mine useful discourse information from the preceding or following sentences for MT.

  • We randomly selected two complete television episodes as the tuning set and another two episodes as the test set, and manually created multiple references for them.

  • To allow re-implementation of our AAAI-18 paper (Translating Pro-Drop Languages with Reconstruction Models), we also released the +DP corpus, in which the Chinese sentences are automatically labelled with DPs using alignment information.
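Because sentence order is preserved, consecutive subtitle lines can be paired with their preceding sentences as discourse context. A minimal sketch of this idea (the inline sentences are toy stand-ins; this assumes nothing about the release's actual file format):

```python
# Sketch: attach up to `window` preceding sentences to each sentence,
# exploiting the fact that the corpus keeps sentences in original order.
def with_context(sentences, window=1):
    pairs = []
    for i, sent in enumerate(sentences):
        context = sentences[max(0, i - window):i]  # previous sentences, in order
        pairs.append((context, sent))
    return pairs

# Toy dialogue (stand-in for real corpus lines):
dialogue = ["你 去 哪里 ?", "回 家 。", "等等 我 !"]
for ctx, sent in with_context(dialogue):
    print(ctx, "=>", sent)
```

A larger `window` yields longer context histories, as used in larger-context MT models.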

Getting Started

Please clone the repo, as we may release updated versions of the data in the future.

git clone https://github.com/longyuewangdcu/tvsub.git

The folder structure is as follows:

++ tvsub (root)
++++ data
++++++ original corpus
++++++++ train
++++++++ dev
++++++++ test
++++++ preprocessed corpus
++++++++ train
++++++++ dev
++++++++ test
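Parallel MT corpora of this kind are commonly stored as line-aligned source/target text files. A minimal loading sketch, where the file names are assumptions for illustration rather than the actual release layout:

```python
# Sketch: read a line-aligned parallel corpus into (source, target) pairs.
# The file names used below are hypothetical; adjust them to the actual
# files found under the data/ directory after cloning.
def load_parallel(src_path, tgt_path):
    with open(src_path, encoding="utf-8") as fs, \
         open(tgt_path, encoding="utf-8") as ft:
        # zip stops at the shorter file; aligned corpora have equal lengths
        return [(s.strip(), t.strip()) for s, t in zip(fs, ft)]

# Example (hypothetical paths):
# pairs = load_parallel("data/preprocessed corpus/train/train.zh",
#                       "data/preprocessed corpus/train/train.en")
```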

Data Details

The following table lists the statistics of the corpus.

(data_details: corpus statistics image, available in the repository)

Authors

Publications

If you use the data, please cite the following paper:

Longyue Wang, Zhaopeng Tu, Shuming Shi, Tong Zhang, Yvette Graham, Qun Liu. (2018). "Translating Pro-Drop Languages with Reconstruction Models", Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI 2018).

@inproceedings{wang2018aaai,
  title={Translating Pro-Drop Languages with Reconstruction Models},
  author={Wang, Longyue and Tu, Zhaopeng and Shi, Shuming and Zhang, Tong and Graham, Yvette and Liu, Qun},
  year={2018},
  publisher = {{AAAI} Press},
  booktitle={Proceedings of the Thirty-Second {AAAI} Conference on Artificial Intelligence},
  address={New Orleans, Louisiana, USA},
  pages={1--9}
}

The data were crawled from the subtitle websites http://assrt.net and http://www.zimuzu.tv. If you use the TVsub corpus, please add these links (http://www.zimuzu.tv and http://assrt.net) to your website and publications.

License

The data may only be used for research purposes.

Please read the License Agreement before using the data.

Acknowledgments

The released data is part of the contribution of our AAAI-18 paper.

The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. Work was done when Longyue Wang was interning at Tencent AI Lab.
