tsuruoka-lab / BSD

The Business Scene Dialogue corpus

Updates

November 10, 2021: Further fix for the speaker information.
November 2, 2021: The data were updated to fix incorrect speaker information and some misspellings in the conversation text.

Corpus Description

The Japanese-English business conversation corpus, the Business Scene Dialogue (BSD) corpus, was constructed in three steps: 1) selecting business scenes, 2) writing monolingual conversation scenarios for the selected scenes, and 3) translating the scenarios into the other language. Half of the monolingual scenarios were written in Japanese and the other half in English. To ensure that the conversations are natural, the whole construction process was supervised by a person who satisfies the following conditions:

has experience with language learning programs, especially for business conversation
can communicate smoothly with others in various business scenes in both Japanese and English
has experience working in business

We provide balanced training, development and evaluation splits of the BSD corpus. The documents in these sets are balanced in terms of scenes and original languages. This repository publicly shares the full development and evaluation sets and part of the training set.

	Training	Development	Evaluation
Sentences	20,000	2,051	2,120
Scenarios	670	69	69

Corpus Statistics

Data Set	Scene	Scenarios (JA-EN)	Sentences (JA-EN)	Scenarios (EN-JA)	Sentences (EN-JA)
Training	Face-to-face	122	3525	103	2986
	Phone call	68	1944	75	2175
	General chatting	61	1915	72	1883
	Meeting	56	1964	58	1787
	Training	12	562	19	463
	Presentation	6	607	18	189
	Total	325	10,000	345	10,000
Development	Face-to-face	11	319	12	314
	Phone call	6	176	7	185
	General chatting	7	223	8	248
	Meeting	7	240	7	219
	Training	1	40	1	23
	Presentation	1	31	1	33
	Total	34	997	35	1054
Evaluation	Face-to-face	12	381	11	345
	Phone call	6	163	7	212
	General chatting	7	211	8	212
	Meeting	7	228	7	229
	Training	1	38	1	30
	Presentation	1	31	1	40
	Total	34	1052	35	1068
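
Counts of this kind can be recomputed from the JSON files. The following is a minimal sketch, not the authors' script: it tallies scenarios and sentence pairs per scene and original language, using the `tag`, `original_language` and `conversation` fields described under Corpus Structure. The inline sample data and the `tally` helper are hypothetical, and it is an assumption that the JA-EN and EN-JA columns correspond to `original_language` values of `ja` and `en`.

```python
import json
from collections import Counter

# Hypothetical inline sample in the corpus layout. A real run would load one of
# the released JSON files with json.load instead (no file name is assumed here).
SAMPLE = """
[
  {"id": "doc-a", "tag": "training", "title": "t1", "original_language": "en",
   "conversation": [
     {"no": 1, "en_sentence": "Hello.", "ja_sentence": "こんにちは。"},
     {"no": 2, "en_sentence": "Thanks.", "ja_sentence": "ありがとう。"}
   ]},
  {"id": "doc-b", "tag": "meeting", "title": "t2", "original_language": "ja",
   "conversation": [
     {"no": 1, "en_sentence": "Let's begin.", "ja_sentence": "始めましょう。"}
   ]}
]
"""


def tally(documents):
    """Count scenarios and sentence pairs per (scene tag, original language)."""
    scenarios = Counter()
    sentences = Counter()
    for doc in documents:
        key = (doc["tag"], doc["original_language"])
        scenarios[key] += 1                          # one document = one scenario
        sentences[key] += len(doc["conversation"])   # one turn = one sentence pair
    return scenarios, sentences


documents = json.loads(SAMPLE)
scenarios, sentences = tally(documents)
for key in sorted(scenarios):
    print(key, scenarios[key], sentences[key])
```

Keying the counters on the `(tag, original_language)` pair mirrors the table layout, with one row per scene and one column group per original language.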

Corpus Structure

The corpus is in JSON format and consists of documents, each of which consists of sentence pairs. Each document has an ID (id), the scene of the scenario (tag), the title of the scenario (title), and the original language (original_language). Each sentence pair has a sentence number, the speaker name in English and Japanese, and the text in English and Japanese.

[
    {
        "id": "190315_E001_17",
        "tag": "training",
        "title": "Training: How to do research",
        "original_language": "en",
        "conversation": [
            {
                "no": 1,
                "en_speaker": "Mr. Ben Sherman",
                "ja_speaker": "ベン シャーマンさん",
                "en_sentence": "I will be teaching you how to conduct research today.",
                "ja_sentence": "今日は調査の進め方についてトレーニングします。"
            },
            ...
        ]
    },
    ...
]
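
As an illustration of how this structure might be consumed, the sketch below turns documents into (source, target) sentence pairs, treating each document's original language as the source side. The `sentence_pairs` helper is hypothetical, the sample reuses the single turn shown above, and it is an assumption that `original_language` only takes the values `en` and `ja`.

```python
import json

# One document copied from the structure example above, with the elided
# parts ("...") left out so that the string is valid JSON.
SAMPLE = """
[
  {
    "id": "190315_E001_17",
    "tag": "training",
    "title": "Training: How to do research",
    "original_language": "en",
    "conversation": [
      {
        "no": 1,
        "en_speaker": "Mr. Ben Sherman",
        "ja_speaker": "ベン シャーマンさん",
        "en_sentence": "I will be teaching you how to conduct research today.",
        "ja_sentence": "今日は調査の進め方についてトレーニングします。"
      }
    ]
  }
]
"""


def sentence_pairs(documents):
    """Yield (source, target, document id), with the original language as source."""
    for doc in documents:
        src = doc["original_language"]
        # Assumption: only "en" and "ja" occur as original languages.
        tgt = "ja" if src == "en" else "en"
        for turn in doc["conversation"]:
            yield turn[f"{src}_sentence"], turn[f"{tgt}_sentence"], doc["id"]


pairs = list(sentence_pairs(json.loads(SAMPLE)))
for source, target, doc_id in pairs:
    print(doc_id, source, target, sep="\t")
```

Keeping the document ID alongside each pair preserves the document boundaries, which matters for context-aware translation experiments on dialogue data.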

License

Our dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license.

Reference

If you use this dataset, please cite the following paper: Matīss Rikters, Ryokan Ri, Tong Li, and Toshiaki Nakazawa (2019). "Designing the Business Conversation Corpus." In Proceedings of the 6th Workshop on Asian Translation, 2019.

@inproceedings{rikters-etal-2019-designing,
    title = "Designing the Business Conversation Corpus",
    author = "Rikters, Mat{\=\i}ss  and
      Ri, Ryokan  and
      Li, Tong  and
      Nakazawa, Toshiaki",
    booktitle = "Proceedings of the 6th Workshop on Asian Translation",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-5204",
    doi = "10.18653/v1/D19-5204",
    pages = "54--61"
}

Acknowledgements

This work was supported by "Research and Development of Deep Learning Technology for Advanced Multilingual Speech Translation", the Commissioned Research of National Institute of Information and Communications Technology (NICT), JAPAN.
