matbahasa / TALPCo

Licence: other
TUFS Asian Language Parallel Corpus

TUFS Asian Language Parallel Corpus (TALPCo)

Introduction

The TUFS Asian Language Parallel Corpus (TALPCo) is an open parallel corpus consisting of Japanese sentences and their translations into Korean, Burmese (Myanmar; the official language of the Republic of the Union of Myanmar), Malay (the national language of Malaysia, Singapore and Brunei), Indonesian, Thai, Vietnamese and English. TALPCo is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. See the paper below for the details of TALPCo.

How to cite

Contents

  • data_jpn.txt Japanese (raw sentences)
  • data_jpn-token.txt Japanese (tokenized sentences)
  • data_jpn-IPSpkr.csv Japanese (interpersonal meaning annotation, speaker)
  • data_jpn-IPAddr.csv Japanese (interpersonal meaning annotation, addressee)
  • data_jpn-IPLex.csv Japanese (interpersonal meaning annotation, lexical)

  • data_kor.txt Korean (raw sentences)
  • data_kor-token.txt Korean (tokenized sentences)

  • data_myn.txt Burmese (raw sentences)
  • data_myn-token.txt Burmese (tokenized sentences)
  • data_myn-ps.txt Burmese (POS-tagged sentences)

  • data_zsm.txt Malay (raw sentences)
  • data_zsm-token.txt Malay (tokenized sentences)
  • data_zsm-MWE.txt Malay (multiword expression list)
  • data_zsm.jpn-zsm Malay (partial Japanese-Malay alignment)
  • data_zsm-IPSpkr.csv Malay (interpersonal meaning annotation, speaker)
  • data_zsm-IPAddr.csv Malay (interpersonal meaning annotation, addressee)
  • data_zsm-IPLex.csv Malay (interpersonal meaning annotation, lexical)

  • data_ind.txt Indonesian (raw sentences)
  • data_ind-token.txt Indonesian (tokenized sentences)
  • data_ind-MWE.txt Indonesian (multiword expression list)
  • data_ind.jpn-ind Indonesian (partial Japanese-Indonesian alignment)
  • data_ind-IPSpkr.csv Indonesian (interpersonal meaning annotation, speaker)
  • data_ind-IPAddr.csv Indonesian (interpersonal meaning annotation, addressee)
  • data_ind-IPLex.csv Indonesian (interpersonal meaning annotation, lexical)

  • data_tha.txt Thai (raw sentences)
  • data_tha-token.txt Thai (tokenized sentences)
  • data_tha.jpn-tha Thai (partial Japanese-Thai alignment)
  • data_tha-IPSpkr.csv Thai (interpersonal meaning annotation, speaker)
  • data_tha-IPAddr.csv Thai (interpersonal meaning annotation, addressee)
  • data_tha-IPLex.csv Thai (interpersonal meaning annotation, lexical)

  • data_vie.txt Vietnamese (raw sentences)
  • data_vie-token.txt Vietnamese (tokenized sentences)
  • data_vie-MWE.txt Vietnamese (multi-syllable expression list)
  • data_vie.jpn-vie Vietnamese (partial Japanese-Vietnamese alignment)
  • data_vie-IPSpkr.csv Vietnamese (interpersonal meaning annotation, speaker)
  • data_vie-IPAddr.csv Vietnamese (interpersonal meaning annotation, addressee)
  • data_vie-IPLex.csv Vietnamese (interpersonal meaning annotation, lexical)

  • data_eng.txt English (raw sentences)

  • readme.md (this document)

Format

All files are encoded in UTF-8 and use DOS-style (CRLF) line endings.

Raw sentences

Sentence_ID [TAB] Sentence

1176	田中さんは 学生では ありません。
1176	Mr. Tanaka is not a student.
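The raw-sentence format can be read with a few lines of Python. The following is a minimal sketch and not part of TALPCo: the function name `parse_raw` is hypothetical, and it assumes that every non-empty line holds one ID, one tab and one sentence. IDs are kept as strings so that nothing is assumed about their numeric form.

```python
def parse_raw(text):
    """Map Sentence_ID to sentence for 'Sentence_ID<TAB>Sentence' lines."""
    sentences = {}
    for line in text.splitlines():  # splitlines() also copes with CRLF endings
        if not line.strip():
            continue  # skip blank lines
        sentence_id, sentence = line.split("\t", 1)
        sentences[sentence_id] = sentence
    return sentences


# The two example lines above, as they might appear in data_jpn.txt and data_eng.txt.
jpn = parse_raw("1176\t田中さんは 学生では ありません。\r\n")
eng = parse_raw("1176\tMr. Tanaka is not a student.\r\n")

# Pair the translations through the shared Sentence_ID.
pairs = {i: (jpn[i], eng[i]) for i in jpn.keys() & eng.keys()}
print(pairs["1176"])
```

To read a real file, pass the result of `open(path, encoding="utf-8").read()`. Whether the files start with a byte-order mark is not stated here; `encoding="utf-8-sig"` accepts both cases.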

Tokenized sentences

Sentence_ID [LINEBREAK] token [LINEBREAK] token [LINEBREAK] <EOS>

3627
Buku
ini
mempunyai
se-
ratus
dua
puluh
muka surat
.
<EOS>
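As an illustration, the one-item-per-line layout can be turned back into token lists as sketched below. The function name `parse_tokenized` is hypothetical, and the sketch assumes that the first line of each block is the Sentence_ID and that every block is closed by `<EOS>`.

```python
def parse_tokenized(text):
    """Yield (sentence_id, tokens) for each block terminated by <EOS>."""
    block = []
    for line in text.splitlines():
        line = line.strip()  # removes only outer whitespace; 'muka surat' stays intact
        if not line:
            continue
        if line == "<EOS>":
            if block:
                yield block[0], block[1:]  # first line is the ID, the rest are tokens
            block = []
        else:
            block.append(line)


# The example block above.
sample = "3627\nBuku\nini\nmempunyai\nse-\nratus\ndua\npuluh\nmuka surat\n.\n<EOS>\n"
sentences = list(parse_tokenized(sample))
sentence_id, tokens = sentences[0]
print(sentence_id, len(tokens), tokens)
```

Because a token may contain a space (here the multiword expression muka surat), tokens are separated by line breaks only and must not be split on whitespace.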

Burmese POS-tagged sentences

Sentence_ID [TAB] Sentence

  • White space: Phrasal boundary

  • Dash: Morpheme boundary

1176	n-pr-postp n pref-v-suf-suf
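The two boundary markers suggest a two-level split: white space first, then dashes. The sketch below is an illustration only. The function name `parse_pos` is hypothetical, the ID is assumed to be separated from the sentence by a tab as the format line states, and each dash-separated item is treated as an opaque label, since the tag inventory is not described here.

```python
def parse_pos(line):
    """Split a POS-tagged line into phrases, each a list of morpheme tags."""
    sentence_id, tagged = line.rstrip("\r\n").split("\t", 1)
    # White space marks phrasal boundaries, dashes mark morpheme boundaries.
    phrases = [phrase.split("-") for phrase in tagged.split()]
    return sentence_id, phrases


sentence_id, phrases = parse_pos("1176\tn-pr-postp n pref-v-suf-suf\r\n")
print(sentence_id, phrases)
```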

Alignment

Sentence_ID [TAB] Japanese_token_index-target_language_token_index

1176	0-1 1-0 3-3 8-2 9-4
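An alignment line can be parsed into index pairs as sketched below. The function name `parse_alignment` is hypothetical. The indices in the example start at 0, which suggests zero-based positions in the corresponding tokenized sentences; that reading is an inference, not something the format line states.

```python
def parse_alignment(line):
    """Return (sentence_id, [(japanese_index, target_index), ...])."""
    sentence_id, links = line.rstrip("\r\n").split("\t", 1)
    pairs = []
    for link in links.split():
        japanese_index, target_index = link.split("-")
        pairs.append((int(japanese_index), int(target_index)))
    return sentence_id, pairs


sentence_id, pairs = parse_alignment("1176\t0-1 1-0 3-3 8-2 9-4\r\n")
print(pairs)  # [(0, 1), (1, 0), (3, 3), (8, 2), (9, 4)]

# The alignment is partial: Japanese tokens without a link are simply absent.
aligned_japanese = {j for j, _ in pairs}
```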

Interpersonal meaning annotation

See the second paper above and its supplement for the details of the interpersonal meaning feature system.

Speaker, Addressee

Sentence_ID, Gender, Marital status, Honour, Age, Social status, Role, Group, Formality, Number

3243,female,,,,neutral,,,,sg
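The speaker and addressee files can be read with Python's standard `csv` module, as in the following sketch. The names `FIELDS` and `parse_ip` are hypothetical, and the sketch assumes that the files contain no header row, so the field names listed above are supplied by hand.

```python
import csv
import io

# Field names taken from the format line above.
FIELDS = ["Sentence_ID", "Gender", "Marital status", "Honour", "Age",
          "Social status", "Role", "Group", "Formality", "Number"]


def parse_ip(text):
    """Return one dict per row, keeping only the features that are filled in."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS)
    return [{name: value for name, value in row.items() if value} for row in reader]


rows = parse_ip("3243,female,,,,neutral,,,,sg\r\n")
print(rows[0])
```

In the example row, the sixth field is filled, so `neutral` is the value of Social status, and `sg` is the value of Number.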

Lexical

Token_index, token, Gender, Marital status, Honour, Age, Social status, Role, Group, Formality, Number

3845,,,,,,,,,,
0,Cô,female,,,elder.parents_younger_sibling,,parents_sibling.paternal,,,
1,tôi,,,,,neutral,,,,sg
2,làm việc,,,,,,,,,
3,ở,,,,,,,,,
4,cửa hàng,,,,,,,,,
5,hoa,,,,,,,,,
6,.,,,,,,,,,
<EOS>,,,,,,,,,,
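The lexical files combine the block structure of the tokenized files with the feature columns of the speaker and addressee files. The sketch below is one possible reading of that layout. The names `FEATURES` and `parse_iplex` are hypothetical, and it assumes that each block opens with a row carrying only the Sentence_ID, closes with an `<EOS>` row, and that every row has all 11 fields.

```python
import csv
import io

# Feature columns that follow Token_index and token in the format line above.
FEATURES = ["Gender", "Marital status", "Honour", "Age", "Social status",
            "Role", "Group", "Formality", "Number"]


def parse_iplex(text):
    """Map Sentence_ID to a list of (token_index, token, features) tuples."""
    sentences = {}
    current = None  # token list of the block being read, or None between blocks
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        first, token, *values = row
        if first == "<EOS>":
            current = None
        elif current is None:
            current = sentences.setdefault(first, [])  # row with the Sentence_ID
        else:
            features = {n: v for n, v in zip(FEATURES, values) if v}
            current.append((int(first), token, features))
    return sentences


# An abridged version of the example above.
sample = (
    "3845,,,,,,,,,,\n"
    "0,Cô,female,,,elder.parents_younger_sibling,,parents_sibling.paternal,,,\n"
    "1,tôi,,,,,neutral,,,,sg\n"
    "2,làm việc,,,,,,,,,\n"
    "<EOS>,,,,,,,,,,\n"
)
tokens = parse_iplex(sample)["3845"]
print(tokens[0])
```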

Notes on tokenization

Malay/Indonesian

The Malay and Indonesian sentences were tokenized manually by Hiroki Nomoto and David Moeljadi, respectively. All clitics (i.e. -nya, -lah, -kah) were tokenized. In addition, instances of the prefix se- were tokenized when they are cardinal numerals (e.g. se- ratus 'one hundred' in the tokenized example above). Note that the suffix -nya, as distinct from the clitic -nya, and the non-numeral instances of se- were not tokenized. The following dictionaries were consulted when it was not immediately obvious whether a word sequence constituted a multiword expression.

Thai

The sentences were tokenized using the tokenize function of Deepcut and then checked by Sunisa Wittayapanyanon and Yuka Sato. The principle adopted for the manual correction is:

  • Tokenize a sequence consisting of two or more syllables if and only if all constituent syllables have a meaning that contributes to the meaning of the whole phrase/sentence.
    • Do not tokenize a sequence if it contains a meaningless syllable.
    • Do not tokenize a sequence if tokenizing it will change the meaning of the whole phrase/sentence.

Vietnamese

The sentences were tokenized using the word_tokenize function of the Undersea - Vietnamese NLP Project and then checked by Junta Nomura and Hiroki Nomoto. The following dictionary was consulted when it was not immediately obvious whether a syllable sequence constituted a multi-syllable expression.

  • Hoàng, Phê, ed. 2003. Từ Điển Tiếng Việt. Đà Nẵng: Nhà Xuất Bản Đà Nẵng.