wmt-en2wubi

Code and data for Character-level Chinese-English Translation through ASCII Encoding.

Cite the paper

@InProceedings{en2wubi,
  author    = {Nikolov, Nikola and Hu, Yuhuang and Tan, Mi Xue and Hahnloser, Richard H.R.},
  title     = {Character-level Chinese-English Translation through ASCII Encoding},
  booktitle = {Proceedings of the Third Conference on Machine Translation},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
  pages     = {10--16},
  url       = {http://www.aclweb.org/anthology/W18-64002}
}

Training/Evaluation Data and Results

The data used to produce the paper, together with the model results, is available here.

Converting Chinese to Wubi

To convert your data from Chinese to Wubi, follow the instructions in the en2wubi package.
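
For illustration, here is a minimal sketch of such a conversion, assuming a two-column, tab-separated character-to-Wubi mapping file; the file name cn2wubi.tsv and the helper functions are hypothetical, and the en2wubi package remains the reference implementation.

# Minimal sketch of character-level Chinese -> Wubi conversion.
# Assumes a tab-separated mapping file (one "character<TAB>wubi code"
# pair per line); cn2wubi.tsv is a hypothetical file name.

def load_mapping(path):
    """Load a {Chinese character: Wubi code} dictionary from a TSV file."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            char, code = line.rstrip("\n").split("\t")
            mapping[char] = code
    return mapping

def cn_to_wubi(sentence, mapping, sep=" "):
    """Replace every mapped Chinese character with its Wubi code.

    Unmapped symbols (Latin letters, digits, punctuation) pass through unchanged.
    """
    return sep.join(mapping.get(ch, ch) for ch in sentence)

if __name__ == "__main__":
    table = load_mapping("cn2wubi.tsv")
    print(cn_to_wubi("你好，世界", table))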

Instructions for reproducing the results

Word- and subword-level

Follow the instructions in the Fairseq library for preprocessing, training and evaluation. To train the same LSTM model that we use in the paper, pass --arch lstm to train.py; for the FConv model, pass --arch fconv_iwslt_de_en.
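
As a rough illustration, the calls below sketch the preprocessing and training invocations, assuming the legacy preprocess.py / train.py entry points; the data paths, the language codes (en for English, wb for Wubi) and the save directory are placeholders, and newer Fairseq releases expose the same steps as the fairseq-preprocess and fairseq-train console commands.

# Illustrative Fairseq invocations; all paths and names are placeholders.
import subprocess

# Binarize the parallel data (English -> Wubi).
subprocess.run([
    "python", "preprocess.py",
    "--source-lang", "en", "--target-lang", "wb",
    "--trainpref", "data/train", "--validpref", "data/valid",
    "--testpref", "data/test",
    "--destdir", "data-bin/en2wb",
], check=True)

# Train the LSTM model; use --arch fconv_iwslt_de_en for the FConv model.
subprocess.run([
    "python", "train.py", "data-bin/en2wb",
    "--arch", "lstm",
    "--save-dir", "checkpoints/en2wb_lstm",
], check=True)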

On the subword level, you additionally need to learn and apply subword segmentation rules to the dataset. We use the subword-nmt library for subword segmentation.
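
As a sketch of that step, the snippet below learns BPE merge operations on the English side of the training data and applies them to the same file, using the Python API of subword-nmt; the number of merge operations and all file names are placeholders, and the equivalent subword-nmt learn-bpe / apply-bpe command-line tools can be used instead.

# Learn BPE codes on the training text and segment a split with them.
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn the merge operations from the training corpus.
with open("data/train.en", encoding="utf-8") as infile, \
        open("codes.en", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=10000)

# Apply the learned codes to every split before Fairseq preprocessing.
with open("codes.en", encoding="utf-8") as codes:
    bpe = BPE(codes)

with open("data/train.en", encoding="utf-8") as inp, \
        open("data/train.bpe.en", "w", encoding="utf-8") as out:
    for line in inp:
        out.write(bpe.process_line(line))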

Character-level

Follow the instructions in this repository for preprocessing, then train a bilingual char2char model using char2char/train_bi_char2char.py.

Evaluation

To compute BLEU, download multi-bleu.perl and run it as:

perl multi-bleu.perl reference.txt < model_output.txt

When evaluating en2wb against en2cn, you can use our scripts to convert the Chinese outputs to Wubi before computing BLEU, which makes the scores more comparable.
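
For example, once both the reference and the model output have been converted to Wubi (with the en2wubi package, or a mapping as in the sketch above), BLEU can also be computed programmatically; the file names below are placeholders.

# Score a Wubi-side model output against a Wubi-side reference.
import subprocess

with open("model_output.wubi.txt", encoding="utf-8") as hyp:
    result = subprocess.run(
        ["perl", "multi-bleu.perl", "reference.wubi.txt"],
        stdin=hyp, capture_output=True, text=True, check=True,
    )
print(result.stdout)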

Contacts

Nikola I. Nikolov and Yuhuang Hu
