
wxjiao / Data-Rejuvenation

Licence: other
Implementation of our paper "Data Rejuvenation: Exploiting Inactive Training Examples for Neural Machine Translation" in EMNLP-2020.

Programming Languages

python, shell, Cuda, C++, cython, perl

Projects that are alternatives of or similar to Data-Rejuvenation

Tensorflow Shakespeare
Neural machine translation between the writings of Shakespeare and modern English using TensorFlow
Stars: ✭ 244 (+1255.56%)
Mutual labels:  neural-machine-translation
Neural-Machine-Translation
Several basic neural machine translation models implemented by PyTorch & TensorFlow
Stars: ✭ 29 (+61.11%)
Mutual labels:  neural-machine-translation
dynmt-py
Neural machine translation implementation using dynet's python bindings
Stars: ✭ 17 (-5.56%)
Mutual labels:  neural-machine-translation
Pytorch Seq2seq
Tutorials on implementing a few sequence-to-sequence (seq2seq) models with PyTorch and TorchText.
Stars: ✭ 3,418 (+18888.89%)
Mutual labels:  neural-machine-translation
DCGCN
Densely Connected Graph Convolutional Networks for Graph-to-Sequence Learning (authors' MXNet implementation for the TACL19 paper)
Stars: ✭ 73 (+305.56%)
Mutual labels:  neural-machine-translation
2018-dlsl
UPC Deep Learning for Speech and Language 2018
Stars: ✭ 18 (+0%)
Mutual labels:  neural-machine-translation
Opennmt
Open Source Neural Machine Translation in Torch (deprecated)
Stars: ✭ 2,339 (+12894.44%)
Mutual labels:  neural-machine-translation
minimal-nmt
A minimal NMT example to serve as a seq2seq+attention reference.
Stars: ✭ 36 (+100%)
Mutual labels:  neural-machine-translation
TS3000 TheChatBOT
It's a social networking chat-bot trained on a Reddit dataset. It supports open-bounded queries, built on the concept of Neural Machine Translation. Beware of it being sarcastic, just like its creator 😝 BTW, it uses the PyTorch framework and Python 3.
Stars: ✭ 20 (+11.11%)
Mutual labels:  neural-machine-translation
sentencepiece-jni
Java JNI wrapper for SentencePiece: unsupervised text tokenizer for Neural Network-based text generation.
Stars: ✭ 26 (+44.44%)
Mutual labels:  neural-machine-translation
vat nmt
Implementation of "Effective Adversarial Regularization for Neural Machine Translation", ACL 2019
Stars: ✭ 22 (+22.22%)
Mutual labels:  neural-machine-translation
bergamot-translator
Cross-platform C++ library focused on optimized machine translation on consumer-grade devices.
Stars: ✭ 181 (+905.56%)
Mutual labels:  neural-machine-translation
parallel-corpora-tools
Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.
Stars: ✭ 35 (+94.44%)
Mutual labels:  neural-machine-translation
Good Papers
I try my best to keep up to date with cutting-edge knowledge in Machine Learning/Deep Learning and Natural Language Processing. These are my notes on some good papers.
Stars: ✭ 248 (+1277.78%)
Mutual labels:  neural-machine-translation
zero
Zero -- A neural machine translation system
Stars: ✭ 121 (+572.22%)
Mutual labels:  neural-machine-translation
Modernmt
Neural Adaptive Machine Translation that adapts to context and learns from corrections.
Stars: ✭ 231 (+1183.33%)
Mutual labels:  neural-machine-translation
Word-Level-Eng-Mar-NMT
Translating English sentences to Marathi using Neural Machine Translation
Stars: ✭ 37 (+105.56%)
Mutual labels:  neural-machine-translation
ABD-NMT
Code for "Asynchronous bidirectional decoding for neural machine translation" (AAAI, 2018)
Stars: ✭ 32 (+77.78%)
Mutual labels:  neural-machine-translation
NiuTrans.NMT
A Fast Neural Machine Translation System. It is developed in C++ and relies on NiuTensor for fast tensor APIs.
Stars: ✭ 112 (+522.22%)
Mutual labels:  neural-machine-translation
MT-Preparation
Machine Translation (MT) Preparation Scripts
Stars: ✭ 15 (-16.67%)
Mutual labels:  neural-machine-translation

Data Rejuvenation: Exploiting Inactive Training Examples for Neural Machine Translation

Implementation of our paper "Data Rejuvenation: Exploiting Inactive Training Examples for Neural Machine Translation" to appear in EMNLP 2020. [paper]

🔥NEWS!🔥: Try Data Rejuvenation on WMT'19/20 datasets. You will be surprised!

Model                  newstest'19   newstest'20
Transformer-Big        41.1          33.7
+ Data Rejuvenation    43.0          35.5

Results: models are trained on the WMT'19 En-De training set (~36.8M sentence pairs), validated on newstest'18, and tested on newstest'19/20.

Brief Introduction

Large-scale training datasets lie at the core of the recent success of neural machine translation (NMT) models. However, the complex patterns and potential noise in large-scale data make training NMT models difficult. In this work, we explore identifying the inactive training examples which contribute less to model performance, and show that the existence of inactive examples depends on the data distribution. We further introduce data rejuvenation to improve the training of NMT models on large-scale datasets by exploiting inactive examples. The proposed framework consists of three phases. First, we train an identification model on the original training data and use it to distinguish inactive examples from active examples by their sentence-level output probabilities. Then, we train a rejuvenation model on the active examples and use it to re-label the inactive examples with forward-translation. Finally, the rejuvenated examples and the active examples are combined to train the final NMT model. Experimental results on the WMT14 English-German and English-French datasets show that the proposed data rejuvenation consistently and significantly improves performance for several strong NMT models. Extensive analyses reveal that our approach stabilizes and accelerates the training process of NMT models, resulting in final models with better generalization capability.

Figure 1: The framework of Data Rejuvenation.
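
The identification phase ranks each training example by the sentence-level output probability that the identification model assigns to it. A minimal formulation, assuming a length-normalized log-probability score (the exact aggregation used in identify_split.py may differ), with source x, target y, and model parameters theta:

    s(x, y) = \frac{1}{|y|} \sum_{t=1}^{|y|} \log p_\theta(y_t \mid y_{<t}, x)

Examples with the lowest sentence-level scores are treated as inactive and later re-labeled by forward-translation; the remaining examples are treated as active.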

Code Base

This implementation is based on fairseq (v0.9.0), with customized modifications to its scripts.

To start, clone this repo and install fairseq first. Run the following pip command in fairseq/:

pip install --editable .

Additional Functionalities:

  • Transformer-based LSTM;
  • Force decoding: force_decode.py;
  • Identification: identify_split.py;
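
Force decoding scores a given reference target under a trained model instead of searching for a new translation. Besides the customized force_decode.py, stock fairseq offers --score-reference for the same purpose; the sketch below is a hedged illustration only (paths assume the data-bin/ and checkpoints/ layout set up in the Pipeline section, and the output format differs from the status_train_*.txt files written by the customized scripts):

    fairseq-generate data-bin/wmt14_en_de_base \
        --path checkpoints/wmt14_en_de_base/checkpoint_best.pt \
        --gen-subset train --score-reference --max-tokens 8192 \
        > results/wmt14_en_de_base/score_reference_train.txt
    # P-<id> lines in the output hold token-level log-probability scores for each reference.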

Pipeline

We take the Transformer-Base model and the WMT14 En-De dataset as a running example.

Identification

  1. Create four folders in fairseq/.

    mkdir dataset
    mkdir data-bin
    mkdir checkpoints
    mkdir results
    

    These four folders are used as follows:

    • fairseq/dataset/: Save the raw dataset with BPE applied.
      wmt14_en_de_base/train.en
      wmt14_en_de_base/train.de
      wmt14_en_de_base/valid.en
      wmt14_en_de_base/valid.de
      wmt14_en_de_base/test.en
      wmt14_en_de_base/test.de
      
    • fairseq/data-bin/: Save the binarized data after pre-processing.
    • fairseq/checkpoints/: Save the checkpoints of models during training.
    • fairseq/results/: Save the output results, including training log, inference output, token-wise probability, etc.
  2. Train an identification NMT model and obtain the token-wise prediction probabilities.

    • Train the NMT model on the full WMT14 En-De training data:
      sh sh_train.sh
      
    • Check the best model:
      fairseq/checkpoints/wmt14_en_de_base/checkpoint_best.pt
      
    • Force-decode the full training data:
      sh sh_forcedecode.sh
      
    • Check the token-wise probability:
      fairseq/results/wmt14_en_de_base/sample_status/status_train_[BestStep].txt
      
  3. Compute the sentence-level probability and split the training data into inactive and active examples (see the consolidated command sketch after this list).

    • Identify and split:
      python identify_split.py
      
    • Check the inactive examples:
      fairseq/dataset/wmt14_en_de_base_identified/inactive.en
      fairseq/dataset/wmt14_en_de_base_identified/inactive.de
      fairseq/dataset/wmt14_en_de_base_identified/active.en
      fairseq/dataset/wmt14_en_de_base_identified/active.de
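
For orientation, here is a hedged sketch of steps 1–3 with stock fairseq commands; the repo's sh_train.sh, sh_forcedecode.sh, and identify_split.py are the authoritative versions and may use different flags, configurations, and file names:

    # 1) Binarize the BPE-segmented raw data into data-bin/.
    fairseq-preprocess --source-lang en --target-lang de \
        --trainpref dataset/wmt14_en_de_base/train \
        --validpref dataset/wmt14_en_de_base/valid \
        --testpref dataset/wmt14_en_de_base/test \
        --destdir data-bin/wmt14_en_de_base \
        --joined-dictionary --workers 16

    # 2) Train the identification model (a standard Transformer-Base recipe is assumed here).
    fairseq-train data-bin/wmt14_en_de_base \
        --arch transformer_wmt_en_de --share-all-embeddings \
        --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
        --lr 7e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
        --dropout 0.1 --weight-decay 0.0001 \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --max-tokens 4096 --save-dir checkpoints/wmt14_en_de_base

    # Force-decode the training data with sh_forcedecode.sh (or --score-reference, see above), then:
    # 3) Aggregate token-wise probabilities into sentence-level scores and split the training
    #    data into inactive/active subsets; see identify_split.py for its actual options.
    python identify_split.py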
      

Rejuvenation

  1. Train a rejuvenation NMT model and generate over the inactive examples (see the command sketch after this list).
    • Train the NMT model as normal but on the active examples:
      sh sh_train.sh
      
    • Check the best model:
      fairseq/checkpoints/wmt14_en_de_base_active/checkpoint_best.pt
      
    • Generate over the inactive examples (without --remove-bpe, so the output stays BPE-segmented for re-training):
      sh sh_generate_extra.sh
      
    • Check the rejuvenated examples:
      fairseq/results/wmt14_en_de_base_active/inactive/source.txt
      fairseq/results/wmt14_en_de_base_active/inactive/target.txt
      fairseq/results/wmt14_en_de_base_active/inactive/decoding.txt
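
A hedged sketch of this phase with stock fairseq commands follows; the rejuvenation model itself is trained with sh_train.sh pointed at the active subset, and sh_generate_extra.sh may handle generation and output files differently (the file names below are assumptions):

    # Binarize the inactive pairs as a pseudo test set, reusing the original dictionaries.
    fairseq-preprocess --source-lang en --target-lang de \
        --testpref dataset/wmt14_en_de_base_identified/inactive \
        --destdir data-bin/wmt14_en_de_base_inactive \
        --srcdict data-bin/wmt14_en_de_base/dict.en.txt \
        --tgtdict data-bin/wmt14_en_de_base/dict.de.txt

    # Forward-translate the inactive sources with the rejuvenation model.
    # No --remove-bpe: the hypotheses stay BPE-segmented so they can be re-binarized for training.
    fairseq-generate data-bin/wmt14_en_de_base_inactive \
        --path checkpoints/wmt14_en_de_base_active/checkpoint_best.pt \
        --gen-subset test --beam 4 --lenpen 0.6 --max-tokens 8192 \
        > results/wmt14_en_de_base_active/inactive_gen.txt

    # Extract aligned source/hypothesis pairs (fairseq-generate prefixes sources with S-
    # and hypotheses with H-), then combine them with the active examples.
    grep '^S-' results/wmt14_en_de_base_active/inactive_gen.txt | cut -f2 > rejuvenated.en
    grep '^H-' results/wmt14_en_de_base_active/inactive_gen.txt | cut -f3 > rejuvenated.de
    mkdir -p dataset/wmt14_en_de_base_rejuvenated
    cat dataset/wmt14_en_de_base_identified/active.en rejuvenated.en > dataset/wmt14_en_de_base_rejuvenated/train.en
    cat dataset/wmt14_en_de_base_identified/active.de rejuvenated.de > dataset/wmt14_en_de_base_rejuvenated/train.de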
      

🌟NOTE🌟: A strong identification NMT model can take over the job of the rejuvenation NMT model, which saves the effort of training a new model. Examples are Transformer-Big and Dynamic-Conv models trained with large-batch configurations.

Final NMT Model

  1. Train a final NMT model from scratch (see the command sketch after this list).
    • Train the NMT model on the combination of active examples and rejuvenated examples:
      sh sh_train.sh
      
    • Check the best model:
      fairseq/checkpoints/wmt14_en_de_base_rejuvenated/checkpoint_best.pt
      
    • Evaluate on the test set:
      sh sh_generate.sh
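
A hedged sketch of preparing and evaluating the final model; the rejuvenated file names continue the assumptions from the previous sketch, and sh_generate.sh may use different generation settings:

    # Binarize the combined (active + rejuvenated) training data with the original dictionaries.
    fairseq-preprocess --source-lang en --target-lang de \
        --trainpref dataset/wmt14_en_de_base_rejuvenated/train \
        --validpref dataset/wmt14_en_de_base/valid \
        --testpref dataset/wmt14_en_de_base/test \
        --destdir data-bin/wmt14_en_de_base_rejuvenated \
        --srcdict data-bin/wmt14_en_de_base/dict.en.txt \
        --tgtdict data-bin/wmt14_en_de_base/dict.de.txt --workers 16

    # Train from scratch with the same sh_train.sh recipe, then translate the test set.
    fairseq-generate data-bin/wmt14_en_de_base_rejuvenated \
        --path checkpoints/wmt14_en_de_base_rejuvenated/checkpoint_best.pt \
        --gen-subset test --beam 4 --lenpen 0.6 --remove-bpe \
        > results/wmt14_en_de_base_rejuvenated/test_gen.txt

    # fairseq-generate reports corpus-level BLEU on the last line of its log;
    # hypotheses and references can also be extracted and scored with external tools.
    tail -n 1 results/wmt14_en_de_base_rejuvenated/test_gen.txt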
      

Reference Performance

We evaluate the proposed data rejuvenation approach on several SOTA architectures and two language pairs. Data rejuvenation consistently and significantly improves translation performance in all cases, demonstrating the effectiveness and universality of the approach. It is worth noting that these improvements are achieved without introducing any additional data or model modifications.

Table 1: Evaluation of translation performance across model architectures and language pairs.

Public Impact

Citation

Please kindly cite our paper if you find it helpful:

@inproceedings{jiao2020data,
  title     = {Data Rejuvenation: Exploiting Inactive Training Examples for Neural Machine Translation}, 
  author    = {Wenxiang Jiao and Xing Wang and Shilin He and Irwin King and Michael R. Lyu and Zhaopeng Tu},
  booktitle = {EMNLP},
  year      = {2020}
}