
marziehf / DataAugmentationNMT

License: MIT License
Data Augmentation for Neural Machine Translation

Programming Languages

lua
6591 projects
perl
6916 projects
python
139335 projects - #7 most used programming language
shell
77523 projects

Projects that are alternatives of or similar to DataAugmentationNMT

Openseq2seq
Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
Stars: ✭ 1,378 (+5200%)
Mutual labels:  neural-machine-translation, language-model
Nlp Library
curated collection of papers for the nlp practitioner 📖👩‍🔬
Stars: ✭ 1,025 (+3842.31%)
Mutual labels:  neural-machine-translation, language-model
discolight
discolight is a robust, flexible and infinitely hackable library for generating image augmentations ✨
Stars: ✭ 25 (-3.85%)
Mutual labels:  augmentation
minicons
Utility for analyzing Transformer based representations of language.
Stars: ✭ 28 (+7.69%)
Mutual labels:  language-model
MinTL
MinTL: Minimalist Transfer Learning for Task-Oriented Dialogue Systems
Stars: ✭ 61 (+134.62%)
Mutual labels:  language-model
Data-Rejuvenation
Implementation of our paper "Data Rejuvenation: Exploiting Inactive Training Examples for Neural Machine Translation" in EMNLP-2020.
Stars: ✭ 18 (-30.77%)
Mutual labels:  neural-machine-translation
SQUAD2.Q-Augmented-Dataset
Augmented version of SQUAD 2.0 for Questions
Stars: ✭ 31 (+19.23%)
Mutual labels:  augmentation
torch-pitch-shift
Pitch-shift audio clips quickly with PyTorch (CUDA supported)! Additional utilities for searching efficient transformations are included.
Stars: ✭ 70 (+169.23%)
Mutual labels:  augmentation
SDLM-pytorch
Code accompanying EMNLP 2018 paper Language Modeling with Sparse Product of Sememe Experts
Stars: ✭ 27 (+3.85%)
Mutual labels:  language-model
RNNSearch
An implementation of attention-based neural machine translation using Pytorch
Stars: ✭ 43 (+65.38%)
Mutual labels:  neural-machine-translation
SSAN
How Does Selective Mechanism Improve Self-attention Networks?
Stars: ✭ 18 (-30.77%)
Mutual labels:  neural-machine-translation
tying-wv-and-wc
Implementation for "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling"
Stars: ✭ 39 (+50%)
Mutual labels:  language-model
gpt-j
A GPT-J API to use with python3 to generate text, blogs, code, and more
Stars: ✭ 101 (+288.46%)
Mutual labels:  language-model
transformer
Neutron: A pytorch based implementation of Transformer and its variants.
Stars: ✭ 60 (+130.77%)
Mutual labels:  neural-machine-translation
Word-Prediction-Ngram
Next Word Prediction using n-gram Probabilistic Model with various Smoothing Techniques
Stars: ✭ 25 (-3.85%)
Mutual labels:  language-model
timber-ruby
🌲 Great Ruby logging made easy.
Stars: ✭ 155 (+496.15%)
Mutual labels:  augmentation
FNet-pytorch
Unofficial implementation of Google's FNet: Mixing Tokens with Fourier Transforms
Stars: ✭ 204 (+684.62%)
Mutual labels:  language-model
CodeT5
Code for CodeT5: a new code-aware pre-trained encoder-decoder model.
Stars: ✭ 390 (+1400%)
Mutual labels:  language-model
pyVHDLParser
Streaming based VHDL parser.
Stars: ✭ 51 (+96.15%)
Mutual labels:  language-model
augmenty
Augmenty is an augmentation library based on spaCy for augmenting texts.
Stars: ✭ 101 (+288.46%)
Mutual labels:  augmentation

DataAugmentationNMT

This repository includes the code and scripts for the rare-word-targeted data augmentation for neural machine translation proposed in our paper.

Citation

If you use this code, please cite:

@InProceedings{fadaee-bisazza-monz:2017:Short2,
  author    = {Fadaee, Marzieh  and  Bisazza, Arianna  and  Monz, Christof},
  title     = {Data Augmentation for Low-Resource Neural Machine Translation},
  booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
  month     = {July},
  year      = {2017},
  address   = {Vancouver, Canada},
  publisher = {Association for Computational Linguistics},
  pages     = {567--573},
  url       = {http://aclweb.org/anthology/P17-2090}
}

Dependencies

  • Torch7
  • nn
  • optim
  • lua-cjson
  • torch-hdf5
  • Python 2.7

Usage

Step 1: Data Preprocessing

Before training the monolingual language models in [src/trg], you'll need to preprocess the data for both the forward and backward directions using preprocess.no_preset_v.py.

python src/preprocess.no_preset_v.py --train_txt ./wiki.train.txt \
--val_txt ./wiki.val.txt --test_txt ./wiki.test.txt \
--output_h5 ./data.h5 --output_json ./data.json

This will produce files data.h5 and data.json that will be passed to the training script.
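
The backward language model in Step 2 is trained on data.rev.h5 and data.rev.json. The command above only covers the forward direction, so one way to prepare the backward-direction data is to reverse the token order of every line in the raw text files and run the same preprocessing script on the reversed files. A minimal sketch of such a reversal step (reverse_lines.py and the .rev file names are placeholders, not part of this repository):

# reverse_lines.py -- reverse the token order of every line
# (hypothetical helper, not part of this repository)
import sys

with open(sys.argv[1]) as fin, open(sys.argv[2], 'w') as fout:
    for line in fin:
        fout.write(' '.join(reversed(line.split())) + '\n')

python reverse_lines.py ./wiki.train.txt ./wiki.train.rev.txt

After reversing the train, validation, and test files, run preprocess.no_preset_v.py on them with --output_h5 ./data.rev.h5 --output_json ./data.rev.json.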

Step 2: Language Model Training

After preprocessing the data, you'll need to train two language models, one for the forward and one for the backward direction.

th src/train.lua -input_h5 data.h5 -input_json data.json \
-checkpoint_name models_rnn/cv  -vocabfreq vocab_freq.trg.txt 

th src/train.lua -input_h5 data.rev.h5 -input_json data.rev.json \
-checkpoint_name models_rnn_rev/cv  -vocabfreq vocab_freq.trg.txt

There are many more flags you can use to configure training.

The vocabfreq input is a frequency list of the words in the low-resource setting; it is used later when augmenting rare words with these language models. The format is:

...
change 3028
taken 3007
large 2999
again 2994
...
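
If you do not already have such a list, it can be produced by counting word frequencies in your low-resource training text. A minimal sketch (make_vocab_freq.py is a hypothetical helper, not part of this repository):

# make_vocab_freq.py -- count word frequencies in a tokenized corpus and
# write them out as "word count" lines, most frequent first
# (hypothetical helper, not part of this repository)
import sys
from collections import Counter

counts = Counter()
with open(sys.argv[1]) as fin:
    for line in fin:
        counts.update(line.split())

with open(sys.argv[2], 'w') as fout:
    for word, freq in counts.most_common():
        fout.write('%s %d\n' % (word, freq))

For example (assuming train.en is the side being targeted for augmentation, as in the Step 3 commands): python make_vocab_freq.py train.en vocab_freq.trg.txt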

Step 3: Substitution Generation

After training the language models, you can generate substitution candidates for the [src/trg] side of your bitext. You can run:

th src/substitution.lua -checkpoint models_rnn/cv_xxx.t7 -start_text train.en \
-vocabfreq vocab_freq.trg.txt -sample 0 -topk 1000 -bwd 0 > train.en.subs

th src/substitution.lua -checkpoint models_rnn_rev/cv_xxx.t7 -start_text train.en.rev \
-vocabfreq vocab_freq.trg.txt -sample 0 -topk 1000 -bwd 1 > train.en.rev.subs

start_text is the side of the bitext that you are targeting for augmentation of rare words. vocabfreq is the frequency list used for detecting rare words. topk indicates the maximum number of substitutions you want to have for each position in the sentence.

Running these two commands will give you augmented corpora with a list of substitutions on one side: train.en.subs and train.en.rev.subs. In order to find the substitutions that best match the context, you'll need to take the intersection of these two lists:

perl ./scripts/generate_intersect.pl train.en.subs train.en.rev.subs subs.intersect

subs.intersect contains the substitutions that can be used to augment the bitext. Here's an example of the output:

information where we are successful will be published in this unk .
information{}
where{}
we{doctors:136 humans:135}
are{became:764 remained:245}
successful{}
will{}
be{}
published{interested:728 introduced:604 kept:456 performed:289 placed:615 played:535 released:477 written:790}
in{behind:932 beyond:836}
this{henry:58}
unk{}
.{}

The first line is the original sentence; each line after that is a word in the sentence followed by its suggested substitutions with their respective frequencies.
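
If you want to post-process this output yourself, each per-word line can be parsed with something like the sketch below (it assumes exactly the word{candidate:freq ...} layout shown above; how the blocks for consecutive sentences are delimited in the real file is not specified here):

# parse one block of subs.intersect-style output: the first line is the original
# sentence, each following line has the form 'word{cand:freq cand:freq ...}'
# (illustrative sketch, not part of this repository)
import re

def parse_block(lines):
    sentence = lines[0].split()
    substitutions = []
    for line in lines[1:]:
        word, body = re.match(r'(\S+)\{(.*)\}', line.strip()).groups()
        cands = {}
        for pair in body.split():
            cand, freq = pair.rsplit(':', 1)
            cands[cand] = int(freq)
        substitutions.append((word, cands))
    return sentence, substitutions

Applied to the example above, parse_block returns the tokenized sentence plus entries such as ('we', {'doctors': 136, 'humans': 135}) and ('information', {}).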

Step 4: Generate Augmented Corpora

Using the substitution output, the [trg/src] side of the bitext, the alignment, and the lexical probability file, you can generate the augmented corpora.

You can use fast_align to obtain alignments for your bitext. The format of the alignment input is:

...
0-0 1-10 2-3 2-4 2-5 3-13 4-14 5-8 5-9 6-16 7-14 8-11 10-6 11-7 12-17
0-0 1-0 2-0 2-2 3-1 3-3 4-5 5-5 6-6 8-8 9-9 10-10 11-11
...

The lexical probability input can be obtained from a dictionary or from the alignments. The format is:

...
safely sicher 0.0051237409068
safemode safemode 1
safeness antikollisionssystem 0.3333333
safer sicherer 0.09545972221228
...
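
If you do not have a lexical probability table, one generic way to approximate it (this is not necessarily the procedure used in the paper) is to count co-aligned word pairs in the bitext and normalize per source word. A rough sketch, which assumes the third column is p(second word | first word):

# make_lex_probs.py -- estimate lexical translation probabilities by relative
# frequency over aligned word pairs (illustrative sketch, not part of this
# repository; the conditioning direction is an assumption)
import sys
from collections import defaultdict

src_file, trg_file, align_file = sys.argv[1:4]

pair_counts = defaultdict(lambda: defaultdict(int))
src_counts = defaultdict(int)

with open(src_file) as fs, open(trg_file) as ft, open(align_file) as fa:
    for src_line, trg_line, align_line in zip(fs, ft, fa):
        src, trg = src_line.split(), trg_line.split()
        for pair in align_line.split():
            i, j = pair.split('-')
            s, t = src[int(i)], trg[int(j)]
            pair_counts[s][t] += 1
            src_counts[s] += 1

for s in pair_counts:
    for t in pair_counts[s]:
        print('%s %s %s' % (s, t, float(pair_counts[s][t]) / src_counts[s]))

For example: python make_lex_probs.py train.en train.de alignment.txt > lex.txt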

In order to generate the augmented bitext you can run:

perl ./scripts/data_augmentation.pl subs.intersect train.de alignment.txt lex.txt augmentedOutput

This will generate two files: augmentedOutput.augmented in the [src/trg] language and augmentedOutput.fillout in the [trg/src] language. The first file is the side of the bitext augmented to target rare words. The second file contains the respective translations of the augmented sentences.

If you want to have more than one change in each sentence you can also run:

perl ./scripts/data_augmentation_multiplechanges.pl subs.intersect train.de alignment.txt lex.txt augmentedOutput

An example of the output

Here is a sentence from the augmented file in [src/trg]:

at the same time , the rights of consumers began:604~need to be maintained .

and the respective sentence from the fillout file in [trg/src]:

gleichzeitig begann~müssen die rechte der verbraucher geschützt werden .

In the augmented file, the word began, with frequency 604, substitutes the word need. In the fillout file, its translation, begann, substitutes the original word müssen.

Step 5: Generate Clean Bitext for Translation

To remove all markup and obtain a clean bitext that can be used for translation training, you can run:

perl ./scripts/filter_out_augmentations.pl augmentedOutput.en augmentedOutput.de 1000

Here you can impose a further frequency limit on the rare words you want to augment.
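
For reference, the markup being removed has the shape new:freq~original on the augmented side and new~original on the fillout side, so the cleaning essentially keeps everything before the ~ and drops the :freq annotation. A rough sketch of that stripping step (ignoring the frequency threshold, which the repository script handles):

# strip_markup.py -- reduce 'new:freq~original' / 'new~original' tokens to 'new'
# (rough illustration of the markup format; use scripts/filter_out_augmentations.pl
# for real runs)
import sys

def clean_token(token):
    if '~' in token:
        kept = token.split('~', 1)[0]   # drop the original word after '~'
        return kept.split(':', 1)[0]    # drop the ':freq' annotation, if present
    return token

for line in sys.stdin:
    print(' '.join(clean_token(tok) for tok in line.split()))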

Acknowledgments

In this work this code is utilized:
