
Projects that are alternatives to, or similar to, MT-Preparation

Nspm
🤖 Neural SPARQL Machines for Knowledge Graph Question Answering.
Stars: ✭ 156 (+940%)
Mutual labels:  neural-machine-translation
Good Papers
I try my best to keep up to date with cutting-edge knowledge in Machine Learning/Deep Learning and Natural Language Processing. These are my notes on some good papers.
Stars: ✭ 248 (+1553.33%)
Mutual labels:  neural-machine-translation
TS3000 TheChatBOT
It's a social networking chatbot trained on a Reddit dataset. It supports open-bounded queries, developed on the concept of Neural Machine Translation. Beware of it being sarcastic, just like its creator 😝 BTW, it uses the PyTorch framework and Python 3.
Stars: ✭ 20 (+33.33%)
Mutual labels:  neural-machine-translation
Document Transformer
Improving the Transformer translation model with document-level context
Stars: ✭ 160 (+966.67%)
Mutual labels:  neural-machine-translation
Modernmt
Neural Adaptive Machine Translation that adapts to context and learns from corrections.
Stars: ✭ 231 (+1440%)
Mutual labels:  neural-machine-translation
vat nmt
Implementation of "Effective Adversarial Regularization for Neural Machine Translation", ACL 2019
Stars: ✭ 22 (+46.67%)
Mutual labels:  neural-machine-translation
Code Docstring Corpus
Preprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.
Stars: ✭ 137 (+813.33%)
Mutual labels:  neural-machine-translation
2018-dlsl
UPC Deep Learning for Speech and Language 2018
Stars: ✭ 18 (+20%)
Mutual labels:  neural-machine-translation
Tensorflow Shakespeare
Neural machine translation between the writings of Shakespeare and modern English using TensorFlow
Stars: ✭ 244 (+1526.67%)
Mutual labels:  neural-machine-translation
DCGCN
Densely Connected Graph Convolutional Networks for Graph-to-Sequence Learning (authors' MXNet implementation for the TACL19 paper)
Stars: ✭ 73 (+386.67%)
Mutual labels:  neural-machine-translation
Npmt
Towards Neural Phrase-based Machine Translation
Stars: ✭ 175 (+1066.67%)
Mutual labels:  neural-machine-translation
Opennmt
Open Source Neural Machine Translation in Torch (deprecated)
Stars: ✭ 2,339 (+15493.33%)
Mutual labels:  neural-machine-translation
bytenet translation
A TensorFlow implementation of machine translation from the paper "Neural Machine Translation in Linear Time"
Stars: ✭ 60 (+300%)
Mutual labels:  neural-machine-translation
Mtbook
Machine Translation: Foundations and Models (《机器翻译:基础与模型》), by Tong Xiao and Jingbo Zhu
Stars: ✭ 2,307 (+15280%)
Mutual labels:  neural-machine-translation
Neural-Machine-Translation
Several basic neural machine translation models implemented by PyTorch & TensorFlow
Stars: ✭ 29 (+93.33%)
Mutual labels:  neural-machine-translation
Ctranslate2
Fast inference engine for OpenNMT models
Stars: ✭ 140 (+833.33%)
Mutual labels:  neural-machine-translation
Pytorch Seq2seq
Tutorials on implementing a few sequence-to-sequence (seq2seq) models with PyTorch and TorchText.
Stars: ✭ 3,418 (+22686.67%)
Mutual labels:  neural-machine-translation
parallel-corpora-tools
Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.
Stars: ✭ 35 (+133.33%)
Mutual labels:  neural-machine-translation
Word-Level-Eng-Mar-NMT
Translating English sentences to Marathi using Neural Machine Translation
Stars: ✭ 37 (+146.67%)
Mutual labels:  neural-machine-translation
bergamot-translator
Cross-platform C++ library focused on optimized machine translation on consumer-grade devices.
Stars: ✭ 181 (+1106.67%)
Mutual labels:  neural-machine-translation

MT-Preparation

Machine Translation (MT) Preparation Scripts

Installing Requirements

The filtering and subwording scripts use a number of Python packages. To install these dependencies with pip, run the following command in the Terminal/CMD:

pip3 install --user -r requirements.txt

Filtering

There is one script to use for cleaning your Machine Translation dataset. You must have two files, one for the source and one for the target. If you instead have a single TMX file, you can first use the TMX2MT converter.

The filtering script performs the following steps:

  • Deleting empty rows;
  • Deleting duplicate rows;
  • Deleting rows where the source is copied to the target;
  • Deleting source/target segments that are too long (length ratio above 200%, and more than 200 words);
  • Removing HTML tags;
  • Keeping segments in their true case, unless lower is True;
  • Shuffling rows; and
  • Writing the output files.

Run the filtering script in the Terminal/CMD as follows:

python3 filter.py <source_file_path> <target_file_path> <source_lang> <target_lang>
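
For illustration, here is a minimal sketch of this kind of filtering with pandas. The file names (train.en, train.es) are hypothetical, and the actual filter.py may differ in its details and options.

# Rough sketch of the filtering steps, using pandas (illustrative only).
import pandas as pd

with open("train.en", encoding="utf-8") as f:
    source_lines = f.read().splitlines()
with open("train.es", encoding="utf-8") as f:
    target_lines = f.read().splitlines()

df = pd.DataFrame({"source": source_lines, "target": target_lines})

df = df.replace(r"<[^>]+>", "", regex=True)            # remove HTML tags
df = df.apply(lambda col: col.str.strip())             # trim surrounding whitespace
df = df[(df["source"] != "") & (df["target"] != "")]   # delete empty rows
df = df.drop_duplicates()                              # delete duplicate rows
df = df[df["source"] != df["target"]]                  # delete source-copied rows

# Delete too-long segments: more than 200 words, or length ratio above 200%
src_len = df["source"].str.split().str.len()
tgt_len = df["target"].str.split().str.len()
ratio_ok = (src_len / tgt_len <= 2) & (tgt_len / src_len <= 2)
df = df[(src_len <= 200) & (tgt_len <= 200) & ratio_ok]

df = df.sample(frac=1, random_state=1)                 # shuffle rows

# Write the output files
with open("train.en-filtered.en", "w", encoding="utf-8") as f:
    f.write("\n".join(df["source"]) + "\n")
with open("train.es-filtered.es", "w", encoding="utf-8") as f:
    f.write("\n".join(df["target"]) + "\n")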

Subwording

It is recommended to run the subwording process, as it helps your Machine Translation engine avoid out-of-vocabulary tokens. The subwording scripts apply SentencePiece to your source and target Machine Translation files. There are three scripts provided:

1. Train a subwording model

You need to create two subwording models to learn the vocabulary of your source and target.

python3 train.py <train_source_file_tok> <train_target_file_tok>

By default, the subwording model type is unigram. You can change it to BPE by adding --model_type=bpe to these lines in the script as follows:

source_train_value = '--input='+train_source_file_tok+' --model_prefix=source --vocab_size='+str(source_vocab_size)+' --hard_vocab_limit=false --model_type=bpe'
target_train_value = '--input='+train_target_file_tok+' --model_prefix=target --vocab_size='+str(target_vocab_size)+' --hard_vocab_limit=false --model_type=bpe'

Optionally, you can add more options, such as --split_digits=true to split all digits (0-9) into separate pieces, or --byte_fallback=true to decompose unknown pieces into UTF-8 byte pieces, which might help avoid out-of-vocabulary tokens (a combined example using these options is sketched after the notes below).

Notes for big corpora:

  • You can use --train_extremely_large_corpus=true for a big corpus to avoid memory issues.
  • The default SentencePiece value for --input_sentence_size is 0, i.e. the whole corpus. You can change it to a value between 1 and 10 million sentences, which will be enough for creating a good SentencePiece model.
  • When the value of --input_sentence_size is less than the size of the corpus, it is recommended to set --shuffle_input_sentence=true to make your sample representative to the distribution of your data.
  • The default SentencePiece value for --vocab_size is 8,000. You can go for a higher value between 30,000 and 50,000, and up to 100,000 for a big corpus. Still, note that smaller values will encourage the model to make more splits on words, which might be better in the case of a multilingual model if the languages share the alphabet.

2. Subword

In this step, you use the models created in the previous step to subword your source and target Machine Translation files. You have to apply the same step to any source files you will later translate with the Machine Translation model.

python3 subword.py <sp_source_model_path> <sp_target_model_path> <source_file_path> <target_file_path>
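
As an illustration, a minimal subwording loop with the SentencePiece Python API could look like the following; the file names are hypothetical, and the repository's subword.py processes both the source and target files.

# Sketch: subword one file with a trained SentencePiece model (illustrative).
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="source.model")

with open("train.en-filtered.en", encoding="utf-8") as infile, \
     open("train.en-filtered.en.subword", "w", encoding="utf-8") as outfile:
    for line in infile:
        pieces = sp.encode(line.strip(), out_type=str)   # list of subword pieces
        outfile.write(" ".join(pieces) + "\n")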

Notes for OpenNMT users:

  • If you are using OpenNMT, you can add <s> and </s> to the source only. Remove <s> and </s> from the target as they are already added by default (reference). Alternatively, in OpenNMT-tf, there is an option called source_sequence_controls to add start and/or end tokens to the source.
  • After you segment your source and target files with the generated SentencePiece models, you must build vocab using OpenNMT-py to generate vocab files compatible with it. OpenNMT-tf has an option that allows converting SentencePiece vocab to a compatible format.
  • Before you start training with OpenNMT-py, you must configure src_vocab_size and tgt_vocab_size to exactly match the value you used for --vocab_size in SentencePiece. The default is 50000, which is usually good.

3. Desubword

This step is useful after training your Machine Translation model and translating files with it, as you need to decode/desubword the generated target (i.e. translated) files.

python3 desubword.py <target_model_file> <target_pred_file>
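
A minimal desubwording sketch with the SentencePiece Python API, assuming a trained target model and a subworded prediction file; the file names are hypothetical.

# Sketch: decode/desubword a translated file (illustrative file names).
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="target.model")

with open("target.pred.subword", encoding="utf-8") as infile, \
     open("target.pred.txt", "w", encoding="utf-8") as outfile:
    for line in infile:
        pieces = line.strip().split()           # subword pieces separated by spaces
        outfile.write(sp.decode(pieces) + "\n")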

Extracting Training and Development Datasets

In this step, you split the parallel dataset into training and development datasets. The first argument is the number of segments you want in the development dataset; the script randomly selects this number of segments for the dev set and keeps the rest for the train set.

python3 train_dev_split.py <dev_segment_number> <source_file_path> <target_file_path>
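
For illustration, here is a minimal version of such a split in plain Python, keeping the source and target lines aligned; the file names and the development set size are hypothetical.

# Sketch: random train/dev split of a parallel corpus (illustrative values).
import random

dev_size = 2000  # number of segments to hold out for the dev set

with open("train.en-filtered.en.subword", encoding="utf-8") as f:
    source = f.read().splitlines()
with open("train.es-filtered.es.subword", encoding="utf-8") as f:
    target = f.read().splitlines()

dev_idx = set(random.sample(range(len(source)), dev_size))

for name, lines in [("source", source), ("target", target)]:
    with open(f"{name}.dev", "w", encoding="utf-8") as dev_out, \
         open(f"{name}.train", "w", encoding="utf-8") as train_out:
        for i, line in enumerate(lines):
            (dev_out if i in dev_idx else train_out).write(line + "\n")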

Google Colab Notebooks

Questions

If you have questions or suggestions, please feel free to contact me.
