vietai / Dab

License: GPL-3.0
Data Augmentation by Backtranslation (DAB) ヽ( •_-)ᕗ

Projects that are alternatives of or similar to Dab

Applied Deep Learning With Tensorflow
Learn applied deep learning from zero to deployment using TensorFlow 1.8+
Stars: ✭ 160 (-45.58%)
Mutual labels:  google-cloud, jupyter-notebook, deep-neural-networks
All Classifiers 2019
A collection of computer vision projects for Acute Lymphoblastic Leukemia classification/early detection.
Stars: ✭ 22 (-92.52%)
Mutual labels:  jupyter-notebook, data-augmentation, deep-neural-networks
Sockeye
Sequence-to-sequence framework with a focus on Neural Machine Translation based on Apache MXNet
Stars: ✭ 990 (+236.73%)
Mutual labels:  deep-neural-networks, attention-is-all-you-need, transformer
Bertqa Attention On Steroids
BertQA - Attention on Steroids
Stars: ✭ 112 (-61.9%)
Mutual labels:  jupyter-notebook, nlp-machine-learning, transformer
Pytorch Original Transformer
My implementation of the original transformer model (Vaswani et al.). I've additionally included the playground.py file for visualizing otherwise seemingly hard concepts. Currently included IWSLT pretrained models.
Stars: ✭ 411 (+39.8%)
Mutual labels:  jupyter-notebook, attention-is-all-you-need, transformer
Mixup Generator
An implementation of "mixup: Beyond Empirical Risk Minimization"
Stars: ✭ 250 (-14.97%)
Mutual labels:  jupyter-notebook, data-augmentation, deep-neural-networks
Qwiklabs
labs guide for completing qwiklabs challenge
Stars: ✭ 103 (-64.97%)
Mutual labels:  google-cloud, jupyter-notebook
transformer
A simple TensorFlow implementation of the Transformer
Stars: ✭ 25 (-91.5%)
Mutual labels:  transformer, attention-is-all-you-need
KitanaQA
KitanaQA: Adversarial training and data augmentation for neural question-answering models
Stars: ✭ 58 (-80.27%)
Mutual labels:  transformer, data-augmentation
speech-transformer
Transformer implementation specialized in speech recognition tasks, using PyTorch.
Stars: ✭ 40 (-86.39%)
Mutual labels:  transformer, attention-is-all-you-need
Esper Tv
Esper instance for TV news analysis
Stars: ✭ 37 (-87.41%)
Mutual labels:  google-cloud, jupyter-notebook
OpenPrompt
An Open-Source Framework for Prompt-Learning.
Stars: ✭ 1,769 (+501.7%)
Mutual labels:  transformer, nlp-machine-learning
transformer
Neutron: A pytorch based implementation of Transformer and its variants.
Stars: ✭ 60 (-79.59%)
Mutual labels:  transformer, attention-is-all-you-need
Tf Serving K8s Tutorial
A Tutorial for Serving Tensorflow Models using Kubernetes
Stars: ✭ 78 (-73.47%)
Mutual labels:  google-cloud, jupyter-notebook
kospeech
Open-Source Toolkit for End-to-End Korean Automatic Speech Recognition leveraging PyTorch and Hydra.
Stars: ✭ 456 (+55.1%)
Mutual labels:  transformer, attention-is-all-you-need
Imagenet
Trial on kaggle imagenet object localization by yolo v3 in google cloud
Stars: ✭ 56 (-80.95%)
Mutual labels:  google-cloud, jupyter-notebook
transformer
A PyTorch Implementation of "Attention Is All You Need"
Stars: ✭ 28 (-90.48%)
Mutual labels:  transformer, attention-is-all-you-need
Rad
RAD: Reinforcement Learning with Augmented Data
Stars: ✭ 268 (-8.84%)
Mutual labels:  jupyter-notebook, deep-neural-networks
Dlpython course
Examples for the course "Programming Deep Neural Networks in Python" (in Russian)
Stars: ✭ 266 (-9.52%)
Mutual labels:  jupyter-notebook, deep-neural-networks
Drq
DrQ: Data regularized Q
Stars: ✭ 268 (-8.84%)
Mutual labels:  jupyter-notebook, data-augmentation

✨ Data Augmentation by Back-translation (DAB) ✨

This repository builds on the idea of back-translation [1] as a data augmentation method [2, 3]. The idea is simple: translate a sentence from one language to another and then back to the original language. Because the round trip rarely reproduces the input verbatim, each sentence yields a new paraphrase, so one can multiply the size of any NLP dataset. An example generated with our code is shown in the GIF below.
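
As plain pseudocode, the round trip amounts to the sketch below; translate() here is a hypothetical stand-in for whatever translation model you plug in (in this project, tensor2tensor checkpoints), not a function exported by this repository.

# One back-translation round trip (sketch; translate() is hypothetical).
def back_translate(sentence, translate, src="vi", pivot="en"):
    # Translate into the pivot language, then back into the source language.
    pivot_sentence = translate(sentence, source=src, target=pivot)
    return translate(pivot_sentence, source=pivot, target=src)

# e.g. back_translate("Tôi thích học máy.", translate) -> a Vietnamese paraphrase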

In this project we provide an interface for investigating back-translation models interactively; it works with any tensor2tensor checkpoint. We also provide a batch mode for back-translating a full dataset, see this section. Here we provide two sets of trained checkpoints: English-Vietnamese (both directions, transformer_tiny) and English-French (transformer_big); see Appendix A for details.

📓 Interactive Back-translation.

We use this Colab Notebook to generate the GIF you saw above.

📓 A Case Study on Back-translation for Low-resource Languages

Unsupervised Data Augmentation [3] demonstrated that back-translation improves results for a high-resource language (English). In this work, we conduct a case study for Vietnamese through the following Colab Notebook.

On a sentiment analysis dataset with only 10K examples, we use back-translation to double the training set size and obtain an improvement of nearly 2.5% in absolute accuracy:

Original set: 83.48 % accuracy
Augmented by back-translation: 85.91 % accuracy
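
The augmentation step itself just pairs every labeled example with its back-translated paraphrase. A minimal sketch, assuming the hypothetical back_translate() above and a dataset of (text, label) pairs (this is not the notebook's actual data-loading code):

# Double a labeled dataset by adding one paraphrase per example (sketch).
def augment(dataset, translate):
    augmented = list(dataset)                 # keep the original examples
    for text, label in dataset:
        paraphrase = back_translate(text, translate)
        augmented.append((paraphrase, label)) # the paraphrase keeps the original label
    return augmented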

Here is another GIF demo with Vietnamese sentences - for fun ;)

How to contribute? 🤔

🌱 More and/or better translation models. Check out Appendix A for Colab Notebook tutorials on how to train translation models with tensor2tensor.

🌱 More and/or better translation data or monolingual data.

🌱 Code that makes our codebase even easier to use - including tests (Travis, CodeCov).

🌱 Texts/Illustrations to make our documentation even easier to understand.

We will be working on a more detailed guideline for contribution.

BibTeX 🐝

@article{trieu19backtranslate,
  author  = {Trieu H. Trinh and Thang Le and Phat Hoang and Minh{-}Thang Luong},
  title   = {A Tutorial on Data Augmentation by Backtranslation (DAB)},
  journal = {https://github.com/vietai/dab},
  year    = {2019},
}

References

[1] Sennrich, Rico, Barry Haddow, and Alexandra Birch. "Improving neural machine translation models with monolingual data.", ACL 2016.

[2] Edunov, Sergey, et al. "Understanding back-translation at scale.", EMNLP 2018.

[3] Xie, Qizhe, et al. "Unsupervised data augmentation." arXiv preprint arXiv:1904.12848 (2019).

[4] Clark, Kevin, et al. "Semi-supervised sequence modeling with cross-view training.", EMNLP 2018.

Appendix A: Training Translation Models with tensor2tensor

📓 Training Translation Models: how to connect to GPU/TPU and Google Drive/Cloud Storage, download training/testing data from the internet, and train/evaluate your models. We use the IWSLT'15 English-Vietnamese dataset and the off-the-shelf Transformer implementation from tensor2tensor with its transformer_tiny setting, and obtain the following results:

BLEU score:
English to Vietnamese: 28.7
Vietnamese to English: 27.8

As of this writing, the result above is already competitive with the current state of the art (29.6 BLEU) [4], without using semi-supervised or multi-task learning. More importantly, it is good enough to be useful for the purpose of this project! For English-French, we use the transformer_big checkpoint provided in the open-source implementation of Unsupervised Data Augmentation [3].

📓 Analyse your Translation Models: play with and visualize the trained models' attention.

Appendix B: Command Syntax

The remainder of this README is for those who do not have access to our Colab Notebooks and/or only need a quick reference to the command syntax of our code.

Requirements

We make use of the tensor2tensor library to build deep neural networks that perform translation.
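
If you are working outside our Colab Notebooks, tensor2tensor is available on PyPI; the exact version pin to use alongside your TensorFlow install is not specified here, so treat this as a starting point:

pip install tensor2tensor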

Training the two translation models

A prerequisite to performing back-translation is to train two translation models: English to Vietnamese and Vietnamese to English. A demonstration of the following commands to generate data, train and evaluate the models can be found in this Google Colab.

Generate data (TFRecords)

For English -> Vietnamese

python t2t_datagen.py \
--data_dir=data/translate_envi_iwslt32k \
--tmp_dir=tmp/ \
--problem=translate_envi_iwslt32k

For Vietnamese -> English

python t2t_datagen.py \
--data_dir=data/translate_vien_iwslt32k \
--tmp_dir=tmp/ \
--problem=translate_vien_iwslt32k

Train

Some examples to train your translation models with the Transformer architecture:

For English -> Vietnamese

python t2t_trainer.py \
--data_dir=data/translate_envi_iwslt32k \
--problem=translate_envi_iwslt32k \
--hparams_set=transformer_tiny \
--model=transformer \
--output_dir=checkpoints/envi

For Vietnamese -> English

python t2t_trainer.py \
--data_dir=data/translate_vien_iwslt32k \
--problem=translate_vien_iwslt32k \
--hparams_set=transformer_tiny \
--model=transformer \
--output_dir=checkpoints/vien
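
t2t_trainer.py accepts tensor2tensor's standard trainer flags, so you can override the training length or hyperparameters when the defaults do not fit your hardware. The values below are illustrative only, not the settings used for our released checkpoints:

python t2t_trainer.py \
--data_dir=data/translate_envi_iwslt32k \
--problem=translate_envi_iwslt32k \
--hparams_set=transformer_tiny \
--hparams="batch_size=2048" \
--model=transformer \
--train_steps=50000 \
--output_dir=checkpoints/envi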

Analyse the trained models

Once you have finished training and evaluating the models, you can certainly play around with them a bit. For example, you might want to run some interactive translation and/or visualize the attention masks for your inputs of choice. This is demonstrated in this Google Colab.
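
If you prefer the terminal to the Colab, tensor2tensor's stock t2t-decoder entry point can run interactive translation against a trained checkpoint. This is a sketch built from standard tensor2tensor flags, not a script shipped in this repository:

t2t-decoder \
--data_dir=data/translate_envi_iwslt32k \
--problem=translate_envi_iwslt32k \
--model=transformer \
--hparams_set=transformer_tiny \
--output_dir=checkpoints/envi \
--decode_hparams="beam_size=4,alpha=0.6" \
--decode_interactive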

Back translate from a text file.

We have trained two translation models (vien and envi) using the tiny setting of tensor2tensor's Transformer and put them on Google Cloud Storage with public access for you to use.

Here is an example of back translating Vietnamese -> English -> Vietnamese from an input text file.

python back_translate.py \
--decode_hparams="beam_size=4,alpha=0.6" \
--paraphrase_from_file=test_input.vi \
--paraphrase_to_file=test_output.vi \
--model=transformer \
--hparams_set=transformer_tiny \
--from_ckpt=checkpoints/vien \
--to_ckpt=checkpoints/envi \
--from_data_dir=data/translate_vien_iwslt32k \
--to_data_dir=data/translate_envi_iwslt32k
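
The input is a plain text file; this sketch assumes one Vietnamese sentence per line (check the Colab above for the exact format expected by back_translate.py):

# Hypothetical two-line input; replace with your own sentences.
printf 'Tôi thích đọc sách.\nHôm nay trời đẹp quá.\n' > test_input.vi
# After back_translate.py finishes, test_output.vi should contain the paraphrases.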

Add --backtranslate_interactively to back-translate interactively from your terminal. Alternatively, you can also check out this Colab.

For a demonstration of augmenting real datasets by back-translation and obtaining actual gains in accuracy, check out this Google Colab!
