Chinese-English NMT

Experiments and reproduction of pretrained models for WMT17 Chinese-English translation using fairseq.

Abstract

A major pain point of training any RNN/LSTM model is that it is very time consuming, so the fully convolutional architecture proposed by fairseq is very appealing. Cursory experiments show much faster training times for fconv (fully convolutional sequence-to-sequence) than for blstm (bi-LSTM), with comparable results. Although fconv measures slightly lower BLEU scores than blstm, some manual tests seem to favor fconv. A hybrid convenc model (convolutional encoder, LSTM decoder) trains for many more epochs yet scores much worse on BLEU.

Model       Epochs  Training time  BLEU4 (beam 1)  BLEU4 (beam 5)  BLEU4 (beam 10)  BLEU4 (beam 20)
fconv       25      ~4.5 hrs       63.49           62.22           62.52            62.74
fconv_enc7  33      ~5 hrs         66.40           65.52           65.80            65.96
fconv_dec5  28      ~5 hrs         65.65           64.71           64.91            64.98
blstm       30      ~8 hrs         64.59           64.15           64.38            63.76
convenc     47      ~7 hrs         50.91           56.71           56.83            53.66

Download

Pretrained models:

Install

Follow the fairseq installation instructions, then install the tokenizers:

# Chinese tokenizer
$ pip install jieba

# English tokenizer
$ pip install nltk
$ mkdir -p ~/nltk_data/tokenizers/
$ wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip -O ~/nltk_data/tokenizers/punkt.zip
$ unzip ~/nltk_data/tokenizers/punkt.zip -d ~/nltk_data/tokenizers/
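
To verify the tokenizer setup, a quick sanity check (a hypothetical snippet, not part of the repo's setup):

import jieba
import nltk

# nltk.data.find raises LookupError if the punkt data was not unpacked
# where nltk expects it (~/nltk_data/tokenizers/punkt).
nltk.data.find("tokenizers/punkt")
print(nltk.word_tokenize("This is a test."))
print(" ".join(jieba.cut("这是一个测试。")))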

Additionally, we use scripts from Moses and subword-nmt:

$ git clone https://github.com/moses-smt/mosesdecoder
$ git clone https://github.com/rsennrich/subword-nmt

Additional Setup

CUDA libraries may need to be linked into LD_LIBRARY_PATH, e.g. if you see the error below:

# Couldn't open CUDA library libcupti.so.8.0. LD_LIBRARY_PATH: /git/torch/install/lib:
$ cd $LD_LIBRARY_PATH
$ sudo ln -s /usr/local/cuda-8.0/extras/CUPTI/lib64/* $LD_LIBRARY_PATH/

Preprocessing

Word Token

We tokenize the dataset using nltk.word_tokenize for English and jieba for Chinese word segmentation.
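
For illustration, here is how one sentence pair is tokenized (a minimal sketch with made-up sentences; the repo's actual preprocessing lives in preprocess/wmt.py):

import jieba
from nltk.tokenize import word_tokenize

en = "You actually have to implement the solution."
zh = "你必须实施这个解决办法。"

# English: nltk's Punkt-based word tokenizer.
print(" ".join(word_tokenize(en)))  # You actually have to implement the solution .

# Chinese: jieba word segmentation.
print(" ".join(jieba.cut(zh)))      # e.g. 你 必须 实施 这个 解决办法 。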

Casing

We normalize casing on the English side by converting all strings to lower case.

Merge blank lines

We note that the dataset often contains blank lines. In some cases this is just formatting, but in others a long English sentence is translated as two Chinese sentences, which appears in the English corpus as a sentence followed by a blank line. To handle this, we merge the two Chinese sentences onto the same line and then remove the blank line from both corpora.

There are also formatting issues where the English corpus has a blank line while the Chinese corpus has a single period. We treat both as blank lines and remove them. A sketch of the merge logic follows.
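
A minimal sketch of this merge rule (the function and its edge-case handling are illustrative assumptions; the repo's actual logic lives in preprocess/wmt.py):

def merge_blank_lines(en_lines, zh_lines):
    """Merge split Chinese sentence pairs; drop pairs blank on both sides."""
    merged, i = [], 0
    while i < len(en_lines):
        en, zh = en_lines[i].strip(), zh_lines[i].strip()
        next_en = en_lines[i + 1].strip() if i + 1 < len(en_lines) else None
        next_zh = zh_lines[i + 1].strip() if i + 1 < len(zh_lines) else None
        if en and zh and next_en == "" and next_zh:
            # English sentence followed by a blank line, Chinese split in two:
            # join the two Chinese sentences and skip the blank English line.
            merged.append((en, zh + " " + next_zh))
            i += 2
        elif not en and zh in ("", "."):
            # Blank line (or a lone ".") on both sides: drop the pair.
            i += 1
        else:
            merged.append((en, zh))
            i += 1
    return merged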

Additional data cleaning

We note that further data cleaning could be added (a rough sketch follows the list):

  • Remove non-English/Chinese sentences; at least one Russian sentence was spotted.
  • Remove (HTML?) markup.
  • Remove non-breaking whitespace: \xa0 was found and converted to an ordinary space.
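
A rough sketch of these filters (hypothetical helper functions, not the repo's code):

import re

def clean_line(line):
    line = re.sub(r"<[^>]+>", " ", line)  # strip markup tags
    line = line.replace("\xa0", " ")      # non-breaking space -> ordinary space
    return " ".join(line.split())         # collapse repeated whitespace

def is_english_or_chinese(line):
    # Crude language filter: reject lines containing Cyrillic characters.
    return re.search(r"[\u0400-\u04FF]", line) is None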

Preprocessing pipeline

The full pipeline is run by wmt_prepare.sh:

  1. We download, unzip, tokenize, and clean the dataset in preprocess/wmt.py.

  2. We learn a subword vocabulary and apply it with the subword-nmt scripts (see the sketch below).

  3. We binarize the datasets with fairseq preprocess.
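
For step 2, a minimal sketch of learning and applying the subword vocabulary (this assumes pip install subword-nmt, whose modules mirror the cloned scripts; the file names and 32k merge count are illustrative assumptions):

import codecs
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn BPE merge operations from the tokenized English training corpus.
with codecs.open("train.tok.en", encoding="utf-8") as src, \
     codecs.open("codes.en", "w", encoding="utf-8") as out:
    learn_bpe(src, out, num_symbols=32000)

# Apply the learned codes to a sentence.
with codecs.open("codes.en", encoding="utf-8") as codes:
    bpe = BPE(codes)
print(bpe.process_line("you actually have to implement the solution"))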

Training

Run wmt17_train.sh, which does the following:

$ DATADIR=data-bin/wmt17_en_zh
$ TRAIN=trainings/wmt17_en_zh

# Standard bi-directional LSTM model
$ mkdir -p $TRAIN/blstm
$ fairseq train -sourcelang en -targetlang zh -datadir $DATADIR \
    -model blstm -nhid 512 -dropout 0.2 -dropout_hid 0 -optim adam -lr 0.0003125 \
    -savedir $TRAIN/blstm

# Fully convolutional sequence-to-sequence model
$ mkdir -p $TRAIN/fconv
$ fairseq train -sourcelang en -targetlang zh -datadir $DATADIR \
    -model fconv -nenclayer 4 -nlayer 3 -dropout 0.2 -optim nag -lr 0.25 -clip 0.1 \
    -momentum 0.99 -timeavg -bptt 0 -savedir $TRAIN/fconv

# Convolutional encoder, LSTM decoder
$ mkdir -p $TRAIN/convenc
$ fairseq train -sourcelang en -targetlang zh -datadir $DATADIR \
    -model conv -nenclayer 6 -dropout 0.2 -dropout_hid 0 -savedir $TRAIN/convenc

Generate

Run wmt17_generate.sh, or translate individual lines with generate-lines as follows (each example shows the English input and the model's Chinese output):

$ DATADIR=data-bin/wmt17_en_zh
$ TRAIN=trainings/wmt17_en_zh

# Optional: optimize for generation speed (fconv only)
$ fairseq optimize-fconv -input_model $TRAIN/fconv/model_best.th7 -output_model $TRAIN/fconv/model_best_opt.th7

$ fairseq generate-lines -sourcedict $DATADIR/dict.en.th7 -targetdict $DATADIR/dict.zh.th7 -path $TRAIN/fconv/model_best_opt.th7 -beam 10 -nbest 2
# Input:  you actually have to implement the solution
# Output: <unk> 实际上 必须 实施 解决办法 。

$ fairseq generate-lines -sourcedict $DATADIR/dict.en.th7 -targetdict $DATADIR/dict.zh.th7 -path $TRAIN/blstm/model_best.th7 -beam 10 -nbest 2
# Input:  you actually have to implement the solution
# Output: <unk> , 这些 方案 必须 非常 困难

$ fairseq generate-lines -sourcedict $DATADIR/dict.en.th7 -targetdict $DATADIR/dict.zh.th7 -path $TRAIN/convenc/model_best.th7 -beam 10 -nbest 2
# Input:  you actually have to implement the solution
# Output: <unk> 这种 道德 又 能 实现 这些 目标 。


References

@article{gehring2017convs2s,
  author        = {Gehring, Jonas and Auli, Michael and Grangier, David and Yarats, Denis and Dauphin, Yann N.},
  title         = "{Convolutional Sequence to Sequence Learning}",
  journal       = {ArXiv e-prints},
  archivePrefix = "arXiv",
  eprint        = {1705.03122},
  primaryClass  = "cs.CL",
  keywords      = {Computer Science - Computation and Language},
  year          = 2017,
  month         = may,
}
@article{gehring2016convenc,
  author        = {Gehring, Jonas and Auli, Michael and Grangier, David and Dauphin, Yann N.},
  title         = "{A Convolutional Encoder Model for Neural Machine Translation}",
  journal       = {ArXiv e-prints},
  archivePrefix = "arXiv",
  eprint        = {1611.02344},
  primaryClass  = "cs.CL",
  keywords      = {Computer Science - Computation and Language},
  year          = 2016,
  month         = nov,
}

License

fairseq is licensed under the terms of its original repo.

Pretrained models in this repo are BSD-licensed.
