
dashayushman / deep-trans

License: MIT
Transliterating English to Hindi using Recurrent Neural Networks


Projects that are alternatives of or similar to deep-trans

dhs summit 2019 image captioning
Image captioning using attention models
Stars: ✭ 34 (-22.73%)
Mutual labels:  lstm, sequence-to-sequence
Language Translation
Neural machine translator for English2German translation.
Stars: ✭ 82 (+86.36%)
Mutual labels:  lstm, sequence-to-sequence
Deeplearning.ai Assignments
Stars: ✭ 268 (+509.09%)
Mutual labels:  lstm, sequence-to-sequence
Pytorch Seq2seq
Tutorials on implementing a few sequence-to-sequence (seq2seq) models with PyTorch and TorchText.
Stars: ✭ 3,418 (+7668.18%)
Mutual labels:  lstm, sequence-to-sequence
Hred Attention Tensorflow
An extension of the Hierarchical Recurrent Encoder-Decoder for Generative Context-Aware Query Suggestion; our implementation is in TensorFlow and uses an attention mechanism.
Stars: ✭ 68 (+54.55%)
Mutual labels:  lstm, sequence-to-sequence
Deeptranslit
Efficient and easy to use transliteration for Indian languages
Stars: ✭ 41 (-6.82%)
Mutual labels:  transliteration, lstm
deep-char-cnn-lstm
Deep Character CNN LSTM Encoder with Classification and Similarity Models
Stars: ✭ 20 (-54.55%)
Mutual labels:  lstm
Simple-Tensor
A simplification of Tensorflow Tensor Operations
Stars: ✭ 17 (-61.36%)
Mutual labels:  lstm
keras-malicious-url-detector
Malicious URL detector using keras recurrent networks and scikit-learn classifiers
Stars: ✭ 24 (-45.45%)
Mutual labels:  lstm
datastories-semeval2017-task6
Deep-learning model presented in "DataStories at SemEval-2017 Task 6: Siamese LSTM with Attention for Humorous Text Comparison".
Stars: ✭ 20 (-54.55%)
Mutual labels:  lstm
SentimentAnalysis
(BOW, TF-IDF, Word2Vec, BERT) Word Embeddings + (SVM, Naive Bayes, Decision Tree, Random Forest) Base Classifiers + Pre-trained BERT on Tensorflow Hub + 1-D CNN and Bi-Directional LSTM on IMDB Movie Reviews Dataset
Stars: ✭ 40 (-9.09%)
Mutual labels:  lstm
Tianchi2020ChineseMedicineQuestionGeneration
2020 Alibaba Cloud Tianchi Big Data Competition: Traditional Chinese Medicine Literature Question Generation Challenge
Stars: ✭ 20 (-54.55%)
Mutual labels:  sequence-to-sequence
unidecoder
Replace Unicode characters with sensible US-ASCII equivalents
Stars: ✭ 67 (+52.27%)
Mutual labels:  transliteration
A-Persona-Based-Neural-Conversation-Model
No description or website provided.
Stars: ✭ 22 (-50%)
Mutual labels:  sequence-to-sequence
AdvancedDeepLearning
Advanced Deep Learning
Stars: ✭ 28 (-36.36%)
Mutual labels:  lstm
extkeras
Playground for implementing custom layers and other components compatible with Keras, with the purpose of learning the framework better and perhaps offering some utils to others in the future.
Stars: ✭ 18 (-59.09%)
Mutual labels:  lstm
lstm-electric-load-forecast
Electric load forecast using Long-Short-Term-Memory (LSTM) recurrent neural network
Stars: ✭ 56 (+27.27%)
Mutual labels:  lstm
novel writer
Train an LSTM to write a novel (HongLouMeng here) in PyTorch.
Stars: ✭ 14 (-68.18%)
Mutual labels:  lstm
LearningMetersPoems
Official repo of the article: Yousef, W. A., Ibrahime, O. M., Madbouly, T. M., & Mahmoud, M. A. (2019), "Learning meters of arabic and english poems with recurrent neural networks: a step forward for language understanding and synthesis", arXiv preprint arXiv:1905.05700
Stars: ✭ 18 (-59.09%)
Mutual labels:  lstm
sequence-rnn-py
Sequence analysis using Recurrent Neural Networks (RNN) based on Keras
Stars: ✭ 28 (-36.36%)
Mutual labels:  lstm


DeepTrans is a character-level language model for transliterating English text into Hindi. It is based on the attention mechanism presented in [1] and its implementation in TensorFlow's sequence-to-sequence models, and was inspired by the translation model in TensorFlow's sequence-to-sequence tutorial. The project ships with a pretrained Hindi model (2 layers with 256 units each), which can be used as-is, trained further, or retrained from scratch. The pretrained models are trained on lowercase words. If you train your own model, feel free to experiment however you like; I would be glad if you shared your results and models with me.
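
For orientation, the sketch below shows the kind of network this corresponds to in the TensorFlow 0.x seq2seq API. It is illustrative, not the repo's exact code: the cell type is an assumption (GRU, as in TensorFlow's translation tutorial), and the bucket length of 20 is arbitrary.

# Illustrative sketch only (TensorFlow 0.x API), not the repo's exact model.
import tensorflow as tf

size, num_layers = 256, 2  # the pretrained model: 2 layers, 256 units each
cell = tf.nn.rnn_cell.MultiRNNCell([tf.nn.rnn_cell.GRUCell(size)] * num_layers)

# One int32 tensor of character ids per time step (character-level model).
encoder_inputs = [tf.placeholder(tf.int32, [None]) for _ in range(20)]
decoder_inputs = [tf.placeholder(tf.int32, [None]) for _ in range(20)]

# Embedding + attention seq2seq, as in TensorFlow's sequence-to-sequence models.
outputs, state = tf.nn.seq2seq.embedding_attention_seq2seq(
    encoder_inputs, decoder_inputs, cell,
    num_encoder_symbols=40000, num_decoder_symbols=40000,  # FLAG defaults below
    embedding_size=size, feed_previous=True)  # feed_previous=True when decoding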

Prerequisites

  1. TensorFlow (version >= 0.9)
  2. Python 2.7

I have tested it on Ubuntu 15.04 with an NVIDIA GeForce GT 740M graphics card, with TensorFlow running in a virtual environment. It should run smoothly on any other system with TensorFlow installed.

Installation and Setup

Clone Repository

git clone https://github.com/dashayushman/deep-trans.git

Run Tests

python transliterate.py --self_test

This will generate a small dummy model (2 layers, 32 units per layer) with synthetic data and train it for 5 steps.
If the command finishes without errors, proceed to the next step.

Download The Model and Vocabulary

  1. Download the pre-trained model from here and extract the model files to any folder on your system. The folder structure for models looks like the following:
trained_model
    |_version_1.0
            |_model_12_09_2016.zip
            |_model_12_09_2016.tar
    |_version_0.1
            |_model_9_08_2016.zip
            |_model_9_08_2016.tar
  2. Download the vocabulary from here and extract the vocabulary files to any folder on your system. The folder structure for vocabulary looks like the following:
vocabulary
    |_version_1.0
            |_vocab_12_09_2016.zip
            |_vocab_12_09_2016.tar
    |_version_0.1
            |_vocab_9_08_2016.zip
            |_vocab_9_08_2016.tar

The pretrained models and vocabularies are versioned, with the date attached to the names of the compressed files. Downloading the latest version is recommended. The download link contains both .tar and .zip files; both contain the same model, so download either one. Make sure that the dates and versions of your model and vocabulary match.
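
A small helper sketch (not part of the repo) for extracting the archives and checking that the model and vocabulary dates match; the file names are the examples from the trees above:

import re
import zipfile

def archive_date(name):
    # e.g. 'model_12_09_2016.zip' -> '12_09_2016'
    match = re.search(r'_(\d{1,2}_\d{2}_\d{4})\.(zip|tar)$', name)
    return match.group(1) if match else None

model_zip = 'model_12_09_2016.zip'
vocab_zip = 'vocab_12_09_2016.zip'
assert archive_date(model_zip) == archive_date(vocab_zip), 'model/vocab versions differ'

zipfile.ZipFile(model_zip).extractall('trained_model/version_1.0')
zipfile.ZipFile(vocab_zip).extractall('vocabulary/version_1.0')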

Load and Run

Loading the model

Execute the following command from your command line to load the pretrained model and enter an interactive mode, where you can type English strings on standard input and see the results immediately.

python transliterate.py --data_dir <path_to_vocabulary_directory> --train_dir <path_to_models_directory> --decode

Your command line will then show a '>' prompt. Enter an English word after the '>' and hit enter to see the transliteration.
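
If you want to script the interactive mode instead of typing words manually, something along these lines should work (a hedged sketch; the angle-bracket paths are placeholders, as in the command above):

import subprocess

# Drive the interactive decoder by writing words to its standard input.
proc = subprocess.Popen(
    ['python', 'transliterate.py',
     '--data_dir', '<path_to_vocabulary_directory>',
     '--train_dir', '<path_to_models_directory>',
     '--decode'],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE)

output, _ = proc.communicate('namaste\n')  # one English word, then newline
print(output)  # includes the '>' prompts and the transliteration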

Transliterate a file

Execute the following command from your command line to load the pretrained model and transliterate an entire file.
Make sure the file contains one English word per line and is named 'test.en'.

python transliterate.py --data_dir <path_to_vocabulary_directory> --train_dir <path_to_models_directory> --transliterate_file --transliterate_file_dir <path_to_directory_that_contains_test.en>

If you see a 'done generating the output file!!!' message on your command line, you are good to go. You will find a 'results.txt' file in your 'transliterate_file_dir'.
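
For example, you could prepare 'test.en' and read back 'results.txt' with a few lines of Python (a minimal sketch; '/tmp/translit' is a placeholder directory and the words are illustrative):

import codecs

work_dir = '/tmp/translit'    # placeholder: your transliterate_file_dir (must exist)
words = ['namaste', 'dilli']  # one English word per line, as the README requires

with open(work_dir + '/test.en', 'w') as f:
    f.write('\n'.join(words) + '\n')

# ...run the transliterate_file command above, then read the output:
with codecs.open(work_dir + '/results.txt', 'r', 'utf-8') as f:
    print(f.read())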

Train Your Own Model

Requirements

  1. Training and development files: You will need two sets of files to train your own model.
  • Training files: Two files named 'train.rel.2.en' and 'train.rel.2.hn'. 'train.rel.2.en' should contain all the English training words, one word per line with each character separated by a space. 'train.rel.2.hn' should contain the corresponding Hindi words in the same format and in the same order. Make sure the English and Hindi words correspond line by line, otherwise you will end up training a very messy model.
  • Development files: Two files named 'test.rel.2.en' and 'test.rel.2.hn'. 'test.rel.2.en' should contain the English validation words and 'test.rel.2.hn' the corresponding Hindi words, in the same format as the training files. Again, make sure the English and Hindi words correspond.
  2. Try not to overlap the development and training sets.
  3. Keep these files in a single directory.
  4. Very important: due to character-encoding issues in Python 2.7, I had to impose this formatting restriction on the data (adding spaces between every character in a word). I will soon release a version with Python 3+ support that solves the encoding issue and removes this restriction.
  5. This is how the data files should look: one word per line, each character separated by a space (see the sketch below).
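
As a concrete illustration, here is a small Python 2 sketch that writes a pair of training files in the required format. The two word pairs are my own examples, not from the project's dataset:

# -*- coding: utf-8 -*-
# Writes a tiny pair of training files in the required format:
# one word per line, every character separated by a space.
import codecs

pairs = [(u'namaste', u'नमस्ते'),   # illustrative example pairs only
         (u'dilli', u'दिल्ली')]

with codecs.open('train.rel.2.en', 'w', 'utf-8') as en_file, \
     codecs.open('train.rel.2.hn', 'w', 'utf-8') as hn_file:
    for english, hindi in pairs:
        en_file.write(u' '.join(english.lower()) + u'\n')  # e.g. "n a m a s t e"
        hn_file.write(u' '.join(hindi) + u'\n')            # e.g. "न म स ् त े"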

Training

Once you have the above files in a directory, execute the following command to start training your own model.

python transliterate.py --data_dir <path_to_directory_with_training_and_development_files> --train_dir <path_to_a_directory_to_save_checkpoints> --size=<number_of_units_per_layer> --num_layers=<number_of_layers> --steps_per_checkpoint=<number_of_steps_to_save_a_checkpoint>

The following is a real example of the above,

python transliterate.py --data_dir /home/ayushman/projects/transliterate/train_test_data/ --train_dir /home/ayushman/projects/transliterate/chkpnts/ --size=1024 --num_layers=5 --steps_per_checkpoint=1000

FLAGS

The following is a list of available flags that you can set for changing the model parameters.

FLAG                          VALUE TYPE      DEFAULT  DESCRIPTION
learning_rate                 Float           0.001    Learning rate for backpropagation through time.
learning_rate_decay_factor    Float           0.99     Learning rate decays by this much.
max_gradient_norm             Float           5.0      Clip gradients to this norm.
batch_size                    Integer         10       Batch size to use during training.
size                          Integer         256      Size of each model layer.
num_layers                    Integer         2        Number of layers in the model.
en_vocab_size                 Integer         40000    English vocabulary size.
hn_vocab_size                 Integer         40000    Hindi vocabulary size.
data_dir                      String (path)   /tmp     Data directory.
transliterate_file_dir        String (path)   /tmp     Directory containing 'test.en' for file transliteration.
train_dir                     String (path)   /tmp     Training directory (to save checkpoints or models).
max_train_data_size           Integer         0        Limit on the size of training data (0: no limit).
steps_per_checkpoint          Integer         200      How many training steps to do per checkpoint.
decode                        Boolean         False    Set to True for interactive decoding.
transliterate_file            Boolean         False    Set to True for transliterating a file.
self_test                     Boolean         False    Run a self-test if this is set to True.

References

  1. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems (NIPS).