
Huffon / pytorch-transformer-kor-eng

Licence: other
Transformer Implementation using PyTorch for Neural Machine Translation (Korean to English)

Programming Languages

python

Projects that are alternatives to or similar to pytorch-transformer-kor-eng

Pytorch Seq2seq
Tutorials on implementing a few sequence-to-sequence (seq2seq) models with PyTorch and TorchText.
Stars: ✭ 3,418 (+8445%)
Mutual labels:  sequence-to-sequence, torchtext
dhs summit 2019 image captioning
Image captioning using attention models
Stars: ✭ 34 (-15%)
Mutual labels:  sequence-to-sequence
Word-Level-Eng-Mar-NMT
Translating English sentences to Marathi using Neural Machine Translation
Stars: ✭ 37 (-7.5%)
Mutual labels:  sequence-to-sequence
Tianchi2020ChineseMedicineQuestionGeneration
2020 Alibaba Cloud Tianchi Big Data Competition: Traditional Chinese Medicine Literature Question Generation Challenge
Stars: ✭ 20 (-50%)
Mutual labels:  sequence-to-sequence
Transformer-Transducer
PyTorch implementation of "Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss" (ICASSP 2020)
Stars: ✭ 61 (+52.5%)
Mutual labels:  sequence-to-sequence
deep-trans
Transliterating English to Hindi using Recurrent Neural Networks
Stars: ✭ 44 (+10%)
Mutual labels:  sequence-to-sequence
deep-spell-checkr
Keras implementation of character-level sequence-to-sequence learning for spelling correction
Stars: ✭ 65 (+62.5%)
Mutual labels:  sequence-to-sequence
text-generation-transformer
text generation based on transformer
Stars: ✭ 36 (-10%)
Mutual labels:  sequence-to-sequence
Natural-Language-Processing
Contains various architectures and novel paper implementations for Natural Language Processing tasks like Sequence Modelling and Neural Machine Translation.
Stars: ✭ 48 (+20%)
Mutual labels:  sequence-to-sequence
Sequence-to-Sequence-Learning-of-Financial-Time-Series-in-Algorithmic-Trading
My bachelor's thesis—analyzing the application of LSTM-based RNNs on financial markets. 🤓
Stars: ✭ 64 (+60%)
Mutual labels:  sequence-to-sequence
A-Persona-Based-Neural-Conversation-Model
No description or website provided.
Stars: ✭ 22 (-45%)
Mutual labels:  sequence-to-sequence
dynmt-py
Neural machine translation implementation using dynet's python bindings
Stars: ✭ 17 (-57.5%)
Mutual labels:  sequence-to-sequence
protein-transformer
Predicting protein structure through sequence modeling
Stars: ✭ 77 (+92.5%)
Mutual labels:  sequence-to-sequence
HE2LaTeX
Converting handwritten equations to LaTeX
Stars: ✭ 84 (+110%)
Mutual labels:  sequence-to-sequence
classy
classy is a simple-to-use library for building high-performance Machine Learning models in NLP.
Stars: ✭ 61 (+52.5%)
Mutual labels:  sequence-to-sequence
KBRD
Towards Knowledge-Based Recommender Dialog System @ EMNLP 2019
Stars: ✭ 123 (+207.5%)
Mutual labels:  sequence-to-sequence
berserker
Berserker - BERt chineSE woRd toKenizER
Stars: ✭ 17 (-57.5%)
Mutual labels:  sequence-to-sequence
ECG-Heartbeat-Classification-seq2seq-model
Inter- and intra- patient ECG heartbeat classification for arrhythmia detection: a sequence to sequence deep learning approach
Stars: ✭ 125 (+212.5%)
Mutual labels:  sequence-to-sequence
Seq2Seq-Tensorflow
[In-Progress] Tensorflow implementation of Sequence to Sequence Learning with Neural Networks
Stars: ✭ 18 (-55%)
Mutual labels:  sequence-to-sequence
MWPToolkit
MWPToolkit is an open-source framework for math word problem(MWP) solvers.
Stars: ✭ 67 (+67.5%)
Mutual labels:  sequence-to-sequence

Transformer PyTorch implementation

This repository contains a Transformer implementation used to translate Korean sentences into English.

I used a translation dataset for NMT, but you can apply this model to any sequence-to-sequence (i.e. text generation) task, such as text summarization or response generation.

In this project, I used the Korean-English translation corpus from AI Hub to apply torchtext to a Korean dataset.

I also used the soynlp library to tokenize Korean sentences. It is really nice and easy to use; you should try it if you plan to handle Korean text :)
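
As a rough illustration of how soynlp's unsupervised tokenizer is typically trained and used, here is a minimal sketch based on soynlp's documented WordExtractor/LTokenizer pattern; build_pickles.py in this repository may wire things up differently.

from soynlp.word import WordExtractor
from soynlp.tokenizer import LTokenizer

# Train the unsupervised word extractor on raw Korean sentences.
# (Use the full training corpus in practice; two sentences are illustrative only.)
sentences = ['내일 여자친구를 만나러 가요', '감기 조심하세요']
word_extractor = WordExtractor()
word_extractor.train(sentences)
word_scores = word_extractor.extract()

# Drive a left-to-right tokenizer with the forward cohesion scores.
cohesion_scores = {word: score.cohesion_forward for word, score in word_scores.items()}
tokenizer = LTokenizer(scores=cohesion_scores)
print(tokenizer.tokenize('내일 여자친구를 만나러 가요'))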

Currently, the lowest validation and test losses are 2.047 and 3.488, respectively.
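
If these are average per-token cross-entropy losses (an assumption; the loss formulation is not stated here), they correspond roughly to the following perplexities:

import math

# perplexity = exp(mean per-token cross-entropy loss)
print(math.exp(2.047))  # ≈ 7.74  (validation)
print(math.exp(3.488))  # ≈ 32.72 (test)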


Overview

  • Number of train data: 92,000
  • Number of validation data: 11,500
  • Number of test data: 11,500
Example: 
{
  'kor': ['부러진', '날개로', '다시한번', '날개짓을', '하라'],
  'eng': ['wings', 'once', 'again', 'with', 'broken', 'wings']
}

Requirements

  • The following libraries are fundamental to this repository.
  • You should install PyTorch via the official installation guide.
  • To use the spaCy model that tokenizes English sentences, download the English model by running python -m spacy download en_core_web_sm (see the verification sketch after this list).
en-core-web-sm==2.1.0
matplotlib==3.1.1
numpy==1.16.4
pandas==0.25.1
scikit-learn==0.21.3
soynlp==0.0.493
spacy==2.1.8
torch==1.2.0
torchtext==0.4.0
  • If you have trouble installing torch and torchtext, run pip install torch==1.2.0 -f https://download.pytorch.org/whl/torch_stable.html and pip install torchtext==0.4.0
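
As a quick check that the spaCy model mentioned above is installed correctly, this minimal sketch tokenizes a sample English sentence (the sentence itself is illustrative):

import spacy

# Load the small English model installed via `python -m spacy download en_core_web_sm`.
nlp = spacy.load('en_core_web_sm')

# Only the tokenizer is needed for this repository's preprocessing.
tokens = [token.text for token in nlp.tokenizer('Be careful not to catch a cold')]
print(tokens)  # ['Be', 'careful', 'not', 'to', 'catch', 'a', 'cold']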

Usage

  • Before training the model, you should train the soynlp tokenizer on your training dataset and build the vocabularies using the following command.
  • You can set the vocabulary size for each of the Korean and English datasets.
  • In general, the Korean dataset yields a larger vocabulary than the English one, so choose the vocabulary sizes accordingly to keep them balanced.
  • Running the following command produces tokenizer.pickle, kor.pickle, and eng.pickle, which are used to train and test the model and to translate user input (a sketch of loading these files follows the commands below).
python build_pickles.py --kor_vocab KOREAN_VOCAB_SIZE --eng_vocab ENGLISH_VOCAB_SIZE
# with the default vocab sizes
python build_pickles.py
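
As a hedged sketch of how the resulting artifacts can be inspected, the snippet below unpickles them; the pickles/ directory and the assumption that kor.pickle and eng.pickle hold torchtext fields are guesses, so check build_pickles.py for the actual paths and contents.

import pickle

# Hypothetical paths; build_pickles.py determines the real output location.
with open('pickles/tokenizer.pickle', 'rb') as f:
    tokenizer = pickle.load(f)
with open('pickles/kor.pickle', 'rb') as f:
    kor = pickle.load(f)  # assumed: torchtext field holding the Korean vocab
with open('pickles/eng.pickle', 'rb') as f:
    eng = pickle.load(f)  # assumed: torchtext field holding the English vocab

print(len(kor.vocab), len(eng.vocab))  # compare the two vocabulary sizes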
  • For training, run main.py in train mode (the default).
python main.py
  • For testing, run main.py in test mode.
python main.py --mode test
  • For predicting, run predict.py with your Korean input sentence.
  • Don't forget to wrap your input in double quotation marks!
python predict.py --input "YOUR_KOREAN_INPUT"

Example

kor> 내일 여자친구를 만나러 가요
eng> I am going to meet my girlfriend tomorrow

kor> 감기 조심하세요
eng> Be careful not to catch a cold

To do

  • Add beam search for the decoding step
  • Add label smoothing technique #1, #2, #3

References

Most of my code follows the original paper. However, I found differences between the original paper and the practical implementation in the tensor2tensor framework, so I changed some code to match the practical implementation and got better results. For details on these changes, check out the last reference article.
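
One widely known difference between the paper and tensor2tensor's practical implementation is the placement of layer normalization: the paper applies it after the residual addition (post-norm), while tensor2tensor normalizes the sublayer input (pre-norm). Whether this is the exact change adopted here is an assumption; the sketch below only illustrates the pre-norm pattern.

import torch.nn as nn

class PreNormSublayer(nn.Module):
    # Pre-norm residual wrapper: x + dropout(sublayer(LayerNorm(x))),
    # versus the paper's post-norm LayerNorm(x + dropout(sublayer(x))).
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))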
