
claravania / subword-lstm-lm

Licence: other
LSTM Language Model with Subword Units Input Representations

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to subword-lstm-lm

TF-NNLM-TK
A toolkit for neural language modeling using Tensorflow including basic models like RNNs and LSTMs as well as more advanced models.
Stars: ✭ 20 (-55.56%)
Mutual labels:  language-model
CharLM
Character-aware Neural Language Model implemented by PyTorch
Stars: ✭ 32 (-28.89%)
Mutual labels:  language-model
dasher-web
Dasher text entry in HTML, CSS, JavaScript, and SVG
Stars: ✭ 34 (-24.44%)
Mutual labels:  language-model
pd3f
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Stars: ✭ 132 (+193.33%)
Mutual labels:  language-model
asr24
24-hour Automatic Speech Recognition
Stars: ✭ 27 (-40%)
Mutual labels:  language-model
gap-text2sql
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training
Stars: ✭ 83 (+84.44%)
Mutual labels:  language-model
Zeroth
Kaldi-based Korean ASR (한국어 음성인식) open-source project
Stars: ✭ 248 (+451.11%)
Mutual labels:  language-model
backprop
Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
Stars: ✭ 229 (+408.89%)
Mutual labels:  language-model
calm
Context Aware Language Models
Stars: ✭ 29 (-35.56%)
Mutual labels:  language-model
LanguageModel-using-Attention
Pytorch implementation of a basic language model using Attention in LSTM network
Stars: ✭ 27 (-40%)
Mutual labels:  language-model
rnn-theano
RNN(LSTM, GRU) in Theano with mini-batch training; character-level language models in Theano
Stars: ✭ 68 (+51.11%)
Mutual labels:  language-model
KB-ALBERT
A Korean ALBERT model specialized for the economy/finance domain, released by KB Kookmin Bank (KB국민은행)
Stars: ✭ 215 (+377.78%)
Mutual labels:  language-model
lm-scorer
📃Language Model based sentences scoring library
Stars: ✭ 264 (+486.67%)
Mutual labels:  language-model
COCO-LM
[NeurIPS 2021] COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining
Stars: ✭ 109 (+142.22%)
Mutual labels:  language-model
mlp-gpt-jax
A GPT, made only of MLPs, in Jax
Stars: ✭ 53 (+17.78%)
Mutual labels:  language-model
PLBART
Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].
Stars: ✭ 151 (+235.56%)
Mutual labels:  language-model
personality-prediction
Experiments for automated personality detection using Language Models and psycholinguistic features on various famous personality datasets including the Essays dataset (Big-Five)
Stars: ✭ 109 (+142.22%)
Mutual labels:  language-model
ml
machine learning
Stars: ✭ 29 (-35.56%)
Mutual labels:  language-model
Highway-Transformer
[ACL'20] Highway Transformer: A Gated Transformer.
Stars: ✭ 26 (-42.22%)
Mutual labels:  language-model
swig-srilm
SWIG Wrapper for the SRILM toolkit
Stars: ✭ 33 (-26.67%)
Mutual labels:  language-model

LSTM Language Model with Subword Units Input Representations

These are implementations of various LSTM-based language models in TensorFlow. The code is based on the TensorFlow tutorial on building a PTB LSTM model, with extensions to handle input at the subword-unit level, e.g. characters, character n-grams, or morpheme segments (from BPE or Morfessor).
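As a rough sketch of the overall data flow (illustrative NumPy only; the helper names are hypothetical, not the repository's actual TensorFlow graph code): each word is broken into subword units, the unit embeddings are composed into a single word vector, and the resulting sequence of word vectors is fed to the LSTM language model.

import numpy as np

EMB_DIM = 200
unit_emb = {}   # subword-unit embedding table, grown on the fly

def units_of(word):
    # Hypothetical: character units only; the real code also supports
    # character n-grams, morphemes, and oracle analyses.
    return list(word)

def compose(vectors):
    # Hypothetical: additive composition; a bi-LSTM is the other option.
    return np.sum(vectors, axis=0)

def embed_sentence(words):
    rows = []
    for word in words:
        vecs = [unit_emb.setdefault(u, 0.01 * np.random.randn(EMB_DIM))
                for u in units_of(word)]
        rows.append(compose(vecs))
    return np.stack(rows)   # (len(words), EMB_DIM), fed to the LSTM LM

print(embed_sentence("the cat sat".split()).shape)   # (3, 200)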

Dependencies

  1. TensorFlow (tested on v0.10.0)
  2. Python 3

Training

Use the train.py script to train a model. Below is an example that trains a character bi-LSTM model for English.

python3 train.py --train_file=data/multi/en/train.txt \
                 --dev_file=data/multi/en/dev.txt \
                 --save_dir=model \
                 --unit=char \
                 --composition=bi-lstm \
                 --rnn_size=200 \
                 --batch_size=32 \
                 --num_steps=20 \
                 --learning_rate=1.0 \
                 --decay_rate=0.5 \
                 --keep_prob=0.5 \
                 --lowercase \
                 --SOS=true

Options for --unit are: char, char-ngram, morpheme (BPE/Morfessor), oracle, and word.
Options for --composition are: none (word only), bi-lstm, and addition; the sketch below illustrates what the two non-trivial compositions compute.
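A minimal NumPy sketch of the two compositions (not the repository's code; the shared weights and the concatenation of final states are simplifying assumptions): addition sums the unit embeddings, while bi-lstm runs an LSTM over the unit sequence in both directions and combines the two final states.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_final_state(xs, W, b, hidden):
    # Minimal LSTM; returns the final hidden state of the sequence.
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in xs:
        z = W @ np.concatenate([x, h]) + b      # all four gates in one matmul
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def compose(unit_vectors, composition="addition", hidden=100):
    if composition == "addition":
        return np.sum(unit_vectors, axis=0)
    if composition == "bi-lstm":
        rng = np.random.default_rng(0)
        dim = unit_vectors[0].shape[0]
        # One weight matrix shared by both directions, purely for brevity;
        # a real bi-LSTM has separate, learned parameters per direction.
        W = rng.normal(0.0, 0.1, (4 * hidden, dim + hidden))
        b = np.zeros(4 * hidden)
        fwd = lstm_final_state(unit_vectors, W, b, hidden)
        bwd = lstm_final_state(unit_vectors[::-1], W, b, hidden)
        return np.concatenate([fwd, bwd])   # one way to merge the two states
    raise ValueError(composition)

chars = [np.random.randn(50) for _ in "cats"]   # one embedding per character
print(compose(chars, "addition").shape)         # (50,)
print(compose(chars, "bi-lstm").shape)          # (200,)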

The morpheme representation uses BPE-like segmentation. Each word is replaced by its segments; for example, imperfect is written as im@@perfect, where @@ denotes the segment boundary. You can use the segmentation tool provided here to preprocess your dataset.
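Once a line has been segmented this way, the @@ markers make the subword units of each token easy to recover; a tiny hypothetical helper (not part of this repository), shown only to make the convention concrete:

def subword_units(token):
    # Split a segmented token such as 'im@@perfect' on its '@@' markers.
    return token.split("@@")

line = "an im@@perfect seg@@mentation"
print([subword_units(tok) for tok in line.split()])
# [['an'], ['im', 'perfect'], ['seg', 'mentation']]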

In the oracle setting, you need to replace each word in the data with its morphological analysis. For example, in Czech the word Dodavatel is replaced by the following (note that the actual word form is not used in the experiments):

word:Dodavatel+lemma:dodavatel+pos:NOUN+Animacy:Anim+Case:Nom+Gender:Masc+Negative:Pos+Number:Sing
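To make the format concrete, a small hypothetical parsing sketch (not part of this repository) that splits such an analysis into its key:value units:

def oracle_units(analysis):
    # Split an analysis such as 'word:X+lemma:y+pos:NOUN+...' on '+'.
    return analysis.split("+")

units = oracle_units("word:Dodavatel+lemma:dodavatel+pos:NOUN+Animacy:Anim"
                     "+Case:Nom+Gender:Masc+Negative:Pos+Number:Sing")
print(units[:3])   # ['word:Dodavatel', 'lemma:dodavatel', 'pos:NOUN']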

Please look at train.py for more hyperparameter options.

Testing

To test a model, run test.py.

python3 test.py --test_file=data/multi/en/test.txt \
                --save_dir=model

Notes

Character-based bi-LSTM model:
"Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation".
http://www.cs.cmu.edu/~lingwang/papers/emnlp2015.pdf

Word segments (BPE) model:
"Neural Machine Translation of Rare Words with Subword Units"
http://www.aclweb.org/anthology/P16-1162

Character ngrams:
The model first segments a word into its character n-grams, e.g. cat = ('c', 'a', 't', '^c', 'ca', 'at', 't$'). The embedding of the word is computed by summing the embeddings of all its n-grams.
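A minimal sketch of this composition, assuming unigrams plus boundary-marked bigrams as in the example above (the random vectors stand in for the learned embedding table):

import numpy as np

def char_ngrams(word):
    # Unigrams plus bigrams over the word padded with ^ and $ boundaries.
    padded = "^" + word + "$"
    return list(word) + [padded[i:i + 2] for i in range(len(padded) - 1)]

print(char_ngrams("cat"))   # ['c', 'a', 't', '^c', 'ca', 'at', 't$']

# The word embedding is the sum of its n-gram embeddings.
rng = np.random.default_rng(0)
emb = {g: rng.normal(size=200) for g in char_ngrams("cat")}
print(np.sum([emb[g] for g in char_ngrams("cat")], axis=0).shape)   # (200,)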
