
claravania / subword-lstm-lm

Licence: other
LSTM Language Model with Subword Units Input Representations

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to subword-lstm-lm

TF-NNLM-TK
A toolkit for neural language modeling using Tensorflow including basic models like RNNs and LSTMs as well as more advanced models.
Stars: ✭ 20 (-55.56%)
Mutual labels:  language-model
CharLM
Character-aware Neural Language Model implemented by PyTorch
Stars: ✭ 32 (-28.89%)
Mutual labels:  language-model
dasher-web
Dasher text entry in HTML, CSS, JavaScript, and SVG
Stars: ✭ 34 (-24.44%)
Mutual labels:  language-model
pd3f
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Stars: ✭ 132 (+193.33%)
Mutual labels:  language-model
asr24
24-hour Automatic Speech Recognition
Stars: ✭ 27 (-40%)
Mutual labels:  language-model
gap-text2sql
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training
Stars: ✭ 83 (+84.44%)
Mutual labels:  language-model
Zeroth
Kaldi-based Korean ASR (한국어 음성인식) open-source project
Stars: ✭ 248 (+451.11%)
Mutual labels:  language-model
backprop
Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
Stars: ✭ 229 (+408.89%)
Mutual labels:  language-model
calm
Context Aware Language Models
Stars: ✭ 29 (-35.56%)
Mutual labels:  language-model
LanguageModel-using-Attention
Pytorch implementation of a basic language model using Attention in LSTM network
Stars: ✭ 27 (-40%)
Mutual labels:  language-model
rnn-theano
RNN(LSTM, GRU) in Theano with mini-batch training; character-level language models in Theano
Stars: ✭ 68 (+51.11%)
Mutual labels:  language-model
KB-ALBERT
A Korean ALBERT model specialized for the economy/finance domain, released by KB Kookmin Bank (KB국민은행)
Stars: ✭ 215 (+377.78%)
Mutual labels:  language-model
lm-scorer
📃Language Model based sentences scoring library
Stars: ✭ 264 (+486.67%)
Mutual labels:  language-model
COCO-LM
[NeurIPS 2021] COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining
Stars: ✭ 109 (+142.22%)
Mutual labels:  language-model
mlp-gpt-jax
A GPT, made only of MLPs, in Jax
Stars: ✭ 53 (+17.78%)
Mutual labels:  language-model
PLBART
Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].
Stars: ✭ 151 (+235.56%)
Mutual labels:  language-model
personality-prediction
Experiments for automated personality detection using Language Models and psycholinguistic features on various famous personality datasets including the Essays dataset (Big-Five)
Stars: ✭ 109 (+142.22%)
Mutual labels:  language-model
ml
machine learning
Stars: ✭ 29 (-35.56%)
Mutual labels:  language-model
Highway-Transformer
[ACL'20] Highway Transformer: A Gated Transformer.
Stars: ✭ 26 (-42.22%)
Mutual labels:  language-model
swig-srilm
SWIG Wrapper for the SRILM toolkit
Stars: ✭ 33 (-26.67%)
Mutual labels:  language-model

LSTM Language Model with Subword Units Input Representations

These are implementations of various LSTM-based language models in TensorFlow. The code is based on the TensorFlow tutorial on building a PTB LSTM model, with extensions to handle input at the subword-unit level, e.g. characters, character n-grams, or morpheme segments (from BPE or Morfessor).
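As a rough sketch of the overall data flow (illustrative NumPy only; the helper names are hypothetical, not the repository's actual TensorFlow graph code): each word is broken into subword units, the unit embeddings are composed into a single word vector, and the resulting sequence of word vectors is fed to the LSTM language model.

import numpy as np

EMB_DIM = 200
unit_emb = {}   # subword-unit embedding table, grown on the fly

def units_of(word):
    # Hypothetical: character units only; the real code also supports
    # character n-grams, morphemes, and oracle analyses.
    return list(word)

def compose(vectors):
    # Hypothetical: additive composition; a bi-LSTM is the other option.
    return np.sum(vectors, axis=0)

def embed_sentence(words):
    rows = []
    for word in words:
        vecs = [unit_emb.setdefault(u, 0.01 * np.random.randn(EMB_DIM))
                for u in units_of(word)]
        rows.append(compose(vecs))
    return np.stack(rows)   # (len(words), EMB_DIM), fed to the LSTM LM

print(embed_sentence("the cat sat".split()).shape)   # (3, 200)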

Dependencies

  1. TensorFlow (tested on v0.10.0)
  2. Python 3

Training

Use the train.py script to train a model. Below is an example that trains a character bi-LSTM model for English.

python3 train.py --train_file=data/multi/en/train.txt \
                 --dev_file=data/multi/en/dev.txt \
                 --save_dir=model \
                 --unit=char \
                 --composition=bi-lstm \
                 --rnn_size=200 \
                 --batch_size=32 \
                 --num_steps=20 \
                 --learning_rate=1.0 \
                 --decay_rate=0.5 \
                 --keep_prob=0.5 \
                 --lowercase \
                 --SOS=true

Options for --unit are: char, char-ngram, morpheme (BPE/Morfessor), oracle, and word.
Options for --composition are: none (word only), bi-lstm, and addition; the sketch below illustrates what the two non-trivial compositions compute.
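A minimal NumPy sketch of the two compositions (not the repository's code; the shared weights and the concatenation of final states are simplifying assumptions): addition sums the unit embeddings, while bi-lstm runs an LSTM over the unit sequence in both directions and combines the two final states.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_final_state(xs, W, b, hidden):
    # Minimal LSTM; returns the final hidden state of the sequence.
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in xs:
        z = W @ np.concatenate([x, h]) + b      # all four gates in one matmul
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def compose(unit_vectors, composition="addition", hidden=100):
    if composition == "addition":
        return np.sum(unit_vectors, axis=0)
    if composition == "bi-lstm":
        rng = np.random.default_rng(0)
        dim = unit_vectors[0].shape[0]
        # One weight matrix shared by both directions, purely for brevity;
        # a real bi-LSTM has separate, learned parameters per direction.
        W = rng.normal(0.0, 0.1, (4 * hidden, dim + hidden))
        b = np.zeros(4 * hidden)
        fwd = lstm_final_state(unit_vectors, W, b, hidden)
        bwd = lstm_final_state(unit_vectors[::-1], W, b, hidden)
        return np.concatenate([fwd, bwd])   # one way to merge the two states
    raise ValueError(composition)

chars = [np.random.randn(50) for _ in "cats"]   # one embedding per character
print(compose(chars, "addition").shape)         # (50,)
print(compose(chars, "bi-lstm").shape)          # (200,)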

The morpheme representation uses BPE-like segmentation. Each word is replaced by its segments; for example, imperfect is written as im@@perfect, where @@ denotes the segment boundary. You can use the segmentation tool provided here to preprocess your dataset.
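Once a line has been segmented this way, the @@ markers make the subword units of each token easy to recover; a tiny hypothetical helper (not part of this repository), shown only to make the convention concrete:

def subword_units(token):
    # Split a segmented token such as 'im@@perfect' on its '@@' markers.
    return token.split("@@")

line = "an im@@perfect seg@@mentation"
print([subword_units(tok) for tok in line.split()])
# [['an'], ['im', 'perfect'], ['seg', 'mentation']]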

In the oracle setting, you need to replace each word in the data with its morphological analysis. For example, in Czech the word Dodavatel is replaced by the following (note that the actual word form is not used in the experiments):

word:Dodavatel+lemma:dodavatel+pos:NOUN+Animacy:Anim+Case:Nom+Gender:Masc+Negative:Pos+Number:Sing
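To make the format concrete, a small hypothetical parsing sketch (not part of this repository) that splits such an analysis into its key:value units:

def oracle_units(analysis):
    # Split an analysis such as 'word:X+lemma:y+pos:NOUN+...' on '+'.
    return analysis.split("+")

units = oracle_units("word:Dodavatel+lemma:dodavatel+pos:NOUN+Animacy:Anim"
                     "+Case:Nom+Gender:Masc+Negative:Pos+Number:Sing")
print(units[:3])   # ['word:Dodavatel', 'lemma:dodavatel', 'pos:NOUN']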

Please look at train.py for more hyperparameter options.

Testing

To test a model, run test.py.

python3 test.py --test_file=data/multi/en/test.txt \
                --save_dir=model

Notes

Character-based bi-LSTM model:
"Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation".
http://www.cs.cmu.edu/~lingwang/papers/emnlp2015.pdf

Word segments (BPE) model:
"Neural Machine Translation of Rare Words with Subword Units"
http://www.aclweb.org/anthology/P16-1162

Character ngrams:
The model first segments a word into its character n-grams, e.g. cat = ('c', 'a', 't', '^c', 'ca', 'at', 't$'). The embedding of the word is computed by summing the embeddings of all its n-grams.
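A minimal sketch of this composition, assuming unigrams plus boundary-marked bigrams as in the example above (the random vectors stand in for the learned embedding table):

import numpy as np

def char_ngrams(word):
    # Unigrams plus bigrams over the word padded with ^ and $ boundaries.
    padded = "^" + word + "$"
    return list(word) + [padded[i:i + 2] for i in range(len(padded) - 1)]

print(char_ngrams("cat"))   # ['c', 'a', 't', '^c', 'ca', 'at', 't$']

# The word embedding is the sum of its n-gram embeddings.
rng = np.random.default_rng(0)
emb = {g: rng.normal(size=200) for g in char_ngrams("cat")}
print(np.sum([emb[g] for g in char_ngrams("cat")], axis=0).shape)   # (200,)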
