An NCE implementation in PyTorch

About NCE

Noise Contrastive Estimation (NCE) is an approximation method used to work around the huge computational cost of a large softmax layer. The basic idea is to convert the prediction problem into a binary classification problem at training time. It has been proven that the two training criteria converge to the same optimum, as long as the noise distribution is close enough to the real one.

NCE bridges the gap between generative and discriminative models, rather than merely speeding up the softmax layer. With NCE, you can turn almost anything into a posterior with little effort (I think).
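
To make the classification view concrete, here is a minimal sketch of an NCE loss computed directly on logits. The function and its signature are illustrative, not this repo's API: it assumes an unnormalized model score for each true token, scores for K noise samples, their log-probabilities under the noise distribution, and a constant normalization term ln(Z).

import math
import torch.nn.functional as F

def nce_loss(target_logit, noise_logits, target_logq, noise_logq, norm_term=9.0):
    # Hypothetical shapes: target_logit (B,), noise_logits (B, K),
    # target_logq (B,), noise_logq (B, K)
    k = noise_logits.size(1)
    # Treat the partition function as a constant: log p_model = logit - ln(Z)
    log_p_target = target_logit - norm_term
    log_p_noise = noise_logits - norm_term
    # P(data|w) = p_model / (p_model + k*q) = sigmoid(log p_model - log(k*q))
    delta_target = log_p_target - (math.log(k) + target_logq)
    delta_noise = log_p_noise - (math.log(k) + noise_logq)
    # Maximize log P(data|target) plus log P(noise|each noise sample)
    loss = -(F.logsigmoid(delta_target) + F.logsigmoid(-delta_noise).sum(1))
    return loss.mean()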

Refs:

NCE:

http://www.cs.helsinki.fi/u/ahyvarin/papers/Gutmann10AISTATS.pdf

NCE on RNNLM:

https://pdfs.semanticscholar.org/144e/357b1339c27cce7a1e69f0899c21d8140c1f.pdf

Comparison with other methods

A review of softmax speedup methods:

http://ruder.io/word-embeddings-softmax/

NCE vs. IS (Importance Sampling): NCE treats the task as binary classification (data vs. noise), while IS treats it as a multi-class classification problem over the target plus the noise samples.

http://demo.clab.cs.cmu.edu/cdyer/nce_notes.pdf
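
The contrast is easy to see in code. Below is a sketch of the IS (sampled softmax) loss under the same hypothetical signature as the NCE sketch above: the target and the K noise samples form one (K+1)-way classification problem whose logits are corrected by the noise log-probabilities.

import torch
import torch.nn.functional as F

def is_loss(target_logit, noise_logits, target_logq, noise_logq):
    # One (K+1)-way softmax per token: column 0 is the target, columns 1..K are noise
    logits = torch.cat([
        (target_logit - target_logq).unsqueeze(1),  # scores corrected by log q
        noise_logits - noise_logq,
    ], dim=1)
    # The correct class is always column 0, the true token
    labels = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)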

NCE vs. GAN (Generative Adversarial Network):

https://arxiv.org/abs/1412.6515

On improving NCE

Sampling methods

In NCE, the unigram distribution is usually used to approximate the noise distribution because it is cheap to sample from. Sampling from a unigram distribution is multinomial sampling, which costs $O(\log N)$ per sample via a binary search tree. This cost becomes significant as the noise ratio increases.

Since the unigram distribution can be computed before training and remains unchanged throughout it, several works exploit this property to speed up the sampling procedure. The alias method is one of them.

[diagram: constructing the auxiliary data structures for the alias method]

By precomputing auxiliary tables, the alias method reduces the sampling complexity from $O(\log N)$ to $O(1)$ per sample, and it is easy to parallelize.
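
This repo ships its own implementation in nce/alias_multinomial.py; the following is an illustrative re-implementation (not that exact code) showing both the table construction and the O(1) draw.

import torch

class AliasMultinomial:
    def __init__(self, probs):
        # probs: a 1-D tensor of probabilities summing to 1
        n = len(probs)
        scaled = (probs * n).tolist()  # scale so the average cell mass is 1
        self.prob = torch.zeros(n)
        self.alias = torch.zeros(n, dtype=torch.long)
        small = [i for i, p in enumerate(scaled) if p < 1.0]
        large = [i for i, p in enumerate(scaled) if p >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            self.prob[s] = scaled[s]  # cell s keeps mass scaled[s] ...
            self.alias[s] = l         # ... and borrows the rest from cell l
            scaled[l] -= 1.0 - scaled[s]
            (small if scaled[l] < 1.0 else large).append(l)
        for i in small + large:       # leftovers are (numerically) full cells
            self.prob[i] = 1.0

    def draw(self, *size):
        # Pick a cell uniformly, then a biased coin chooses cell vs. alias: O(1)
        idx = torch.randint(len(self.prob), size)
        keep = torch.rand(size) < self.prob[idx]
        return torch.where(keep, idx, self.alias[idx])

For example, AliasMultinomial(torch.tensor([0.5, 0.3, 0.2])).draw(4, 50) returns a 4x50 batch of sampled indices at constant cost per sample.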

Refs:

alias method:

https://hips.seas.harvard.edu/blog/2013/03/03/the-alias-method-efficient-sampling-with-many-discrete-outcomes/

Generic NCE (full-NCE)

Conventional NCE performs the contrasting only at the final linear (softmax) layer: given the input to that layer, the model outputs $p(noise|input)$ and $p(target|input)$. In fact, NCE can be applied to more general situations where the model can output likelihood values for both real data and noise data.

In this codebase I use a variant of generic NCE that I call full-NCE (f-NCE) for clarity. Unlike normal NCE, f-NCE samples the noise at the input embedding.

Refs:

whole-sentence language model by IBM (ICASSP 2018)

Bi-LSTM language model by SpeechLab, SJTU (ICSLP 2016?)

Batched NCE

Conventional NCE draws different noise samples for each data token. This computation pattern is not fully GPU-efficient because it requires batched matrix multiplication. A trick is to share the noise samples across the whole mini-batch, which converts the sparse batched matrix multiplication into a single, more efficient dense matrix multiplication. Batched NCE is already supported by TensorFlow.

A more aggressive approach is what I call self contrasting (a name I coined). Instead of sampling from a noise distribution, the noise samples are simply the other training tokens within the same mini-batch.
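
A small sketch of why sharing helps, with made-up tensor sizes: per-token noise forces a batched matmul over gathered embedding rows, while shared noise collapses into a single dense matmul.

import torch

B, K, H, V = 32, 50, 300, 30000  # batch, noise ratio, hidden size, vocab (made-up)
hidden = torch.randn(B, H)       # one network output per target token
emb = torch.randn(V, H)          # output embedding (the softmax weight matrix)

# Per-token noise: every token gets its own K samples -> batched matmul (bmm)
noise_per_token = torch.randint(V, (B, K))
scores_bmm = torch.bmm(emb[noise_per_token], hidden.unsqueeze(2)).squeeze(2)  # (B, K)

# Shared noise: one set of K samples for the whole mini-batch -> dense matmul
noise_shared = torch.randint(V, (K,))
scores_mm = hidden @ emb[noise_shared].t()  # (B, K)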

Ref:

batched NCE

https://arxiv.org/pdf/1708.05997.pdf

self contrasting:

https://www.isi.edu/natural-language/mt/simple-fast-noise.pdf

Run the word language model example

There's an example illustrating how to use the NCE module in the example folder. It is forked from the pytorch/examples repo.

Requirements

Please run pip install -r requirements.txt first to make sure you have the required python libraries.

  • tqdm is used for the progress bar during training
  • dill is a more flexible replacement for pickle

NCE related Arguments

  • --nce: whether to use NCE as the approximation
  • --noise-ratio <50>: number of noise samples per batch; the noise samples are shared among all tokens in a batch for training speed
  • --norm-term <9>: the constant normalization term ln(Z)
  • --index-module <linear>: index module to use for the NCE module (currently linear and gru are available; gru does not support PPL calculation)
  • --train: train a model, otherwise just evaluate an existing model
  • --vocab <None>: use the given vocabulary file if specified, otherwise use the words in train.txt
  • --loss [full, nce, sampled, mix]: choose one of the loss types for training; the loss is converted to full automatically for PPL evaluation

Examples

Run NCE criterion with linear module:

python main.py --cuda --noise-ratio 10 --norm-term 9 --nce --train

Run NCE criterion with gru module:

python main.py --cuda --noise-ratio 10 --norm-term 9 --nce --train --index-module gru

Run conventional CE criterion:

python main.py --cuda --train

A small benchmark on the swbd+fisher dataset

This is a performance showcase; the dataset is not bundled in this repo, however. The model is trained on concatenated sentences, but the hidden states are not passed across batches. An <s> token is inserted between sentences. The model is evaluated on <s>-padded sentences separately.

Generally, a model trained on concatenated sentences performs slightly worse than one trained on separate sentences, but it saves about 50% of training time by avoiding the sentence padding operation.
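
As a sketch of what the concatenated setting amounts to (a hypothetical helper, not the repo's code): all sentences are joined into one token stream with <s> between them, and the stream is folded into parallel batch columns.

import torch

def concat_batchify(sentences, vocab, batch_size, bos_id):
    # Join sentences into a single stream, inserting <s> (bos_id) between them
    stream = []
    for sent in sentences:
        stream.append(bos_id)
        stream.extend(vocab[w] for w in sent)
    data = torch.tensor(stream, dtype=torch.long)
    # Trim the tail and fold the stream into batch_size parallel columns
    n = data.size(0) // batch_size
    return data[: n * batch_size].view(batch_size, n)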

dataset statistics

  • training samples: 2,200,000 sentences, 22,403,872 words
  • built vocabulary size: ~30K

testbed

  • 1080 Ti
  • i7 7700K
  • pytorch-0.4.0
  • cuda-8.0
  • cudnn-6.0.1

how to run:

python main.py --train --batch-size 96 --cuda --loss nce --noise-ratio 500 --nhid 300 \
  --emsize 300 --log-interval 1000 --nlayers 1 --dropout 0 --weight-decay 1e-8 \
  --data data/swb --min-freq 3 --lr 2 --save nce-500-swb --concat

Running time

  • cross entropy: 6.5 mins/epoch (56K tokens/sec)
  • nce: 2 mins/epoch (187K tokens/sec)

performance

The rescoring is performed on the swbd 50-best lists, thanks to HexLee.

| training loss type | evaluation type | PPL | WER |
|---|---|---|---|
| 3gram | normed | ?? | 19.4 |
| CE (no concat) | normed (full) | 53 | 13.1 |
| CE | normed (full) | 55 | 13.3 |
| NCE | unnormed (NCE) | invalid | 13.4 |
| NCE | normed (full) | 55 | 13.4 |
| importance sample | normed (full) | 55 | 13.4 |
| importance sample | sampled (500) | invalid | 19.0 (worse than w/o rescore) |

File structure

  • example/log/: some log files of these scripts
  • nce/: the NCE module wrapper
    • nce/nce_loss.py: the NCE loss
    • nce/alias_multinomial.py: alias method sampling
    • nce/index_linear.py: an index module used by NCE, as a replacement for the normal Linear module
    • nce/index_gru.py: an index module used by NCE, as a replacement for the whole language model module
  • sample.py: a simple script for the NCE linear module
  • example/: a word language model example that uses NCE as the loss
    • example/vocab.py: a wrapper for the vocabulary object
    • example/model.py: the wrapper of all nn.Modules
    • example/generic_model.py: the model wrapper for the index_gru NCE module
    • example/main.py: the entry point
    • example/utils.py: some utility functions for better code structure

Modified README from Pytorch/examples

This example trains a multi-layer LSTM on a language modeling task. By default, the training script uses the PTB dataset, which is provided.

python main.py --train --cuda --epochs 6        # Train an LSTM on PTB with CUDA

The model will automatically use the cuDNN backend if run on CUDA with cuDNN installed.

During training, if a keyboard interrupt (Ctrl-C) is received, training is stopped and the current model is evaluated against the test dataset.

The main.py script accepts the following arguments:

optional arguments:
  -h, --help         show this help message and exit
  --data DATA        location of the data corpus
  --emsize EMSIZE    size of word embeddings
  --nhid NHID        number of hidden units per layer
  --nlayers NLAYERS  number of layers
  --lr LR            initial learning rate
  --lr-decay         learning rate decay when no progress is observed on validation set
  --weight-decay     weight decay (L2 regularization)
  --clip CLIP        gradient clipping
  --epochs EPOCHS    upper epoch limit
  --batch-size N     batch size
  --dropout DROPOUT  dropout applied to layers (0 = no dropout)
  --seed SEED        random seed
  --cuda             use CUDA
  --log-interval N   report interval
  --save SAVE        path to save the final model
  --bptt             max length of truncated bptt
  --concat           use concatenated sentence instead of individual sentence

CHANGELOG

  • 2019.09.09: Improve numeric stability by computing directly on the logits