rdspring1 / Pytorch_gbw_lm

License: Apache-2.0
PyTorch Language Model for the 1-Billion Word (LM1B / GBW) Dataset


PyTorch Large-Scale Language Model

A large-scale PyTorch language model trained on the 1-Billion Word (LM1B / GBW) dataset

Latest Results

  • 39.98 perplexity after 5 training epochs using an LSTM language model with the Adam optimizer (perplexity is exp of the mean per-token cross-entropy; see the sketch below)
  • Trained in ~26 hours on 1 Nvidia V100 GPU (~5.1 hours per epoch) with a batch size of 2048 (~10.7 GB of GPU memory)
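
A minimal evaluation sketch of how a perplexity number like the one above is typically computed, assuming a model that returns (batch, seq, vocab) logits and a loader of (input, target) batches; the names here are illustrative, not the repository's exact code:

```python
import math

import torch
import torch.nn.functional as F

def evaluate_perplexity(model, data_loader, device="cuda"):
    """Compute corpus perplexity as exp(mean per-token cross-entropy)."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    with torch.no_grad():
        for inputs, targets in data_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            logits = model(inputs)                       # (batch, seq, vocab)
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),        # flatten to (N, vocab)
                targets.view(-1),
                reduction="sum",                         # sum, then divide by tokens
            )
            total_loss += loss.item()
            total_tokens += targets.numel()
    return math.exp(total_loss / total_tokens)
```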

Previous Results

  • 46.47 perplexity after 5 training epochs with a 1-layer, 2048-unit, 256-projection LSTM language model [3]
  • Trained for 3 days on 1 Nvidia P100 GPU (~12.5 hours per epoch)
  • Implemented the sampled softmax loss and a log-uniform negative sampler (sketched below)
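
The log-uniform sampler draws negative classes from the Zipfian proposal P(k) = (log(k+2) - log(k+1)) / log(V+1), where word ids are sorted by descending frequency [4]. The repository builds its sampler in Cython; this NumPy sketch only illustrates the distribution, not the repository's exact code:

```python
import numpy as np

class LogUniformSampler:
    """Draw class ids k in [0, V) with P(k) = log((k+2)/(k+1)) / log(V+1)."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.log_range = np.log(vocab_size + 1)

    def sample(self, num_samples):
        # Inverse-CDF trick: floor(exp(u * log(V+1))) - 1 follows the
        # log-uniform law when u ~ Uniform(0, 1).
        u = np.random.uniform(size=num_samples)
        return (np.exp(u * self.log_range).astype(np.int64) - 1) % self.vocab_size

    def probability(self, k):
        # Proposal probability of class id k.
        return (np.log(k + 2.0) - np.log(k + 1.0)) / self.log_range
```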

GPU Hardware Requirements

Type                  LM Memory Size   GPU
w/o tied weights      ~9 GB            Nvidia 1080 Ti, Nvidia Titan X
w/ tied weights [6]   ~7 GB            Nvidia 1070 or higher
  • There is an option to tie the word embedding and softmax weight matrices together to save GPU memory (a weight-tying sketch follows below).
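
Tying is possible because the input embedding and output softmax matrices are both (vocab_size x embedding_size) once the LSTM output is projected back down to the embedding size; in PyTorch it amounts to a single parameter assignment. A minimal sketch, assuming a model shaped like the one described in this README (module names are illustrative):

```python
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, tie_weights=True):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.proj = nn.Linear(hidden_size, embed_size)   # project back to embed_size
        self.decoder = nn.Linear(embed_size, vocab_size)
        if tie_weights:
            # Share one (vocab_size x embed_size) matrix between the input
            # embedding and the output softmax, saving GPU memory [6].
            self.decoder.weight = self.embedding.weight

    def forward(self, x, hidden=None):
        out, hidden = self.lstm(self.embedding(x), hidden)
        return self.decoder(self.proj(out)), hidden
```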

Hyper-Parameters [3]

Parameter                  Value
# Epochs                   5
Training Batch Size        128
Evaluation Batch Size      1
BPTT                       20
Embedding Size             256
Hidden Size                2048
Projection Size            256
Tied Embedding + Softmax   False
# Layers                   1
Optimizer                  AdaGrad
Learning Rate              0.10
Gradient Clipping          1.00
Dropout                    0.01
Weight Decay (L2 Penalty)  1e-6
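
A hedged sketch of how the table maps onto PyTorch objects, reusing the TiedLM sketch above; the 793,471-word GBW vocabulary size and all names here are assumptions, not the repository's code:

```python
import torch

# Values from the hyper-parameter table above.
config = dict(
    epochs=5, train_batch_size=128, eval_batch_size=1, bptt=20,
    embed_size=256, hidden_size=2048, proj_size=256, num_layers=1,
    lr=0.10, clip=1.00, dropout=0.01, weight_decay=1e-6,
)

model = TiedLM(vocab_size=793471,                # GBW vocabulary size (assumed)
               embed_size=config["embed_size"],
               hidden_size=config["hidden_size"],
               tie_weights=False).cuda()         # table: tying disabled
optimizer = torch.optim.Adagrad(model.parameters(), lr=config["lr"],
                                weight_decay=config["weight_decay"])

# Per training step, clip the global gradient norm before updating:
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), config["clip"])
# optimizer.step()
```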

Setup - Torch Data Format

  1. Download the Google Billion Word dataset for Torch - Link
  2. Run "process_gbw.py" on the "train_data.th7" file to create the "train_data.sid" file
  3. Install the Cython framework and build the Log_Uniform sampler
  4. Convert the Torch data tensors to PyTorch tensor format (requires PyTorch v0.4.1; see the sketch below)
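
Step 4 works because PyTorch v0.4.1 still shipped torch.utils.serialization.load_lua, which reads Torch7 .th7 files (it was removed in later releases). A minimal conversion sketch; the output filenames are assumptions:

```python
import torch
from torch.utils.serialization import load_lua  # available up to PyTorch 0.4.1

for name in ("train_data", "valid_data", "test_data"):
    tensor = load_lua("%s.th7" % name)   # Torch7 tensor from the GBW download
    torch.save(tensor, "%s.pt" % name)   # hypothetical PyTorch output name
```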

I leverage the GBW data preprocessed for the Torch framework (see Torch GBW [5]). Each data tensor contains all the words in a data partition. The "train_data.sid" file marks the start and end positions of each independent sentence. The preprocessing step and the "train_data.sid" file speed up loading of the massive training data (a slicing sketch follows the list below).

  • Data tensors (test_data, valid_data, train_data, train_small, train_tiny) - a (#words x 2) matrix of (sentence id, word id) pairs
  • Sentence ID tensor - a (#sentences x 2) matrix of (start position, sentence length) pairs
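
Given those layouts, recovering one sentence is a contiguous slice of the data tensor. A sketch assuming both tensors have been converted to PyTorch files as above (the filenames are assumptions):

```python
import torch

data = torch.load("train_data.pt")     # (#words x 2): (sentence id, word id)
sid = torch.load("train_data_sid.pt")  # (#sentences x 2): (start, length)

def sentence_word_ids(i):
    """Return the word-id sequence of sentence i as a 1-D tensor."""
    start, length = sid[i, 0].item(), sid[i, 1].item()
    return data[start:start + length, 1]   # keep the word-id column only
```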

Setup - Original Data Format

  1. Download the 1-Billion Word dataset - Link

The Torch data format loads the entire dataset at once, so it requires at least 32 GB of memory. The original format partitions the dataset into smaller chunks, but it runs more slowly (see the streaming sketch below).
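
For contrast, a rough sketch of the streaming approach for the original format: reading one news.en shard at a time keeps resident memory at a single chunk instead of the whole corpus. The file pattern follows the standard 1-Billion Word layout, and the tokenization here is plain whitespace splitting (both assumptions):

```python
import glob

def stream_sentences(pattern="training-monolingual.tokenized.shuffled/news.en-*"):
    """Yield one tokenized sentence at a time, one shard in memory at once."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line.split()

# Example: inspect the first few sentences of the first shard.
for i, words in enumerate(stream_sentences()):
    if i == 3:
        break
    print(len(words), words[:5])
```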

References

  1. Exploring the Limits of Language Modeling - GitHub
  2. Factorization Tricks for LSTM Networks - GitHub
  3. Efficient Softmax Approximation for GPUs - GitHub
  4. Candidate Sampling
  5. Torch GBW
  6. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling