rdspring1 / Pytorch_gbw_lm
Licence: apache-2.0
PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset
Stars: ✭ 101
Programming Languages
python
Projects that are alternatives of or similar to Pytorch gbw lm
Context
ConText v4: Neural networks for text categorization
Stars: ✭ 120 (+18.81%)
Mutual labels: gpu, lstm
Chinese-Word-Segmentation-in-NLP
State of the art Chinese Word Segmentation with Bi-LSTMs
Stars: ✭ 23 (-77.23%)
Mutual labels: lstm, language-model
Alphaction
Spatio-Temporal Action Localization System
Stars: ✭ 221 (+118.81%)
Mutual labels: gpu, torch
Pytorch Learners Tutorial
PyTorch tutorial for learners
Stars: ✭ 97 (-3.96%)
Mutual labels: lstm, torch
Char Rnn Chinese
Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch. Based on code of https://github.com/karpathy/char-rnn. Support Chinese and other things.
Stars: ✭ 192 (+90.1%)
Mutual labels: lstm, language-model
Awd Lstm Lm
LSTM and QRNN Language Model Toolkit for PyTorch
Stars: ✭ 1,834 (+1715.84%)
Mutual labels: lstm, language-model
Spago
Self-contained Machine Learning and Natural Language Processing library in Go
Stars: ✭ 854 (+745.54%)
Mutual labels: lstm, language-model
Char rnn lm zh
Chinese language model, implemented following the official PyTorch documentation
Stars: ✭ 57 (-43.56%)
Mutual labels: lstm, language-model
Pytorch Cpp
C++ Implementation of PyTorch Tutorials for Everyone
Stars: ✭ 1,014 (+903.96%)
Mutual labels: language-model, torch
Aiopen
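AIOpen is a collection that classifies open-source AI projects by the three elements of artificial intelligence (data, algorithms, computing power). The project aims to track current open-source deep learning (DL) projects and list as many of them as possible, and it also includes some code the author has studied in the past. Through these open-source projects, it hopes to give newcomers to AI a clearer and more complete understanding of artificial intelligence (deep learning).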
Stars: ✭ 62 (-38.61%)
Mutual labels: gpu, lstm
Pytorch Pos Tagging
A tutorial on how to implement models for part-of-speech tagging using PyTorch and TorchText.
Stars: ✭ 96 (-4.95%)
Mutual labels: lstm
Pynvvl
A Python wrapper of NVIDIA Video Loader (NVVL) with CuPy for fast video loading with Python
Stars: ✭ 95 (-5.94%)
Mutual labels: gpu
Keras Video Classifier
Keras implementation of video classifier
Stars: ✭ 100 (-0.99%)
Mutual labels: lstm
PyTorch Large-Scale Language Model
A Large-Scale PyTorch Language Model trained on the 1-Billion Word (LM1B / GBW) dataset
Latest Results
- 39.98 Perplexity after 5 training epochs using LSTM Language Model with Adam Optimizer
- Trained in ~26 hours on 1 Nvidia V100 GPU (~5.1 hours per epoch) with a batch size of 2048 (~10.7 GB GPU memory)
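For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below is illustrative only; the function name `perplexity` is hypothetical and is not taken from this repository.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood), using natural logs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy check: a model that gives every token probability 1/40
# behaves like a uniform choice among 40 words.
uniform_log_probs = [math.log(1.0 / 40.0)] * 1000
print(round(perplexity(uniform_log_probs), 2))
```

A lower value means the model is, on average, less "surprised" by the held-out text.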
Previous Results
- 46.47 Perplexity after 5 training epochs on a 1-layer, 2048-unit, 256-projection LSTM Language Model [3]
- Trained for 3 days using 1 Nvidia P100 GPU (~12.5 hours per epoch)
- Implemented Sampled Softmax and Log-Uniform Sampler functions
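As a rough illustration of what a log-uniform sampler does, here is a minimal pure-Python sketch. It assumes word ids are sorted by decreasing frequency, so that small ids should be drawn more often; the function names are hypothetical and this is not the Cython implementation shipped with the project.

```python
import math
import random


def log_uniform_prob(k, vocab_size):
    """P(k) = log((k + 2) / (k + 1)) / log(vocab_size + 1), for 0 <= k < vocab_size."""
    return math.log((k + 2) / (k + 1)) / math.log(vocab_size + 1)


def log_uniform_sample(num_samples, vocab_size, rng):
    """Draw word ids by inverse-transform sampling from the distribution above."""
    log_range = math.log(vocab_size + 1)
    samples = []
    for _ in range(num_samples):
        # exp(u * log(V + 1)) lies in [1, V + 1), so k lies in [0, V - 1].
        k = int(math.exp(rng.random() * log_range)) - 1
        samples.append(min(k, vocab_size - 1))  # guard against float overshoot
    return samples


rng = random.Random(0)
negatives = log_uniform_sample(num_samples=8, vocab_size=1000, rng=rng)
print(negatives)
```

In sampled softmax, ids drawn this way serve as negative classes, and `log_uniform_prob` supplies the expected-count correction for each sampled id.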
GPU Hardware Requirement
Type | LM Memory Size | GPU |
---|---|---|
w/o tied weights | ~9 GB | Nvidia 1080 TI, Nvidia Titan X |
w/ tied weights [6] | ~7 GB | Nvidia 1070 or higher |
- There is an option to tie the word embedding and softmax weight matrices together to save GPU memory.
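A back-of-the-envelope estimate shows why tying helps. The vocabulary size below is an assumption for illustration (this README does not state it); the projection size comes from the hyper-parameter table.

```python
def matrix_megabytes(rows, cols, bytes_per_value=4):
    """Size of a dense float32 matrix in megabytes (1 MB = 1e6 bytes)."""
    return rows * cols * bytes_per_value / 1e6


VOCAB_SIZE = 800_000   # ASSUMPTION: rough vocabulary size, not stated in this README
PROJECTION_SIZE = 256  # from the hyper-parameter table

one_matrix = matrix_megabytes(VOCAB_SIZE, PROJECTION_SIZE)
untied = 2 * one_matrix  # separate embedding and softmax matrices
tied = 1 * one_matrix    # a single shared matrix
print(f"untied: {untied:.1f} MB, tied: {tied:.1f} MB, saved: {untied - tied:.1f} MB")
```

This counts weights only. Gradients and per-parameter optimizer state for the removed matrix would also disappear, so the actual saving is likely larger than this figure; the ~2 GB gap in the table is the measured value.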
Hyper-Parameters [3]
Parameter | Value |
---|---|
# Epochs | 5 |
Training Batch Size | 128 |
Evaluation Batch Size | 1 |
BPTT | 20 |
Embedding Size | 256 |
Hidden Size | 2048 |
Projection Size | 256 |
Tied Embedding + Softmax | False |
# Layers | 1 |
Optimizer | AdaGrad |
Learning Rate | 0.10 |
Gradient Clipping | 1.00 |
Dropout | 0.01 |
Weight-Decay (L2 Penalty) | 1e-6 |
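One way to keep these settings together in code is a small frozen dataclass. This is a sketch only: the class and field names are hypothetical and do not mirror the project's command-line arguments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LMConfig:
    """Hyper-parameters from the table above (hypothetical field names)."""
    epochs: int = 5
    train_batch_size: int = 128
    eval_batch_size: int = 1
    bptt: int = 20
    embedding_size: int = 256
    hidden_size: int = 2048
    projection_size: int = 256
    tied: bool = False
    num_layers: int = 1
    optimizer: str = "adagrad"
    learning_rate: float = 0.10
    grad_clip: float = 1.00
    dropout: float = 0.01
    weight_decay: float = 1e-6

    def __post_init__(self):
        # Sharing one matrix between embedding and softmax needs matching widths.
        if self.tied and self.embedding_size != self.projection_size:
            raise ValueError("tied weights need embedding_size == projection_size")

    @property
    def tokens_per_step(self) -> int:
        """Number of tokens processed in one training step (batch size x BPTT)."""
        return self.train_batch_size * self.bptt


config = LMConfig()
print(config.tokens_per_step)
```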
Setup - Torch Data Format
- Download Google Billion Word Dataset for Torch - Link
- Run "process_gbw.py" on the "train_data.th7" file to create the "train_data.sid" file
- Install Cython and build the Log_Uniform sampler
- Convert the Torch data tensors to the PyTorch tensor format (requires PyTorch v0.4.1)
I use the GBW data preprocessed for the Torch framework (see Torch GBW). Each data tensor contains all the words in one data partition. The "train_data.sid" file marks the boundaries of each independent sentence. Together, the preprocessing step and the "train_data.sid" file speed up loading of the massive training data.
- Data Tensors - (test_data, valid_data, train_data, train_small, train_tiny) - (#words x 2) matrix - (sentence id, word id)
- Sentence ID Tensor - (#sentences x 2) matrix - (start position, sentence length)
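The two layouts above can be illustrated with plain Python lists. The sketch assumes each sentence occupies a contiguous run of rows in the data tensor; the helper names are hypothetical and this is not the code in "process_gbw.py".

```python
def build_sentence_index(data):
    """Turn (sentence id, word id) rows into (start position, sentence length) rows."""
    index = []
    start = 0
    for pos in range(1, len(data) + 1):
        # Close the current sentence at the end of the data or when the id changes.
        if pos == len(data) or data[pos][0] != data[start][0]:
            index.append((start, pos - start))
            start = pos
    return index


def get_sentence(data, index, i):
    """Return the word ids of sentence i with a single slice, without scanning."""
    start, length = index[i]
    return [word_id for _, word_id in data[start:start + length]]


# Toy data tensor: three sentences of lengths 3, 2 and 1.
data = [(0, 5), (0, 9), (0, 2), (1, 7), (1, 3), (2, 8)]
index = build_sentence_index(data)
print(index)
print(get_sentence(data, index, 1))
```

With the index precomputed, fetching any sentence is a constant-time lookup followed by a slice, which is why storing it on disk avoids rescanning the word tensor at load time.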
Setup - Original Data Format
- Download 1-Billion Word Dataset - Link
The Torch Data Format loads the entire dataset at once, so it requires at least 32 GB of memory. The original format partitions the dataset into smaller chunks, but it runs slower.
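The memory trade-off of the chunked format can be sketched with a generator that streams one shard file at a time. This assumes each shard is a text file with one whitespace-tokenized sentence per line; the function name and the glob pattern in the comment are hypothetical.

```python
import glob


def iter_sentences(shard_pattern):
    """Yield token lists shard by shard, so only one file is read at a time."""
    for path in sorted(glob.glob(shard_pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


# Hypothetical usage; adjust the pattern to wherever the shards are stored:
# for tokens in iter_sentences("path/to/shards/*"):
#     ...
```

Streaming keeps peak memory close to the size of one line rather than the whole corpus, at the cost of repeated file I/O and parsing on every pass.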
References