
lopuhin / Transformer Lm

Transformer language model (GPT-2) with sentencepiece tokenizer

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Transformer Lm

Easy Bert
A Dead Simple BERT API for Python and Java (https://github.com/google-research/bert)
Stars: ✭ 106 (-31.17%)
Mutual labels:  language-model
Kogpt2 Finetuning
🔥 Korean GPT-2, KoGPT2 fine-tuning, cased. Trained on Korean lyrics data 🔥
Stars: ✭ 124 (-19.48%)
Mutual labels:  language-model
Awesome Speech Recognition Speech Synthesis Papers
Automatic Speech Recognition (ASR), Speaker Verification, Speech Synthesis, Text-to-Speech (TTS), Language Modelling, Singing Voice Synthesis (SVS), Voice Conversion (VC)
Stars: ✭ 2,085 (+1253.9%)
Mutual labels:  language-model
Getlang
Natural language detection package in pure Go
Stars: ✭ 110 (-28.57%)
Mutual labels:  language-model
Robbert
A Dutch RoBERTa-based language model
Stars: ✭ 120 (-22.08%)
Mutual labels:  language-model
Electra
Pre-trained Chinese ELECTRA model: pretraining a Chinese model based on adversarial learning
Stars: ✭ 132 (-14.29%)
Mutual labels:  language-model
Pytorch gbw lm
PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset
Stars: ✭ 101 (-34.42%)
Mutual labels:  language-model
Electra pytorch
Pretrain and fine-tune ELECTRA with fastai and huggingface (results of the paper replicated!)
Stars: ✭ 149 (-3.25%)
Mutual labels:  language-model
Dynamic Memory Networks Plus Pytorch
Implementation of Dynamic memory networks plus in Pytorch
Stars: ✭ 123 (-20.13%)
Mutual labels:  language-model
Ld Net
Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling
Stars: ✭ 148 (-3.9%)
Mutual labels:  language-model
Keras Gpt 2
Load GPT-2 checkpoint and generate texts
Stars: ✭ 113 (-26.62%)
Mutual labels:  language-model
Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+2113.64%)
Mutual labels:  language-model
Clue
Chinese Language Understanding Evaluation Benchmark (CLUE): datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+1474.68%)
Mutual labels:  language-model
Transformers
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Stars: ✭ 55,742 (+36096.1%)
Mutual labels:  language-model
Awd Lstm Lm
LSTM and QRNN Language Model Toolkit for PyTorch
Stars: ✭ 1,834 (+1090.91%)
Mutual labels:  language-model
Openseq2seq
Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
Stars: ✭ 1,378 (+794.81%)
Mutual labels:  language-model
Chars2vec
Character-based word embeddings model based on RNN for handling real world texts
Stars: ✭ 130 (-15.58%)
Mutual labels:  language-model
Speecht
Open source speech-to-text software written in TensorFlow
Stars: ✭ 152 (-1.3%)
Mutual labels:  language-model
Awesome Sentence Embedding
A curated list of pretrained sentence and word embedding models
Stars: ✭ 1,973 (+1181.17%)
Mutual labels:  language-model
Tupe
Transformer with Untied Positional Encoding (TUPE). Code of paper "Rethinking Positional Encoding in Language Pre-training". Improve existing models like BERT.
Stars: ✭ 143 (-7.14%)
Mutual labels:  language-model

Training GPT-2 transformer language model with sentencepiece tokenizer
=======================================================================

.. image:: https://img.shields.io/travis/lopuhin/transformer-lm/master.svg
   :target: https://travis-ci.org/lopuhin/transformer-lm
   :alt: Build Status

Training GPT-2 transformer language model on your own corpora with `sentencepiece <https://github.com/google/sentencepiece>`_ tokenization.

This repo contains a PyTorch implementation of GPT-2, which supports multi-GPU training. It also contains a TensorFlow implementation in ``lm/gpt_2_tf``, but it is no longer developed. They share the same data preparation scripts. The TF training command is ``gpt-2-tf-train`` and requires TensorFlow 1.13. The documentation below is for the PyTorch version.

.. contents::

Installation
------------

Python 3.6+ is required, with a torch nightly build or torch 1.6.0+. Working in a virtualenv is assumed below. First `install <https://pytorch.org/get-started/locally/>`__ the appropriate version of PyTorch, and then::

    pip install -r requirements.txt
    python setup.py develop
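
To double-check that the environment meets these requirements before training, a quick sanity check can help (a minimal sketch; nothing repo-specific is assumed)::

    import sys
    import torch

    # Python 3.6+ and torch 1.6.0+ (or a nightly build) are required.
    assert sys.version_info >= (3, 6), "Python 3.6+ is required"
    print("torch version:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    print("visible GPUs:", torch.cuda.device_count())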

Usage
-----

Instructions are below. See also ``tests/test_shakespeare.sh`` for a complete pipeline demo on a small corpus (it takes about a minute on a CPU).

Prepare data for training
+++++++++++++++++++++++++

Corpus format: a directory with top-level ``train``, ``valid`` and ``test`` folders. Each top-level folder may contain sub-folders. Inside them, there must be UTF-8 encoded text files with a ``.txt`` extension.
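
For illustration, a toy corpus in this format can be created with a few lines of Python (the ``data/corpora-example`` path and file names below are made up for the example)::

    from pathlib import Path

    # Expected layout: top-level train/valid/test folders (sub-folders allowed)
    # containing UTF-8 encoded .txt files.
    root = Path("data/corpora-example")
    texts = {
        "train": "First training document.",
        "valid": "A validation document.",
        "test": "A test document.",
    }
    for split, text in texts.items():
        split_dir = root / split
        split_dir.mkdir(parents=True, exist_ok=True)
        (split_dir / "doc-0.txt").write_text(text, encoding="utf-8")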

The commands that train the sentencepiece model and encode the corpus support multiple corpora; in the examples below we assume they can be listed as ``data/corpora-*``.

1. Train the sentencepiece model (``sp-text.txt`` can be removed after running). This can consume a large amount of memory; adjust the sentencepiece arguments as advised if needed (this is not supported by the ``sp-train`` command directly; see the sketch after this list for one workaround)::

       sp-train data/corpora-* sp-text.txt sp-model

2. Encode the corpora, producing numpy files::

       sp-encode data/corpora-* sp-model.model data/encoded
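
If ``sp-train`` does run out of memory, one possible workaround is to call the sentencepiece Python API directly rather than the repo's wrapper (a sketch; the vocabulary size and sampling limits below are illustrative, and the input is assumed to be a plain-text dump such as the ``sp-text.txt`` produced above)::

    import sentencepiece as spm

    # Train sentencepiece directly, sampling a subset of sentences to
    # bound memory usage (values are illustrative, not repo defaults).
    spm.SentencePieceTrainer.train(
        input="sp-text.txt",             # plain-text dump of the corpora
        model_prefix="sp-model",         # writes sp-model.model / sp-model.vocab
        vocab_size=50000,                # pick the vocabulary size you need
        input_sentence_size=10_000_000,  # cap the number of sentences loaded
        shuffle_input_sentence=True,     # sample sentences randomly
    )

    # Quick check that the trained model round-trips text.
    sp = spm.SentencePieceProcessor(model_file="sp-model.model")
    ids = sp.encode("To be, or not to be", out_type=int)
    print(ids)
    print(sp.decode(ids))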

Training
++++++++

Example command::

    gpt-2 run-root data/encoded sp-model.model

``run-root`` will contain model checkpoints and JSON-lines logs, which can be plotted in a Jupyter notebook with ``json_log_plots.plot("run-root")``, with the number of tokens seen on the X axis (see the snippet below).
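
A minimal notebook cell for this, assuming the ``json_log_plots`` package used for logging is installed in the same environment::

    import json_log_plots

    # Plot training and validation curves from the run directory;
    # the X axis is the number of tokens seen.
    json_log_plots.plot("run-root")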

Default hyperparameters correspond to the released "small" GPT-2 model.

When multiple GPUs are available, they will be used for training with the help of ``torch.distributed``.

If the run path already exists and the ``--clean`` flag is NOT passed, training is resumed. Note that all parameters still need to be specified, and the model parameters need to match.

Notes on training parameters:

* ``--batch-size`` is per GPU, so you don't need to re-tune it when changing the number of GPUs; just use the maximum that fits into memory.
* ``--g-accum-gradients`` is the global number of gradient accumulation steps; it must be divisible by the number of GPUs. The effective global batch size is always ``batch_size * g_accum_gradients`` (see the arithmetic sketch after this list).
* ``--lr`` does not need to be changed when changing ``--batch-size``, ``--g-accum-gradients``, the number of GPUs or ``--n-ctx``: the loss is already scaled appropriately.
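
For example, with hypothetical values the batch-size arithmetic works out as follows (plain arithmetic only; the numbers are illustrative)::

    batch_size = 4          # --batch-size, per GPU
    g_accum_gradients = 8   # --g-accum-gradients, global; divisible by n_gpus
    n_gpus = 2

    # Effective global batch size is independent of the number of GPUs.
    effective_batch_size = batch_size * g_accum_gradients
    # Since g_accum_gradients is global, each GPU presumably performs
    # g_accum_gradients / n_gpus accumulation steps per optimizer update.
    accum_steps_per_gpu = g_accum_gradients // n_gpus
    print(effective_batch_size, accum_steps_per_gpu)  # 32 4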

Inference
+++++++++

Example command::

    gpt-2-gen run-root "Artificial intelligence"

Here ``run-root`` is the directory containing the model checkpoints, and "Artificial intelligence" is the text prefix used as a starting point for generating tokens.

Notes on inference parameters:

* ``--tokens-to-generate``: number of tokens to generate; default is 42.
* ``--top-k``: number of token candidates to generate for each position (beam width); default is 8. A generic illustration of top-k sampling follows below.
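
To illustrate what ``--top-k`` controls, here is a generic top-k sampling step in PyTorch (an illustrative sketch of the general technique, not necessarily the exact strategy implemented by ``gpt-2-gen``)::

    import torch

    def sample_top_k(logits: torch.Tensor, top_k: int = 8) -> int:
        # Keep only the top_k most likely next-token candidates,
        # renormalize their probabilities, and sample one token id.
        top_values, top_indices = torch.topk(logits, top_k)
        probs = torch.softmax(top_values, dim=-1)
        choice = torch.multinomial(probs, num_samples=1)
        return int(top_indices[choice])

    # Example with random logits over a toy vocabulary of 100 tokens.
    logits = torch.randn(100)
    print(sample_top_k(logits, top_k=8))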

License & credits
-----------------

License is MIT.

The TensorFlow GPT-2 model is taken from https://github.com/openai/gpt-2/blob/master/src/model.py, and the TensorFlow GPT-2 training code is based on https://github.com/nshepperd/gpt-2/blob/finetuning/train.py.

The PyTorch port is based on the original OpenAI code.

The test Shakespeare corpus under ``tests/shakespeare`` is from http://shakespeare.mit.edu and is in the public domain.

See also the `OpenAI GPT-2 paper <https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf>`_ and `blog <https://openai.com/blog/better-language-models/>`_.
