
lopuhin / Transformer Lm

Transformer language model (GPT-2) with sentencepiece tokenizer

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Transformer Lm

Easy Bert
A Dead Simple BERT API for Python and Java (https://github.com/google-research/bert)
Stars: ✭ 106 (-31.17%)
Mutual labels:  language-model
Kogpt2 Finetuning
🔥 Korean GPT-2, KoGPT2 fine-tuning, cased. Trained on Korean lyrics data 🔥
Stars: ✭ 124 (-19.48%)
Mutual labels:  language-model
Awesome Speech Recognition Speech Synthesis Papers
Automatic Speech Recognition (ASR), Speaker Verification, Speech Synthesis, Text-to-Speech (TTS), Language Modelling, Singing Voice Synthesis (SVS), Voice Conversion (VC)
Stars: ✭ 2,085 (+1253.9%)
Mutual labels:  language-model
Getlang
Natural language detection package in pure Go
Stars: ✭ 110 (-28.57%)
Mutual labels:  language-model
Robbert
A Dutch RoBERTa-based language model
Stars: ✭ 120 (-22.08%)
Mutual labels:  language-model
Electra
Pre-trained Chinese ELECTRA model: pretraining a Chinese model based on adversarial learning
Stars: ✭ 132 (-14.29%)
Mutual labels:  language-model
Pytorch gbw lm
PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset
Stars: ✭ 101 (-34.42%)
Mutual labels:  language-model
Electra pytorch
Pretrain and fine-tune ELECTRA with fastai and huggingface (results of the paper replicated!)
Stars: ✭ 149 (-3.25%)
Mutual labels:  language-model
Dynamic Memory Networks Plus Pytorch
Implementation of Dynamic memory networks plus in Pytorch
Stars: ✭ 123 (-20.13%)
Mutual labels:  language-model
Ld Net
Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling
Stars: ✭ 148 (-3.9%)
Mutual labels:  language-model
Keras Gpt 2
Load GPT-2 checkpoint and generate texts
Stars: ✭ 113 (-26.62%)
Mutual labels:  language-model
Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+2113.64%)
Mutual labels:  language-model
Clue
Chinese Language Understanding Evaluation Benchmark (CLUE): datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+1474.68%)
Mutual labels:  language-model
Transformers
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Stars: ✭ 55,742 (+36096.1%)
Mutual labels:  language-model
Awd Lstm Lm
LSTM and QRNN Language Model Toolkit for PyTorch
Stars: ✭ 1,834 (+1090.91%)
Mutual labels:  language-model
Openseq2seq
Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
Stars: ✭ 1,378 (+794.81%)
Mutual labels:  language-model
Chars2vec
Character-based word embeddings model based on RNN for handling real world texts
Stars: ✭ 130 (-15.58%)
Mutual labels:  language-model
Speecht
Open source speech-to-text software written in TensorFlow
Stars: ✭ 152 (-1.3%)
Mutual labels:  language-model
Awesome Sentence Embedding
A curated list of pretrained sentence and word embedding models
Stars: ✭ 1,973 (+1181.17%)
Mutual labels:  language-model
Tupe
Transformer with Untied Positional Encoding (TUPE). Code of paper "Rethinking Positional Encoding in Language Pre-training". Improve existing models like BERT.
Stars: ✭ 143 (-7.14%)
Mutual labels:  language-model

Training GPT-2 transformer language model with sentencepiece tokenizer
=======================================================================

.. image:: https://img.shields.io/travis/lopuhin/transformer-lm/master.svg
   :target: https://travis-ci.org/lopuhin/transformer-lm
   :alt: Build Status

Training GPT-2 transformer language model on your own corpora with `sentencepiece <https://github.com/google/sentencepiece>`_ tokenization.

This repo contains a PyTorch implementation of GPT-2, which supports multi-GPU training. It also contains a TensorFlow implementation in ``lm/gpt_2_tf``, but it is no longer developed. They share the same data preparation scripts. The TF training command is ``gpt-2-tf-train`` and requires TensorFlow 1.13. The documentation below is for the PyTorch version.

.. contents::

Installation
------------

Python 3.6+ is required, with a torch nightly build or torch 1.6.0+. Working in a virtualenv is assumed below. First `install <https://pytorch.org/get-started/locally/>`__ the appropriate version of PyTorch, and then::

    pip install -r requirements.txt
    python setup.py develop
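
To double-check that the environment meets these requirements before training, a quick sanity check can help (a minimal sketch; nothing repo-specific is assumed)::

    import sys
    import torch

    # Python 3.6+ and torch 1.6.0+ (or a nightly build) are required.
    assert sys.version_info >= (3, 6), "Python 3.6+ is required"
    print("torch version:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    print("visible GPUs:", torch.cuda.device_count())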

Usage
-----

Instructions are below. See also ``tests/test_shakespeare.sh`` for a complete pipeline demo on a small corpus (it takes about a minute on a CPU).

Prepare data for training
+++++++++++++++++++++++++

Corpus format: a directory with top-level ``train``, ``valid`` and ``test`` folders. Each top-level folder may contain sub-folders. Inside them, there must be UTF-8 encoded text files with a ``.txt`` extension.
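
For illustration, a toy corpus in this format can be created with a few lines of Python (the ``data/corpora-example`` path and file names below are made up for the example)::

    from pathlib import Path

    # Expected layout: top-level train/valid/test folders (sub-folders allowed)
    # containing UTF-8 encoded .txt files.
    root = Path("data/corpora-example")
    texts = {
        "train": "First training document.",
        "valid": "A validation document.",
        "test": "A test document.",
    }
    for split, text in texts.items():
        split_dir = root / split
        split_dir.mkdir(parents=True, exist_ok=True)
        (split_dir / "doc-0.txt").write_text(text, encoding="utf-8")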

The commands that train the sentencepiece model and encode the corpus support multiple corpora; in the examples below we assume they can be listed as ``data/corpora-*``.

1. Train the sentencepiece model (``sp-text.txt`` can be removed after running). This can consume a large amount of memory; adjust the sentencepiece arguments as advised if needed (this is not supported by the ``sp-train`` command directly; see the sketch after this list for one workaround)::

       sp-train data/corpora-* sp-text.txt sp-model

2. Encode the corpora, producing numpy files::

       sp-encode data/corpora-* sp-model.model data/encoded
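
If ``sp-train`` does run out of memory, one possible workaround is to call the sentencepiece Python API directly rather than the repo's wrapper (a sketch; the vocabulary size and sampling limits below are illustrative, and the input is assumed to be a plain-text dump such as the ``sp-text.txt`` produced above)::

    import sentencepiece as spm

    # Train sentencepiece directly, sampling a subset of sentences to
    # bound memory usage (values are illustrative, not repo defaults).
    spm.SentencePieceTrainer.train(
        input="sp-text.txt",             # plain-text dump of the corpora
        model_prefix="sp-model",         # writes sp-model.model / sp-model.vocab
        vocab_size=50000,                # pick the vocabulary size you need
        input_sentence_size=10_000_000,  # cap the number of sentences loaded
        shuffle_input_sentence=True,     # sample sentences randomly
    )

    # Quick check that the trained model round-trips text.
    sp = spm.SentencePieceProcessor(model_file="sp-model.model")
    ids = sp.encode("To be, or not to be", out_type=int)
    print(ids)
    print(sp.decode(ids))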

Training
++++++++

Example command::

    gpt-2 run-root data/encoded sp-model.model

``run-root`` will contain model checkpoints and JSON-lines logs, which can be plotted in a Jupyter notebook with ``json_log_plots.plot("run-root")``, with the number of tokens seen on the X axis (see the snippet below).
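
A minimal notebook cell for this, assuming the ``json_log_plots`` package used for logging is installed in the same environment::

    import json_log_plots

    # Plot training and validation curves from the run directory;
    # the X axis is the number of tokens seen.
    json_log_plots.plot("run-root")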

Default hyperparameters correspond to the released "small" GPT-2 model.

When multiple GPUs are available, they will be used for training with the help of ``torch.distributed``.

If the run path already exists and the ``--clean`` flag is NOT passed, training is resumed. Note that all parameters still need to be specified, and the model parameters need to match.

Notes on training parameters:

* ``--batch-size`` is per GPU, so you don't need to re-tune it when changing the number of GPUs; just use the maximum that fits into memory.
* ``--g-accum-gradients`` is the global number of gradient accumulation steps; it must be divisible by the number of GPUs. The effective global batch size is always ``batch_size * g_accum_gradients`` (see the arithmetic sketch after this list).
* ``--lr`` does not need to be changed when changing ``--batch-size``, ``--g-accum-gradients``, the number of GPUs or ``--n-ctx``: the loss is already scaled appropriately.
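
For example, with hypothetical values the batch-size arithmetic works out as follows (plain arithmetic only; the numbers are illustrative)::

    batch_size = 4          # --batch-size, per GPU
    g_accum_gradients = 8   # --g-accum-gradients, global; divisible by n_gpus
    n_gpus = 2

    # Effective global batch size is independent of the number of GPUs.
    effective_batch_size = batch_size * g_accum_gradients
    # Since g_accum_gradients is global, each GPU presumably performs
    # g_accum_gradients / n_gpus accumulation steps per optimizer update.
    accum_steps_per_gpu = g_accum_gradients // n_gpus
    print(effective_batch_size, accum_steps_per_gpu)  # 32 4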

Inference
+++++++++

Example command::

    gpt-2-gen run-root "Artificial intelligence"

Here ``run-root`` is the directory containing the model checkpoints, and "Artificial intelligence" is the text prefix used as a starting point for generating tokens.

Notes on inference parameters:

* ``--tokens-to-generate``: number of tokens to generate; default is 42.
* ``--top-k``: number of token candidates to generate for each position (beam width); default is 8. A generic illustration of top-k sampling follows below.
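
To illustrate what ``--top-k`` controls, here is a generic top-k sampling step in PyTorch (an illustrative sketch of the general technique, not necessarily the exact strategy implemented by ``gpt-2-gen``)::

    import torch

    def sample_top_k(logits: torch.Tensor, top_k: int = 8) -> int:
        # Keep only the top_k most likely next-token candidates,
        # renormalize their probabilities, and sample one token id.
        top_values, top_indices = torch.topk(logits, top_k)
        probs = torch.softmax(top_values, dim=-1)
        choice = torch.multinomial(probs, num_samples=1)
        return int(top_indices[choice])

    # Example with random logits over a toy vocabulary of 100 tokens.
    logits = torch.randn(100)
    print(sample_top_k(logits, top_k=8))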

License & credits
-----------------

License is MIT.

The TensorFlow GPT-2 model is taken from https://github.com/openai/gpt-2/blob/master/src/model.py, and the TensorFlow GPT-2 training code is based on https://github.com/nshepperd/gpt-2/blob/finetuning/train.py.

The PyTorch port is based on the original OpenAI code.

The test Shakespeare corpus under ``tests/shakespeare`` is from http://shakespeare.mit.edu and is in the public domain.

See also the `OpenAI GPT-2 paper <https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf>`_ and `blog <https://openai.com/blog/better-language-models/>`_.
