All Projects → minimaxir → Gpt 2 Simple

minimaxir / Gpt 2 Simple

Licence: other
Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Gpt 2 Simple

Dialogpt
Large-scale pretraining for dialogue
Stars: ✭ 1,177 (-58.47%)
Mutual labels:  text-generation
Kobe
Source code and dataset for KDD 2019 paper "Towards Knowledge-Based Personalized Product Description Generation in E-commerce"
Stars: ✭ 148 (-94.78%)
Mutual labels:  text-generation
Texar
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/
Stars: ✭ 2,236 (-21.1%)
Mutual labels:  text-generation
Delta
DELTA is a deep learning based natural language and speech processing platform.
Stars: ✭ 1,479 (-47.81%)
Mutual labels:  text-generation
Onnxt5
Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.
Stars: ✭ 143 (-94.95%)
Mutual labels:  text-generation
Ctrl Gce
Set up the CTRL text-generating model on Google Compute Engine with just a few console commands.
Stars: ✭ 154 (-94.57%)
Mutual labels:  text-generation
Market Reporter
Automatic Generation of Brief Summaries of Time-Series Data
Stars: ✭ 54 (-98.09%)
Mutual labels:  text-generation
Attention Mechanisms
Implementations for a family of attention mechanisms, suitable for all kinds of natural language processing tasks and compatible with TensorFlow 2.0 and Keras.
Stars: ✭ 203 (-92.84%)
Mutual labels:  text-generation
Guyu
pre-training and fine-tuning framework for text generation
Stars: ✭ 144 (-94.92%)
Mutual labels:  text-generation
Download Tweets Ai Text Gen
Python script to download public Tweets from a given Twitter account into a format suitable for AI text generation.
Stars: ✭ 182 (-93.58%)
Mutual labels:  text-generation
Kadot
Kadot, the unsupervised natural language processing library.
Stars: ✭ 108 (-96.19%)
Mutual labels:  text-generation
Kogpt2 Finetuning
🔥 Korean GPT-2, KoGPT2 FineTuning cased. 한국어 가사 데이터 학습 🔥
Stars: ✭ 124 (-95.62%)
Mutual labels:  text-generation
Vae Lagging Encoder
PyTorch implementation of "Lagging Inference Networks and Posterior Collapse in Variational Autoencoders" (ICLR 2019)
Stars: ✭ 153 (-94.6%)
Mutual labels:  text-generation
Gpt2 Chitchat
GPT2 for Chinese chitchat/用于中文闲聊的GPT2模型(实现了DialoGPT的MMI思想)
Stars: ✭ 1,230 (-56.6%)
Mutual labels:  text-generation
Crslab
CRSLab is an open-source toolkit for building Conversational Recommender System (CRS).
Stars: ✭ 183 (-93.54%)
Mutual labels:  text-generation
Markov
A generic markov chain implementation in Rust.
Stars: ✭ 59 (-97.92%)
Mutual labels:  text-generation
Nlp pytorch project
Embedding, NMT, Text_Classification, Text_Generation, NER etc.
Stars: ✭ 153 (-94.6%)
Mutual labels:  text-generation
Commit Autosuggestions
A tool that AI automatically recommends commit messages.
Stars: ✭ 207 (-92.7%)
Mutual labels:  text-generation
Deep Generative Models For Natural Language Processing
DGMs for NLP. A roadmap.
Stars: ✭ 185 (-93.47%)
Mutual labels:  text-generation
Gpt 2 Tensorflow2.0
OpenAI GPT2 pre-training and sequence prediction implementation in Tensorflow 2.0
Stars: ✭ 172 (-93.93%)
Mutual labels:  text-generation

gpt-2-simple

gen_demo

A simple Python package that wraps existing model fine-tuning and generation scripts for OpenAI's GPT-2 text generation model (specifically the "small" 124M and "medium" 355M hyperparameter versions). Additionally, this package allows easier generation of text, generating to a file for easy curation, allowing for prefixes to force the text to start with a given phrase.

This package incorporates and makes minimal low-level changes to:

  • Model management from OpenAI's official GPT-2 repo (MIT License)
  • Model finetuning from Neil Shepperd's fork of GPT-2 (MIT License)
  • Text generation output management from textgenrnn (MIT License / also created by me)

For finetuning, it is strongly recommended to use a GPU, although you can generate using a CPU (albeit much more slowly). If you are training in the cloud, using a Colaboratory notebook or a Google Compute Engine VM w/ the TensorFlow Deep Learning image is strongly recommended. (as the GPT-2 model is hosted on GCP)

You can use gpt-2-simple to retrain a model using a GPU for free in this Colaboratory notebook, which also demos additional features of the package.

Note: Development on gpt-2-simple has mostly been superceded by aitextgen, which has similar AI text generation capabilities with more efficient training time and resource usage. If you do not require using TensorFlow, I recommend using aitextgen instead. Checkpoints trained using gpt-2-simple can be loaded using aitextgen as well.

Install

gpt-2-simple can be installed via PyPI:

pip3 install gpt-2-simple

You will also need to install the corresponding TensorFlow 2.X version (min 2.5.1) for your system (e.g. tensorflow or tensorflow-gpu).

Usage

An example for downloading the model to the local system, finetuning it on a dataset. and generating some text.

Warning: the pretrained 124M model, and thus any finetuned model, is 500 MB! (the pretrained 355M model is 1.5 GB)

import gpt_2_simple as gpt2
import os
import requests

model_name = "124M"
if not os.path.isdir(os.path.join("models", model_name)):
	print(f"Downloading {model_name} model...")
	gpt2.download_gpt2(model_name=model_name)   # model is saved into current directory under /models/124M/


file_name = "shakespeare.txt"
if not os.path.isfile(file_name):
	url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
	data = requests.get(url)

	with open(file_name, 'w') as f:
		f.write(data.text)


sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              file_name,
              model_name=model_name,
              steps=1000)   # steps is max number of training steps

gpt2.generate(sess)

The generated model checkpoints are by default in /checkpoint/run1. If you want to load a model from that folder and generate text from it:

import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)

gpt2.generate(sess)

As with textgenrnn, you can generate and save text for later use (e.g. an API or a bot) by using the return_as_list parameter.

single_text = gpt2.generate(sess, return_as_list=True)[0]
print(single_text)

You can pass a run_name parameter to finetune and load_gpt2 if you want to store/load multiple models in a checkpoint folder.

There is also a command-line interface for both finetuning and generation with strong defaults for just running on a Cloud VM w/ GPU. For finetuning (which will also download the model if not present):

gpt_2_simple finetune shakespeare.txt

And for generation, which generates texts to files in a gen folder:

gpt_2_simple generate

Most of the same parameters available in the functions are available as CLI arguments, e.g.:

gpt_2_simple generate --temperature 1.0 --nsamples 20 --batch_size 20 --length 50 --prefix "<|startoftext|>" --truncate "<|endoftext|>" --include_prefix False --nfiles 5

See below to see what some of the CLI arguments do.

NB: Restart the Python session first if you want to finetune on another dataset or load another model.

Differences Between gpt-2-simple And Other Text Generation Utilities

The method GPT-2 uses to generate text is slightly different than those like other packages like textgenrnn (specifically, generating the full text sequence purely in the GPU and decoding it later), which cannot easily be fixed without hacking the underlying model code. As a result:

  • In general, GPT-2 is better at maintaining context over its entire generation length, making it good for generating conversational text. The text is also generally gramatically correct, with proper capitalization and few typoes.
  • The original GPT-2 model was trained on a very large variety of sources, allowing the model to incorporate idioms not seen in the input text.
  • GPT-2 can only generate a maximum of 1024 tokens per request (about 3-4 paragraphs of English text).
  • GPT-2 cannot stop early upon reaching a specific end token. (workaround: pass the truncate parameter to a generate function to only collect text until a specified end token. You may want to reduce length appropriately.)
  • Higher temperatures work better (e.g. 0.7 - 1.0) to generate more interesting text, while other frameworks work better between 0.2 - 0.5.
  • When finetuning GPT-2, it has no sense of the beginning or end of a document within a larger text. You'll need to use a bespoke character sequence to indicate the beginning and end of a document. Then while generating, you can specify a prefix targeting the beginning token sequences, and a truncate targeting the end token sequence. You can also set include_prefix=False to discard the prefix token while generating (e.g. if it's something unwanted like <|startoftext|>).
  • If you pass a single-column .csv file to finetune(), it will automatically parse the CSV into a format ideal for training with GPT-2 (including prepending <|startoftext|> and suffixing <|endoftext|> to every text document, so the truncate tricks above are helpful when generating output). This is necessary to handle both quotes and newlines in each text document correctly.
  • GPT-2 allows you to generate texts in parallel by setting a batch_size that is divisible into nsamples, resulting in much faster generation. Works very well with a GPU (can set batch_size up to 20 on Colaboratory's K80)!
  • Due to GPT-2's architecture, it scales up nicely with more powerful GPUs. For the 124M model, if you want to train for longer periods of time, GCP's P100 GPU is about 3x faster than a K80/T4 for only 3x the price, making it price-comparable (the V100 is about 1.5x faster than the P100 but about 2x the price). The P100 uses 100% of the GPU even with batch_size=1, and about 88% of the V100 GPU.
  • If you have a partially-trained GPT-2 model and want to continue finetuning it, you can set overwrite=True to finetune, which will continue training and remove the previous iteration of the model without creating a duplicate copy. This can be especially useful for transfer learning (e.g. heavily finetune GPT-2 on one dataset, then finetune on other dataset to get a "merging" of both datasets).
  • If your input text dataset is massive (>100 MB), you may want to preencode and compress the dataset using gpt2.encode_dataset(file_path). THe output is a compressed .npz file which will load much faster into the GPU for finetuning.
  • The 774M "large" model may support finetuning because it will cause modern GPUs to go out-of-memory (you may get lucky if you use a P100 GPU on Colaboratory). However, you can still generate from the default pretrained model using gpt2.load_gpt2(sess, model_name='774M') and gpt2.generate(sess, model_name='774M').
  • The 1558M "extra large", true model, may not work out-of-the-box with the GPU included with the Colaboratory Notebook. More testing is needed to identify optimial configurations for it.

Interactive Apps Using gpt-2-simple

  • gpt2-small — App using the default GPT-2 124M pretrained model
  • gpt2-reddit — App to generate Reddit titles based on a specified subreddit and/or keyword(s)
  • gpt2-mtg — App to generate Magic: The Gathering cards

Text Generation Examples Using gpt-2-simple

Maintainer/Creator

Max Woolf (@minimaxir)

Max's open-source projects are supported by his Patreon. If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.

License

MIT

Disclaimer

This repo has no affiliation or relationship with OpenAI.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].