
prakhar21 / TextAugmentation-GPT2

License: MIT
Fine-tuned pre-trained GPT-2 for custom topic-specific text generation. Such a system can be used for text augmentation.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to TextAugmentation-GPT2

Nlp Conference Compendium
Compendium of the resources available from top NLP conferences.
Stars: ✭ 349 (+235.58%)
Mutual labels:  natural-language-processing, nlp-machine-learning, natural-language-generation
Natural Language Processing Specialization
This repo contains my coursework, assignments, and Slides for Natural Language Processing Specialization by deeplearning.ai on Coursera
Stars: ✭ 151 (+45.19%)
Mutual labels:  natural-language-processing, nlp-machine-learning, natural-language-generation
Pqg Pytorch
Paraphrase Generation model using pair-wise discriminator loss
Stars: ✭ 33 (-68.27%)
Mutual labels:  natural-language-processing, natural-language-generation
Coursera Natural Language Processing Specialization
Programming assignments from all courses in the Coursera Natural Language Processing Specialization offered by deeplearning.ai.
Stars: ✭ 39 (-62.5%)
Mutual labels:  natural-language-processing, nlp-machine-learning
Nlg Rl
Accelerated Reinforcement Learning for Sentence Generation by Vocabulary Prediction
Stars: ✭ 59 (-43.27%)
Mutual labels:  natural-language-processing, natural-language-generation
This Word Does Not Exist
This Word Does Not Exist
Stars: ✭ 640 (+515.38%)
Mutual labels:  natural-language-processing, natural-language-generation
Pplm
Plug and Play Language Model implementation. Allows to steer topic and attributes of GPT-2 models.
Stars: ✭ 674 (+548.08%)
Mutual labels:  natural-language-processing, natural-language-generation
Convai Baseline
ConvAI baseline solution
Stars: ✭ 49 (-52.88%)
Mutual labels:  natural-language-processing, natural-language-generation
Text mining resources
Resources for learning about Text Mining and Natural Language Processing
Stars: ✭ 358 (+244.23%)
Mutual labels:  natural-language-processing, nlp-machine-learning
Gpt2
PyTorch Implementation of OpenAI GPT-2
Stars: ✭ 64 (-38.46%)
Mutual labels:  natural-language-processing, natural-language-generation
Languagetoys
Random fun with statistical language models.
Stars: ✭ 63 (-39.42%)
Mutual labels:  natural-language-processing, natural-language-generation
Repo 2016
R, Python and Mathematica Codes in Machine Learning, Deep Learning, Artificial Intelligence, NLP and Geolocation
Stars: ✭ 103 (-0.96%)
Mutual labels:  natural-language-processing, nlp-machine-learning
Rnnlg
RNNLG is an open source benchmark toolkit for Natural Language Generation (NLG) in spoken dialogue system application domains. It is released by Tsung-Hsien (Shawn) Wen from Cambridge Dialogue Systems Group under Apache License 2.0.
Stars: ✭ 487 (+368.27%)
Mutual labels:  natural-language-processing, natural-language-generation
Practical Pytorch
Go to https://github.com/pytorch/tutorials - this repo is deprecated and no longer maintained
Stars: ✭ 4,329 (+4062.5%)
Mutual labels:  natural-language-processing, natural-language-generation
Nlg Eval
Evaluation code for various unsupervised automated metrics for Natural Language Generation.
Stars: ✭ 822 (+690.38%)
Mutual labels:  natural-language-processing, natural-language-generation
Question generation
Neural question generation using transformers
Stars: ✭ 356 (+242.31%)
Mutual labels:  natural-language-processing, natural-language-generation
Ludwig
Data-centric declarative deep learning framework
Stars: ✭ 8,018 (+7609.62%)
Mutual labels:  natural-language-processing, natural-language-generation
Lingua
👄 The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike
Stars: ✭ 341 (+227.88%)
Mutual labels:  natural-language-processing, nlp-machine-learning
Lda Topic Modeling
A PureScript, browser-based implementation of LDA topic modeling.
Stars: ✭ 91 (-12.5%)
Mutual labels:  natural-language-processing, nlp-machine-learning
How To Mine Newsfeed Data And Extract Interactive Insights In Python
A practical guide to topic mining and interactive visualizations
Stars: ✭ 61 (-41.35%)
Mutual labels:  natural-language-processing, nlp-machine-learning

TextAugmentation-GPT2

[Image: GPT-2 model size representation]

Fine-tuned pre-trained GPT-2 for topic-specific text generation. Such a system can be used for text augmentation.

Getting Started

  1. git clone https://github.com/prakhar21/TextAugmentation-GPT2.git
  2. Move your data to the data/ directory.

* Please refer to data/SMSSpamCollection to get an idea of the expected file format.
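The repository does not spell out the file layout beyond pointing at data/SMSSpamCollection; the sketch below assumes the same tab-separated `<label>\t<text>` layout as that file. The helper name and sample lines are illustrative, not taken from the repo.

```python
def load_labeled_lines(path):
    """Parse a tab-separated <label>\t<text> file into (label, text) pairs."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            label, text = line.split("\t", 1)
            pairs.append((label, text))
    return pairs

# Write a tiny sample file in the assumed format, then read it back.
sample = "ham\tAre we still on for lunch?\nspam\tYou have won a prize! Call now.\n"
with open("sample_data.txt", "w", encoding="utf-8") as f:
    f.write(sample)

pairs = load_labeled_lines("sample_data.txt")
for label, text in pairs:
    print(label, "->", text)
```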

Tuning for own Corpus

  1. Assuming you are done with Step 2 under Getting Started
  2. Run python3 train.py --data_file <filename> --epoch <number_of_epochs> --warmup <warmup_steps> --model_name <model_name> --max_len <max_seq_length> --learning_rate <learning_rate> --batch <batch_size>

Generating Text

  1. python3 generate.py --model_name <model_name> --sentences <number_of_sentences> --label <class_of_training_data>

* It is recommended that you tune the parameters for your task; otherwise the default parameters will be used, which may give sub-optimal performance.

Quick Testing

I have fine-tuned the model on the SPAM/HAM dataset. You can download it from here and follow the steps mentioned under the Generating Text section.

Sample Results

SPAM: You have 2 new messages. Please call 08719121161 now. £3.50. Limited time offer. Call 090516284580.<|endoftext|>
SPAM: Want to buy a car or just a drink? This week only 800p/text betta...<|endoftext|>
SPAM: FREE Call Todays top players, the No1 players and their opponents and get their opinions on www.todaysplay.co.uk Todays Top Club players are in the draw for a chance to be awarded the £1000 prize. TodaysClub.com<|endoftext|>
SPAM: you have been awarded a £2000 cash prize. call 090663644177 or call 090530663647<|endoftext|>

HAM: Do you remember me?<|endoftext|>
HAM: I don't think so. You got anything else?<|endoftext|>
HAM: Ugh I don't want to go to school.. Cuz I can't go to exam..<|endoftext|>
HAM: K.,k:)where is my laptop?<|endoftext|>

Important Points to Note

  • Top-k and top-p (nucleus) sampling have been used while decoding the sequence word by word. You can read more about them here
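As a rough illustration of that decoding step, here is a minimal, dependency-free sketch of top-k followed by top-p filtering over a toy next-token distribution. The function name and logit values are illustrative, not taken from generate.py.

```python
import math
import random

def top_k_top_p_sample(logits, k=5, p=0.9):
    """Sample a token after top-k and top-p (nucleus) filtering.

    logits: dict mapping token -> unnormalized score.
    """
    # Softmax over the logits (shift by the max for numerical stability).
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    probs = {tok: e / z for tok, e in exps.items()}

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

    # Top-p: keep the smallest prefix whose cumulative probability >= p.
    kept, cum = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        cum += pr
        if cum >= p:
            break

    # Renormalize the survivors and draw one token.
    total = sum(pr for _, pr in kept)
    r, acc = random.random() * total, 0.0
    for tok, pr in kept:
        acc += pr
        if acc >= r:
            return tok
    return kept[-1][0]

logits = {"the": 4.0, "a": 3.0, "dog": 1.0, "xyzzy": -2.0}
print(top_k_top_p_sample(logits, k=3, p=0.9))
```

Only the k highest-scoring tokens survive the first cut, and the nucleus cut then drops the remaining low-probability tail, so very unlikely tokens are never sampled.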

Note: The first time you run it, it will take a considerable amount of time, for the following reasons:

  1. It downloads the pre-trained gpt2-medium model (depends on your network speed)
  2. It fine-tunes GPT-2 on your dataset (depends on the size of the data, epochs, hyperparameters, etc.)

All the experiments were done on Intel DevCloud machines.
