
Andrew03 / transformer-abstractive-summarization

Licence: other
Code for the paper "Efficient Adaptation of Pretrained Transformers for Abstractive Summarization"

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to transformer-abstractive-summarization

TS3000 TheChatBOT
It's a social networking chatbot trained on a Reddit dataset. It supports open-ended queries and is built on the concept of Neural Machine Translation. Beware, it can be sarcastic, just like its creator 😝. By the way, it uses the PyTorch framework and Python 3.
Stars: ✭ 20 (-70.59%)
Mutual labels:  pytorch-nlp
Awesome-Pytorch-Tutorials
Awesome Pytorch Tutorials
Stars: ✭ 23 (-66.18%)
Mutual labels:  pytorch-nlp
Text-Classification-LSTMs-PyTorch
The aim of this repository is to show a baseline model for text classification by implementing an LSTM-based model in PyTorch. To provide a better understanding of the model, it is trained on a Tweets dataset provided by Kaggle.
Stars: ✭ 45 (-33.82%)
Mutual labels:  pytorch-nlp
py-lingualytics
A text analytics library with support for codemixed data
Stars: ✭ 36 (-47.06%)
Mutual labels:  pytorch-nlp
Entity2Topic
[NAACL2018] Entity Commonsense Representation for Neural Abstractive Summarization
Stars: ✭ 20 (-70.59%)
Mutual labels:  document-summarization
Intelligent Document Finder
Document Search Engine Tool
Stars: ✭ 45 (-33.82%)
Mutual labels:  document-summarization
Pytorch Sentiment Analysis
Tutorials on getting started with PyTorch and TorchText for sentiment analysis.
Stars: ✭ 3,209 (+4619.12%)
Mutual labels:  pytorch-nlp
Pytorch Nlp
Basic Utilities for PyTorch Natural Language Processing (NLP)
Stars: ✭ 1,996 (+2835.29%)
Mutual labels:  pytorch-nlp
Pytorchdocs
Official PyTorch tutorials in Chinese, covering a 60-minute quick-start guide, in-depth tutorials, computer vision, natural language processing, generative adversarial networks, and reinforcement learning. Stars and forks welcome!
Stars: ✭ 1,705 (+2407.35%)
Mutual labels:  pytorch-nlp
Pytorch Seq2seq
Tutorials on implementing a few sequence-to-sequence (seq2seq) models with PyTorch and TorchText.
Stars: ✭ 3,418 (+4926.47%)
Mutual labels:  pytorch-nlp
pytorch-transformer-chatbot
A simple chitchat chatbot built with the Transformer API introduced in PyTorch v1.2.
Stars: ✭ 44 (-35.29%)
Mutual labels:  pytorch-nlp
nlp classification
Implementations of NLP classification papers with PyTorch and gluonnlp
Stars: ✭ 224 (+229.41%)
Mutual labels:  pytorch-nlp

Code for the paper "Efficient Adaptation of Pretrained Transformers for Abstractive Summarization"

Requirements

To run the training script in train.py, you will additionally need the following (an example install command is given after the list):

  • PyTorch (version >=0.4)
  • tqdm
  • pyrouge
  • newsroom
  • tensorflow (cpu version is ok)
  • nltk
  • spacy (and 'en' model)
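
A minimal install sketch, assuming each package name matches its PyPI name (the newsroom library may instead need to be installed from its GitHub repository):

pip install torch tqdm pyrouge newsroom tensorflow nltk spacy
python -m spacy download en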

You can download the weights of the OpenAI pre-trained version by cloning Alec Radford's repo and placing the model folder containing the pre-trained weights in the present repo.
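
For example, assuming the weights come from OpenAI's finetune-transformer-lm repository (the usual source of these pre-trained weights):

git clone https://github.com/openai/finetune-transformer-lm.git
cp -r finetune-transformer-lm/model ./model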

In order to run this code, you will need to pre-process the datasets with BPE using the scripts provided in the scripts directory.

Dataset Preprocessing

The training and evaluation scripts expect three output files in total: train_encoded.jsonl, val_encoded.jsonl, and test_encoded.jsonl.

CNN/Daily Mail

The data and splits used in the paper can be downloaded from OpenNMT. First, remove the start and end sentence tags using the sed command given at that link. To process the data, run the following command:

python scripts/encode_cnndm.py --src_file {source file} --tgt_file {target file} --out_file {output file}
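
For example (hypothetical file names; the sed expressions assume OpenNMT's <t> and </t> sentence tags):

sed -i 's/<t>//g; s/<\/t>//g' cnndm/train.txt.tgt.tagged
python scripts/encode_cnndm.py --src_file cnndm/train.txt.src --tgt_file cnndm/train.txt.tgt.tagged --out_file train_encoded.jsonl

Repeat for the validation and test splits to produce val_encoded.jsonl and test_encoded.jsonl.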

XSum

The data and splits used in the paper can be scraped with the XSum repository's scripts. Run its commands up through the "Extract text from HTML Files" section. To process the data, run the following command:

python scripts/encode_xsum.py --summary_dir {summary directory} --splits_file {split file} --train_file {train file} --val_file {val file} --test_file {test_file}
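
For example (hypothetical paths; the split file name follows the XSum repository's convention):

python scripts/encode_xsum.py \
  --summary_dir xsum-extracts-from-downloads \
  --splits_file XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json \
  --train_file train_encoded.jsonl \
  --val_file val_encoded.jsonl \
  --test_file test_encoded.jsonl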

Newsroom

The data and splits used in the paper can be downloaded from Newsroom. To process the data, run the following command:

python scripts/encode_newsroom.py --in_file {input split file} --out_file {output file}
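
For example (hypothetical paths), run the script once per split to produce the three expected files:

python scripts/encode_newsroom.py --in_file newsroom/train.jsonl --out_file train_encoded.jsonl
python scripts/encode_newsroom.py --in_file newsroom/dev.jsonl --out_file val_encoded.jsonl
python scripts/encode_newsroom.py --in_file newsroom/test.jsonl --out_file test_encoded.jsonl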

Training

To train a model, run the following command:

python train.py \
  --data_dir {directory containing encoded data} \
  --output_dir {name of folder to save data in} \
  --experiment_name {name of experiment to save data with} \
  --show_progress \
  --doc_model \
  --num_epochs_dat 10 \
  --num_epochs_ft 10 \
  --n_batch 16 \
  --accum_iter 4 \
  --use_pretrain

This trains the pre-trained document embedding model on the dataset for 10 epochs of domain-adaptive training followed by 10 epochs of fine-tuning. The model is trained with an effective batch size of 64, since the actual batch size is 16 and gradients are accumulated over 4 batches. The batch size must be divisible by the number of GPUs available. Training is currently optimized for multi-GPU usage and may not work on single-GPU machines.
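
As a generic illustration of how --accum_iter yields the effective batch size (this is not the repository's actual training loop; model, loader, optimizer, and loss_fn are placeholders):

def train_epoch(model, loader, optimizer, loss_fn, accum_iter=4):
    # With a per-step batch of 16 and accum_iter = 4, parameters are updated
    # once every 4 batches, i.e. with an effective batch size of 16 * 4 = 64.
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        loss = loss_fn(model(inputs), targets)
        (loss / accum_iter).backward()   # scale so accumulated gradients average over the effective batch
        if (step + 1) % accum_iter == 0:
            optimizer.step()             # one parameter update per accum_iter mini-batches
            optimizer.zero_grad()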

Evaluation

To evaluate a model, run the following command:

python evaluate.py \
  --data_file {path to encoded data file} \
  --checkpoint {checkpoint to load model weights from} \
  --beam {beam size to do beam search with} \
  --doc_model \
  --save_file {file to output results to} \
  --n_batch {batch size for evaluation, must be divisible by number of gpus}

This evaluates the document embedding model on the test set. Evaluation is currently optimized for multi-GPU usage and may not work on single-GPU machines. Since the evaluation script will leave out some examples if the number of data points isn't divisible by the number of GPUs, you may need to run the create_small_test.py script to recover the examples that were left out and aggregate the results at the end.
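
For example, with hypothetical values filled in for the placeholders:

python evaluate.py \
  --data_file test_encoded.jsonl \
  --checkpoint output/summarization/checkpoint_best.pt \
  --beam 4 \
  --doc_model \
  --save_file results.jsonl \
  --n_batch 16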
