
abrazinskas / Copycat-abstractive-opinion-summarizer

Licence: MIT License
ACL 2020 Unsupervised Opinion Summarization as Copycat-Review Generation


Projects that are alternatives of or similar to Copycat-abstractive-opinion-summarizer

PlanSum
[AAAI2021] Unsupervised Opinion Summarization with Content Planning
Stars: ✭ 25 (-67.11%)
Mutual labels:  amazon, reviews, yelp, summarization, natural-language-generation, abstractive-text-summarization, abstractive-summarization, opinion-summarization
DocSum
A tool to automatically summarize documents abstractively using the BART or PreSumm Machine Learning Model.
Stars: ✭ 58 (-23.68%)
Mutual labels:  summarization, abstractive-text-summarization, abstractive-summarization
gazeta
Gazeta: Dataset for automatic summarization of Russian news / Газета: набор данных для автоматического реферирования на русском языке
Stars: ✭ 25 (-67.11%)
Mutual labels:  summarization, abstractive-text-summarization, abstractive-summarization
Entity2Topic
[NAACL2018] Entity Commonsense Representation for Neural Abstractive Summarization
Stars: ✭ 20 (-73.68%)
Mutual labels:  summarization, natural-language-generation, abstractive-summarization
SelSum
Abstractive opinion summarization system (SelSum) and the largest dataset of Amazon product summaries (AmaSum). EMNLP 2021 conference paper.
Stars: ✭ 36 (-52.63%)
Mutual labels:  amazon, summarization
Mojitalk
Code for "MojiTalk: Generating Emotional Responses at Scale" https://arxiv.org/abs/1711.04090
Stars: ✭ 107 (+40.79%)
Mutual labels:  vae, natural-language-generation
Adversarial video summary
Unofficial PyTorch Implementation of SUM-GAN from "Unsupervised Video Summarization with Adversarial LSTM Networks" (CVPR 2017)
Stars: ✭ 187 (+146.05%)
Mutual labels:  vae, summarization
seq3
Source code for the NAACL 2019 paper "SEQ^3: Differentiable Sequence-to-Sequence-to-Sequence Autoencoder for Unsupervised Abstractive Sentence Compression"
Stars: ✭ 121 (+59.21%)
Mutual labels:  summarization, abstractive-summarization
factsumm
FactSumm: Factual Consistency Scorer for Abstractive Summarization
Stars: ✭ 83 (+9.21%)
Mutual labels:  summarization, abstractive-summarization
FewSum
Few-shot learning framework for opinion summarization published at EMNLP 2020.
Stars: ✭ 29 (-61.84%)
Mutual labels:  summarization, opinion-summarization
data-summ-cnn dailymail
non-anonymized cnn/dailymail dataset for text summarization
Stars: ✭ 12 (-84.21%)
Mutual labels:  summarization, abstractive-text-summarization
xl-sum
This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
Stars: ✭ 160 (+110.53%)
Mutual labels:  abstractive-text-summarization, abstractive-summarization
text2text
Text2Text: Cross-lingual natural language processing and generation toolkit
Stars: ✭ 188 (+147.37%)
Mutual labels:  summarization, natural-language-generation
go-amazon-product-advertising-api
Go Client Library for Amazon Product Advertising API
Stars: ✭ 51 (-32.89%)
Mutual labels:  amazon
amz sp api
AmzSpApi - Unofficial Ruby gem for the Selling Partner APIs (SP-API)
Stars: ✭ 22 (-71.05%)
Mutual labels:  amazon
jarvis
Jarvis Home Automation
Stars: ✭ 81 (+6.58%)
Mutual labels:  amazon
oracdc
Oracle database CDC (Change Data Capture)
Stars: ✭ 51 (-32.89%)
Mutual labels:  amazon
FYP-AutoTextSum
Automatic Text Summarization with Machine Learning
Stars: ✭ 16 (-78.95%)
Mutual labels:  summarization
sidenet
SideNet: Neural Extractive Summarization with Side Information
Stars: ✭ 52 (-31.58%)
Mutual labels:  summarization
ladder-vae-pytorch
Ladder Variational Autoencoders (LVAE) in PyTorch
Stars: ✭ 59 (-22.37%)
Mutual labels:  vae

Unsupervised Opinion Summarization as Copycat-Review Generation

This repository contains the Python (PyTorch) codebase of the corresponding paper, accepted at ACL 2020.

The model is fully unsupervised and is trained on a large corpus of customer reviews, such as Yelp or Amazon. It generates abstractive summaries that condense common opinions across a group of reviews. It relies on Bayesian auto-encoding, which fosters the learning of rich hierarchical semantic representations of reviews and products. Finally, the model uses a copy mechanism to better preserve details of the input reviews.
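For intuition, below is a minimal, illustrative PyTorch sketch of such a two-level latent hierarchy: a product-level code c shared by all reviews of a product, and per-review codes z conditioned on c. All names and dimensions here are assumptions for illustration, not the repository's actual implementation.

import torch
import torch.nn as nn

class HierarchicalLatents(nn.Module):
    """Illustrative two-level latent structure (hypothetical, simplified)."""

    def __init__(self, hidden_dim=512, c_dim=600, z_dim=600):
        super().__init__()
        self.c_head = nn.Linear(hidden_dim, 2 * c_dim)           # -> (mu, logvar) of c
        self.z_head = nn.Linear(hidden_dim + c_dim, 2 * z_dim)   # -> (mu, logvar) of z

    @staticmethod
    def sample(stats):
        mu, logvar = stats.chunk(2, dim=-1)
        # reparameterization trick: mu + sigma * eps, eps ~ N(0, I)
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    def forward(self, review_states):
        # review_states: (batch, n_reviews, hidden_dim) review encodings
        product_state = review_states.mean(dim=1)      # pool the reviews of one product
        c = self.sample(self.c_head(product_state))    # product-level code
        c_tiled = c.unsqueeze(1).expand(-1, review_states.size(1), -1)
        z = self.sample(self.z_head(torch.cat([review_states, c_tiled], dim=-1)))
        return c, z                                    # z: per-review codes

model = HierarchicalLatents()
c, z = model(torch.randn(2, 8, 512))  # e.g., 2 products with 8 reviews each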

Example summaries produced by the system are shown below.

  • This restaurant is a hidden gem in Toronto. The food is delicious, and the service is impeccable. Highly recommend for anyone who likes French bistro.

  • This is a great case for the Acer Aspire 14" laptop. It is a little snug for my laptop, but it's a nice case. I would recommend it to anyone who wants to protect their laptop.

  • This is the best steamer I have ever owned. It is easy to use and easy to clean. I have used it several times and it works great. I would recommend it to anyone looking for a steamer.

For more examples, please refer to the artifacts folder.

Installation

The easiest way to proceed is to create a separate conda environment.

conda create -n copycat python=3.6.9
conda activate copycat

Install required modules.

pip install -r requirements.txt

Add the root directory to the path.

export PYTHONPATH=root_path:$PYTHONPATH

Data

Our model is trained on two collections of customer reviews: Amazon and Yelp. The evaluation was performed on human-annotated summaries based on both datasets.

Unsupervised data

The Yelp and Amazon datasets must be preprocessed and placed in the data folder. See the instructions in the preprocessing folder.

Input Data Format

If training is to be performed on a separate dataset, the expected input format is illustrated in artifacts. Each business/product must be stored in a separate CSV file, where each line corresponds to one review.

group_id     review_text                      rating   category
159985130X   We recommend the Magnifier ...   4.0      health_and_personal_care

The rating column is optional as it is not used by the model.
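
For concreteness, here is a hypothetical snippet that writes one product's reviews in this layout with Python's csv module (paths and values are made up; a tab-separated file with a header row is assumed):

import csv

rows = [{
    "group_id": "159985130X",
    "review_text": "We recommend the Magnifier ...",
    "rating": "4.0",
    "category": "health_and_personal_care",
}]
with open("data/amazon/159985130X.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["group_id", "review_text", "rating", "category"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)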

Evaluation Summaries

Evaluation can be performed on human-created summaries; both the Amazon and Yelp summaries are publicly available. No preprocessing is needed for evaluation. The Amazon summaries were created by us on the Mechanical Turk platform; more information on the process can be found in the corresponding folder.

Running

If you preprocessed the data yourself, please create a vocabulary and a truecaser as described below. Otherwise, you can skip the following two sections.

Vocabulary Creation

The vocabulary file contains a mapping from words to frequencies, where a word's position in the file corresponds to the id used by the model.

python copycat/scripts/create_vocabulary.py --data_path=your_data_path --vocab_fp=data/dataset_name/vocabs/vocab.txt
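
As a rough illustration of this format (the actual script may store things slightly differently), such a file could be produced as follows:

from collections import Counter

counts = Counter()
with open("your_data.txt") as f:          # hypothetical corpus file
    for line in f:
        counts.update(line.split())

with open("vocab.txt", "w") as out:
    # one "word<TAB>frequency" entry per line; the line number acts as the id
    for word, freq in counts.most_common():
        out.write(f"{word}\t{freq}\n")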

Truecaser Creation

The truecaser restores the original casing of lowercased text; it needs to be trained (quickly) by scanning the dataset. Note that multiple folders can be assigned to the data_path parameter.

python copycat/scripts/train_truecaser.py --data_path=your_data_path --tcaser_fp=data/dataset_name/tcaser.model
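
Conceptually, a frequency-based truecaser can be as simple as the sketch below, which records the most common surface casing of each word during training and restores it at generation time (an illustration, not the repository's implementation):

from collections import Counter, defaultdict

casing = defaultdict(Counter)
with open("your_data.txt") as f:          # hypothetical corpus file
    for line in f:
        for tok in line.split():
            casing[tok.lower()][tok] += 1  # count surface forms per lowercased word

truecase = {low: forms.most_common(1)[0][0] for low, forms in casing.items()}

def restore(tokens):
    # map lowercased model output back to its most frequent casing
    return [truecase.get(t, t) for t in tokens]

print(restore(["i", "love", "toronto"]))   # e.g. ['I', 'love', 'Toronto']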

Workflow

Set the workflow parameters in copycat/hparams/run_hp.py, e.g., by altering data paths or specifying the number of training epochs.

The file run_workflow.py contains the workflow of operations that prepares the necessary objects (e.g., the beam search generator) and then runs training and/or evaluation. After adjusting the run parameters, execute the following command.

python copycat/scripts/run_workflow.py

Summary generation

Summaries can also be generated from CSV files via run_workflow.py. The input must follow the CSV format of copycat/amazon/data/infer_input.csv: review columns must be named 'rev1', ..., 'revN', and tab must be used as the separator.

python copycat/scripts/run_workflow.py --infer-input-file-path=your_csv_input_file_path --infer-batch-size=20
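
A hypothetical way to construct such an input file (the column names 'rev1' ... 'revN' and the tab separator follow the description above):

import csv

reviews = ["Great case, fits well.", "A bit snug but sturdy.", "Would buy again."]
header = [f"rev{i}" for i in range(1, len(reviews) + 1)]  # rev1 ... revN
with open("infer_input.csv", "w", newline="") as f:
    w = csv.writer(f, delimiter="\t")
    w.writerow(header)
    w.writerow(reviews)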

Checkpoints

Amazon and Yelp checkpoints are available for download. Please place them in copycat/artifacts/, under the corresponding dataset sub-folders.

LICENSE

MIT

Citation

@inproceedings{brazinskas2020-unsupervised,
    title = "Unsupervised Opinion Summarization as Copycat-Review Generation",
    author = "Bra{\v{z}}inskas, Arthur  and
      Lapata, Mirella  and
      Titov, Ivan",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.461",
    doi = "10.18653/v1/2020.acl-main.461",
    pages = "5151--5169"
}

Notes

  • Minor deviations from the published results are expected, as the code was migrated from a bleeding-edge PyTorch version and Python 2.7.

  • After publication, we added a beam search generator with n-gram blocking functionality (based on OpenNMT), which reduces repetition; see the sketch after this list.

  • The setup was fully tested with Python 3.6.9.

  • The model works on a single GPU only.

  • mltoolkit provides the backbone functionality for data processing and modelling. Make sure it's visible to the interpreter.
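
Regarding the n-gram blocking note above, the underlying idea can be sketched as follows (in the spirit of OpenNMT's implementation, not copied from this codebase):

def blocks_repeat_ngram(prefix, candidate, n=3):
    """Return True if appending `candidate` to the decoded `prefix` would
    recreate an n-gram that was already generated."""
    seq = prefix + [candidate]
    if len(seq) < n:
        return False
    seen = {tuple(seq[i:i + n]) for i in range(len(seq) - n)}
    return tuple(seq[-n:]) in seen

# During beam search, hypotheses whose next token would repeat a trigram
# are pruned (or assigned a score of -inf).
print(blocks_repeat_ngram(["the", "food", "is", "great", "the", "food"], "is"))  # True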
