Projects that are alternatives of or similar to Dialogrpt

Gossiping Chinese Corpus
A Chinese Q&A corpus from the PTT Gossiping board
Stars: ✭ 137 (-36.57%)
Mutual labels:  dataset, dialog
Seq2seqchatbots
A wrapper around tensor2tensor to flexibly train, interact, and generate data for neural chatbots.
Stars: ✭ 466 (+115.74%)
Mutual labels:  dataset, dialog
Mmf
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
Stars: ✭ 4,713 (+2081.94%)
Mutual labels:  pretrained-models, dialog
Gensim Data
Data repository for pretrained NLP models and NLP corpora.
Stars: ✭ 622 (+187.96%)
Mutual labels:  dataset, pretrained-models
Personalized Dialog
Code for the paper 'Personalization in Goal-oriented Dialog' (NeurIPS 2017 Conversational AI Workshop)
Stars: ✭ 109 (-49.54%)
Mutual labels:  dataset, dialog
Musical Onset Efficient
Supplementary information and code for the paper: An efficient deep learning model for musical onset detection
Stars: ✭ 26 (-87.96%)
Mutual labels:  dataset, pretrained-models
Cluepretrainedmodels
A collection of high-quality Chinese pretrained models: state-of-the-art large models, the fastest small models, and dedicated similarity models
Stars: ✭ 493 (+128.24%)
Mutual labels:  dataset, pretrained-models
Dialog corpus
Datasets for training Chinese and English chatbot systems
Stars: ✭ 1,662 (+669.44%)
Mutual labels:  dataset, dialog
Clue
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus, and leaderboard
Stars: ✭ 2,425 (+1022.69%)
Mutual labels:  dataset, pretrained-models
Pmalertcontroller
PMAlertController is a great and customizable alert that can substitute UIAlertController
Stars: ✭ 2,397 (+1009.72%)
Mutual labels:  dialog
Dialogue
Node based dialogue system
Stars: ✭ 207 (-4.17%)
Mutual labels:  dialog
Awesome Json Datasets
A curated list of awesome JSON datasets that don't require authentication.
Stars: ✭ 2,421 (+1020.83%)
Mutual labels:  dataset
Accessible modal window
Accessible modal dialogs
Stars: ✭ 196 (-9.26%)
Mutual labels:  dialog
Mini Imagenet Tools
Tools for generating mini-ImageNet dataset and processing batches
Stars: ✭ 209 (-3.24%)
Mutual labels:  dataset
Trump Lies
Tutorial: Web scraping in Python with Beautiful Soup
Stars: ✭ 201 (-6.94%)
Mutual labels:  dataset
Pynasa
Stars: ✭ 212 (-1.85%)
Mutual labels:  dataset
Unit Uskit
unit-uskit
Stars: ✭ 197 (-8.8%)
Mutual labels:  dialog
Arbitrary Text To Image Papers
A collection of arbitrary text to image papers with code (constantly updating)
Stars: ✭ 196 (-9.26%)
Mutual labels:  dialog
Ava downloader
⏬ Download AVA dataset (A Large-Scale Database for Aesthetic Visual Analysis)
Stars: ✭ 214 (-0.93%)
Mutual labels:  dataset
Omnianomaly
KDD 2019: Robust Anomaly Detection for Multivariate Time Series through Stochastic Recurrent Neural Network
Stars: ✭ 208 (-3.7%)
Mutual labels:  dataset



DialogRPT: Dialog Ranking Pretrained Transformers

DialogRPT predicts human feedback (upvotes👍 or replies💬) given to dialogue responses.

It is a set of dialog response ranking models proposed by the Microsoft Research NLP Group, trained on over 100 million human feedback samples, and accepted to appear at EMNLP'20. It can be used to improve existing dialog generation models (e.g., DialoGPT) by re-ranking the generated response candidates. This repo provides a PyTorch implementation and pretrained models.

We considered the following tasks and provide corresponding pretrained models. (Click 💾 to download the original PyTorch checkpoint for this repo, or click 🤗 to use the HuggingFace model card.)

| Task | Description | Pretrained model |
| :--- | :--- | :---: |
| Human feedback | | |
| updown | How likely is the response to get the most upvotes? | 💾 / 🤗 |
| width | How likely is the response to get the most direct replies? | 💾 / 🤗 |
| depth | How likely is the response to get the longest follow-up thread? | 💾 / 🤗 |
| Human-like (human vs fake) | | |
| human_vs_rand | How relevant is the response to the given context? | 💾 / 🤗 |
| human_vs_machine | How likely is the response human-written rather than machine-generated? | 💾 / 🤗 |


Quick Start

Install

Option 1: run locally

git clone https://github.com/golsun/DialogRPT
cd DialogRPT
conda create -n dialogrpt python=3.6
conda activate dialogrpt
pip install -r requirements.txt

Option 2: run on a Colab notebook. You can use either the original demo or the HuggingFace demo.

Use rankers only

In the following example, the model predicts that, given the same context "I love NLP!", the response "Here’s a free textbook (URL) in case anyone needs it." is more likely to get upvotes than the response "Me too!".

python src/score.py play -p=restore/updown.pth
#
# Context:  I love NLP!
# Response: Here’s a free textbook (URL) in case anyone needs it.
# score = 0.613

# Context:  I love NLP!
# Response: Me too!
# score = 0.111
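
If you prefer the 🤗 checkpoints, the same scoring can be reproduced with the transformers library. Below is a minimal sketch along the lines of the HuggingFace model cards, assuming the microsoft/DialogRPT-updown checkpoint and its convention of joining context and response with the <|endoftext|> token.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialogRPT-updown")
model = AutoModelForSequenceClassification.from_pretrained("microsoft/DialogRPT-updown")

def score(context, response):
    # Context and response are joined by <|endoftext|>; the model outputs
    # a single logit, squashed to (0, 1) with a sigmoid.
    ids = tokenizer.encode(context + "<|endoftext|>" + response, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids).logits
    return torch.sigmoid(logits[0, 0]).item()

print(score("I love NLP!", "Here’s a free textbook (URL) in case anyone needs it."))
print(score("I love NLP!", "Me too!"))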

You can also play with the ensemble model, which combines multiple models defined in its config file (restore/ensemble.yml).

python src/main.py play -p=restore/ensemble.yml

To score a list of (context, response) pairs, provide an input file (--data), which is tab-separated in the format context \t response0 \t response1 .... See the example input file doc/toy.tsv.
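
For reference, such a file can be produced with a few lines of Python (a hypothetical helper; my_input.tsv is a made-up name, and any file in this format works with --data):

rows = [
    ("I love NLP!", "Here’s a free textbook (URL) in case anyone needs it.", "Me too!"),
    ("Can we restart 2020?", "Yes, we can.", "I'm down"),
]
with open("my_input.tsv", "w") as f:
    # One context per line, followed by tab-separated candidate responses.
    for row in rows:
        f.write("\t".join(row) + "\n")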

python src/score.py test --data=doc/toy.tsv -p=restore/updown.pth
# downloading pretrained model to restore/updown.pth
# 100% [....................] 1520029114 / 1520029114
# loading from restore/updown.pth
# ranking doc/toy.tsv
# totally processed 2 line, avg_hyp_score 0.264, top_hyp_score 0.409
# results saved to doc/toy.tsv.ranked.jsonl
python src/score.py test --data=doc/toy.tsv -p=restore/ensemble.yml

Statistics of the scoring results can be shown with the following command, e.g. for doc/toy.tsv.ensemble.jsonl

python src/score.py stats --data=doc/toy.tsv.ensemble.jsonl
#                         |best   |avg
# ----------------------------------------
#               _score    |0.339  |0.206
#        human_vs_rand    |0.928  |0.861
#     human_vs_machine    |0.575  |0.525
#               updown    |0.409  |0.264
#                depth    |0.304  |0.153
#                width    |0.225  |0.114
#                final    |0.339  |0.206
# ----------------------------------------
# n_cxt: 2
# avg n_hyp per cxt: 2.50

Use generator + ranker

Dialog generation models can be improved by integrating them with these response ranking models. For example, given the context "Can we restart 2020?", DialoGPT may return the following responses with sampling decoding (or you can try beam search by omitting --sampling). Some of them, e.g., "Yes, we can.", have a high generation probability (gen 0.496) but are less interesting (ranker 0.302). The ranker therefore places them below responses that are more likely to be upvoted, e.g. "I think we should go back to the beginning, and start from the beginning.", which is less likely to be generated (gen 0.383) but seems more interesting (ranker 0.431).

python src/generation.py play -pg=restore/medium_ft.pkl -pr=restore/updown.pth --sampling
#
# Context:        Can we restart 2020?
# 0.431 gen 0.383 ranker 0.431    I think we should go back to the beginning, and start from the beginning.
# 0.429 gen 0.227 ranker 0.429    I think I'll just sit here and wait for 2020
# 0.377 gen 0.249 ranker 0.377    Yeah, let's just start from the beginning
# 0.323 gen 0.195 ranker 0.323    I think we should just give up and let the year just pass.
# 0.304 gen 0.395 ranker 0.304    Yes. We can.
# 0.302 gen 0.496 ranker 0.302    Yes, we can.
# 0.283 gen 0.351 ranker 0.283    It's been a while since we've seen a good reboot.
# 0.174 gen 0.306 ranker 0.174    I'm up for it
# 0.168 gen 0.463 ranker 0.168    I'm down
# 0.153 gen 0.328 ranker 0.153    I think so, yes.
# ...
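
The same generate-then-rerank loop can be sketched with the transformers library alone. This is a minimal illustration, not the repo's implementation; it assumes the HuggingFace microsoft/DialoGPT-medium and microsoft/DialogRPT-updown checkpoints.

import torch
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          AutoModelForSequenceClassification)

gen_tok = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
gen = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
rank_tok = AutoTokenizer.from_pretrained("microsoft/DialogRPT-updown")
ranker = AutoModelForSequenceClassification.from_pretrained("microsoft/DialogRPT-updown")

def rank_score(context, response):
    ids = rank_tok.encode(context + "<|endoftext|>" + response, return_tensors="pt")
    with torch.no_grad():
        return torch.sigmoid(ranker(ids).logits[0, 0]).item()

context = "Can we restart 2020?"
input_ids = gen_tok.encode(context + gen_tok.eos_token, return_tensors="pt")
# Sample several candidates, then sort them by ranker score instead of
# generation probability.
outputs = gen.generate(input_ids, do_sample=True, top_k=10, max_length=64,
                       num_return_sequences=10, pad_token_id=gen_tok.eos_token_id)
hyps = [gen_tok.decode(o[input_ids.shape[-1]:], skip_special_tokens=True)
        for o in outputs]
for s, h in sorted(((rank_score(context, h), h) for h in hyps), reverse=True):
    print(f"{s:.3f}\t{h}")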

Similarly, you can use the ensemble model.

python src/generation.py play -pg=restore/medium_ft.pkl -pr=restore/ensemble.yml

To generate from a list of contexts stored in a line-separated file, provide it with --path_test and use the command below:

python src/generation.py test --path_test=path/to/list/of/contexts -pg=restore/medium_ft.pkl -pr=restore/ensemble.yml

Data

The training dataset can be built with the following script, which downloads raw data from a third-party dump and extracts comparable pairs of comments for the classification tasks.

sh data.sh
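
To make the idea of "comparable pairs" concrete: two replies to the same parent comment are comparable, and the one that received more feedback serves as the positive example. A conceptual sketch (not the repo's actual extraction code):

def comparable_pairs(replies):
    # replies: list of (response_text, feedback_score) sharing one context.
    # Yield (positive, negative) pairs where positive got strictly more feedback.
    for pos, score_pos in replies:
        for neg, score_neg in replies:
            if score_pos > score_neg:
                yield pos, neg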

Testing data can be downloaded using the command below:

wget https://xiagnlp2.blob.core.windows.net/dialogrpt/test.zip
unzip test.zip

Please check out our dataset webpage for data examples, descriptions, statistics, and more.

Training

We use DialoGPT to initialize the model. Please download it with:

wget https://convaisharables.blob.core.windows.net/lsp/multiref/medium_ft.pkl -P restore

For the human feedback prediction tasks, we specify min_score_gap and min_rank_gap to validate only on less-noisy samples (these filters are not applied to training).

python src/main.py train --data=data/out/updown -p=restore/medium_ft.pkl --min_score_gap=20 --min_rank_gap=0.5
python src/main.py train --data=data/out/depth -p=restore/medium_ft.pkl --min_score_gap=4 --min_rank_gap=0.5
python src/main.py train --data=data/out/width -p=restore/medium_ft.pkl --min_score_gap=4 --min_rank_gap=0.5
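
The intent of these two flags can be sketched as a simple filter that keeps only pairs whose feedback differs enough to be a reliable label (an illustration of the idea, not the repo's code; rank values assumed normalized to [0, 1], defaults mirror the updown command above):

def keep_for_validation(pos_score, neg_score, pos_rank, neg_rank,
                        min_score_gap=20, min_rank_gap=0.5):
    # Scores are raw feedback counts (e.g. upvotes); ranks are the
    # responses' relative positions among siblings, normalized to [0, 1].
    return (pos_score - neg_score >= min_score_gap and
            pos_rank - neg_rank >= min_rank_gap)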

For the human_vs_rand task, use the --mismatch flag to feed randomly mismatched human responses as negative examples. A previously built dataset (e.g. data/out/updown) can be reused.

python src/main.py train --data=data/out/updown -p=restore/medium_ft.pkl --mismatch
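
Conceptually, --mismatch builds negatives by pairing a context with a human response taken from a different context, along these lines (a hypothetical sketch):

import random

def mismatch_pairs(samples):
    # samples: list of (context, human_response) pairs.
    responses = [r for _, r in samples]
    for context, positive in samples:
        # Draw a random human response that does not belong to this context.
        negative = random.choice([r for r in responses if r != positive])
        yield context, positive, negative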

For the human_vs_machine task, we build the dataset by pairing each human response with a response generated by DialoGPT using top-k decoding:

python src/main.py train --data=data/out/human_vs_machine -p=restore/medium_ft.pkl

We trained all models on an Nvidia V100 4-core GPU (each core with 32GB memory) with the following hyperparameters. The checkpoint with the best validation accuracy is used as the final model.

| Argument | Value | Description |
| :--- | :---: | :--- |
| batch | 256 | total batch size for all GPUs |
| vali_size | 1024 | number of samples used for validation (i.e. dev set size) |
| lr | 3e-05 | learning rate |
| max_seq_len | 50 | max allowed sequence length; if longer, leading tokens are truncated |
| max_hr_gap | 1 | max allowed hour difference between positive and negative samples; if larger, the pair is discarded for train/vali |

Evaluation

Human feedback prediction

The performance on updown, depth, and width can be measured with the following commands, respectively. The --min_score_gap and --min_rank_gap arguments are consistent with the values used to measure validation loss during training.

python src/score.py eval_human_feedback -p=restore/updown.pth --data=test/human_feedback/updown.tsv --min_score_gap=20 --min_rank_gap=0.5
python src/score.py eval_human_feedback -p=restore/depth.pth --data=test/human_feedback/depth.tsv --min_score_gap=4 --min_rank_gap=0.5
python src/score.py eval_human_feedback -p=restore/width.pth --data=test/human_feedback/width.tsv --min_score_gap=4 --min_rank_gap=0.5

The expected pairwise accuracy on 5000 test samples is listed in the table below (from Table 5 of the paper). Note that even random guessing achieves an accuracy of 0.500.

| human feedback | updown | depth | width |
| :--- | :---: | :---: | :---: |
| Dialog ppl. | 0.488 | 0.508 | 0.513 |
| Reverse dialog ppl. | 0.560 | 0.557 | 0.571 |
| DialogRPT (ours) | 0.683 | 0.695 | 0.752 |
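
Pairwise accuracy here is simply the fraction of test pairs for which the model scores the response with higher human feedback above the other one. A sketch of the metric:

def pairwise_accuracy(pairs, score):
    # pairs: list of (context, positive_response, negative_response);
    # score: a function mapping (context, response) to a scalar.
    correct = sum(score(cxt, pos) > score(cxt, neg) for cxt, pos, neg in pairs)
    return correct / len(pairs)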

Human-like classification

  • human_vs_rand task: although the model is trained on the Reddit corpus only, we measured its zero-shot performance on several unseen corpora (Twitter, DailyDialog, and PersonaChat):
python src/score.py eval_human_vs_rand -p=restore/human_vs_rand.pth --data=test/human_vs_fake/reddit
python src/score.py eval_human_vs_rand -p=restore/human_vs_rand.pth --data=test/human_vs_fake/dailydialog
python src/score.py eval_human_vs_rand -p=restore/human_vs_rand.pth --data=test/human_vs_fake/twitter
python src/score.py eval_human_vs_rand -p=restore/human_vs_rand.pth --data=test/human_vs_fake/personachat

The expected hits@k metric on 5000 test samples is listed in the table below (from Table 7 of the paper). hits@k measures, for the same context with k positive responses and n negative responses, how many of the positive responses are in the top k of the ranked responses.

| human_vs_rand | reddit | dailydialog | twitter | personachat |
| :--- | :---: | :---: | :---: | :---: |
| BM25 | 0.309 | 0.182 | 0.178 | 0.117 |
| Dialog ppl. | 0.560 | 0.176 | 0.107 | 0.108 |
| Reverse dialog ppl. | 0.775 | 0.457 | 0.440 | 0.449 |
| ConveRT | 0.760 | 0.380 | 0.439 | 0.197 |
| DialogRPT (ours) | 0.886 | 0.621 | 0.548 | 0.479 |
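
A sketch of the hits@k metric as described above: rank all k + n responses for one context by model score and count what fraction of the k positives land in the top k.

def hits_at_k(pos_scores, neg_scores):
    # pos_scores: model scores of the k positive responses;
    # neg_scores: model scores of the n negative responses.
    k = len(pos_scores)
    threshold = sorted(pos_scores + neg_scores, reverse=True)[k - 1]
    return sum(s >= threshold for s in pos_scores) / k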

  • human_vs_machine task: its performance is evaluated on the Reddit corpus only.
python src/score.py --task=eval_human_vs_machine -p=restore/human_vs_machine.pth --data=test/human_vs_fake/reddit
# expecting accuracy ~0.98

Citation

If you use our dataset or model, please cite our paper:

@inproceedings{gao2020dialogrpt,
    title={Dialogue Response Ranking Training with Large-Scale Human Feedback Data},
    author={Xiang Gao and Yizhe Zhang and Michel Galley and Chris Brockett and Bill Dolan},
    year={2020},
    booktitle={EMNLP}
}