Junpliu / ConDigSum

License: MIT
Code for EMNLP 2021 paper "Topic-Aware Contrastive Learning for Abstractive Dialogue Summarization"

Programming Languages

Python
139,335 projects - #7 most used programming language
CUDA
1,817 projects

Projects that are alternatives of or similar to ConDigSum

2021-dialogue-summary-competition
[2021 Hunminjeongeum Korean Speech & Natural Language AI Competition] A repository sharing team 알라꿍달라꿍's dialogue-summarization training and inference code from the dialogue summarization track.
Stars: ✭ 86 (+38.71%)
Mutual labels:  dialogue, summarization
DocSum
A tool to automatically summarize documents abstractively using the BART or PreSumm Machine Learning Model.
Stars: ✭ 58 (-6.45%)
Mutual labels:  bart, summarization
Html2article
Extracts the main article text from HTML web pages.
Stars: ✭ 441 (+611.29%)
Mutual labels:  topic
Kafka Monitor
Xinfra Monitor monitors the availability of Kafka clusters by producing synthetic workloads using end-to-end pipelines to obtain derived vital statistics - E2E latency, service produce/consume availability, offsets commit availability & latency, message loss rate and more.
Stars: ✭ 1,817 (+2830.65%)
Mutual labels:  topic
Kafkawize
Kafkawize : A Self service Apache Kafka Topic Management tool/portal. A Web application which automates the process of creating and browsing Kafka topics, acls, schemas by introducing roles/authorizations to users of various teams of an org.
Stars: ✭ 79 (+27.42%)
Mutual labels:  topic
Swoole Jobs
🚀 Dynamic multi-process worker queue based on Swoole; like Gearman but with higher performance.
Stars: ✭ 574 (+825.81%)
Mutual labels:  topic
Kafka Visualizer
A web client for visualizing your Apache Kafka topics live.
Stars: ✭ 98 (+58.06%)
Mutual labels:  topic
Logi Kafkamanager
A one-stop platform for Apache Kafka cluster metric monitoring and operations management.
Stars: ✭ 3,280 (+5190.32%)
Mutual labels:  topic
Ldagibbssampling
Open Source Package for Gibbs Sampling of LDA
Stars: ✭ 218 (+251.61%)
Mutual labels:  topic
Hackerqueue
Your favorite tech sites compiled down to topics you find interesting.
Stars: ✭ 55 (-11.29%)
Mutual labels:  topic
Weibo Topic Spider
A Weibo super-topic crawler: word-frequency statistics + sentiment analysis + simple classification, with newly added crawled data from the pneumonia (COVID-19) super-topic.
Stars: ✭ 128 (+106.45%)
Mutual labels:  topic
Ieml
IEML semantic language - a meaning-representation system based on semantic primitives and a regular grammar. Basic semantic relationships between concepts are automatically computed from syntactic similarities.
Stars: ✭ 41 (-33.87%)
Mutual labels:  topic
Bertopic
Leveraging BERT and c-TF-IDF to create easily interpretable topics.
Stars: ✭ 745 (+1101.61%)
Mutual labels:  topic
Qcloud Iot Sdk Embedded C
SDK for connecting to Tencent Cloud IoT from a device using embedded C.
Stars: ✭ 109 (+75.81%)
Mutual labels:  topic
Wsify
Just a tiny, simple and real-time self-hosted pub/sub messaging service
Stars: ✭ 452 (+629.03%)
Mutual labels:  topic
Proposal Smart Pipelines
Old archived draft proposal for smart pipelines. Go to the new Hack-pipes proposal at js-choi/proposal-hack-pipes.
Stars: ✭ 177 (+185.48%)
Mutual labels:  topic
Hacker News Digest
📰 A responsive interface of Hacker News with summaries and thumbnails.
Stars: ✭ 278 (+348.39%)
Mutual labels:  topic
Presentations
Holds and organizes all past, present, and future presentations at the meetup
Stars: ✭ 30 (-51.61%)
Mutual labels:  topic
Edamontology
EDAM is an ontology of bioinformatics types of data including identifiers, data formats, operations and topics.
Stars: ✭ 80 (+29.03%)
Mutual labels:  topic
SRB
Code for "Improving Semantic Relevance for Sequence-to-Sequence Learning of Chinese Social Media Text Summarization"
Stars: ✭ 41 (-33.87%)
Mutual labels:  summarization

Topic-Aware Contrastive Learning for Abstractive Dialogue Summarization

Junpeng Liu, Yanyan Zou, Hainan Zhang, Hongshen Chen, Zhuoye Ding, Caixia Yuan, Xiaojie Wang. Findings of EMNLP 2021. Paper

Requirements and Installation

Conda is highly recommended to manage your Python environment.

pip install --editable ./
pip install requests rouge==1.0.0
pip install transformers==4.4.0 bert-score==0.3.8
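The pinned versions above matter for reproducing the reported scores. As a quick sanity check before training, the pins can be verified at runtime with a stdlib-only sketch (`check_requirements` is a hypothetical helper, not part of this repository):

```python
from importlib.metadata import PackageNotFoundError, version

def check_requirements(reqs):
    """Return (name, pinned, found) triples for every requirement that is
    missing or installed at a version other than the pinned one."""
    problems = []
    for name, pinned in reqs:
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems.append((name, pinned, "not installed"))
            continue
        if installed != pinned:
            problems.append((name, pinned, installed))
    return problems

# The versions pinned by the install commands above.
PINNED = [("rouge", "1.0.0"), ("transformers", "4.4.0"), ("bert-score", "0.3.8")]
```

Calling `check_requirements(PINNED)` should return an empty list when the environment matches the pins; anything else lists what to fix.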

Training ConDigSum model

Before training ConDigSum, please download BART-Large from here, and set PRETRAIN_PATH in the training scripts to the path of the downloaded model.pt.

For the SAMSum and MediaSum datasets, you can download the preprocessed data files directly (SAMSum, MediaSum); extracting them yields train_sh/SAMSumInd/ and train_sh/mediasum/.

Change to the working directory and download the GPT-2 BPE files:

cd train_sh
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'

SAMSum dataset

# SAMSum
./train_samsum.sh [training_comment] [gpu_id]

MediaSum dataset

# MediaSum
./train_mediasum.sh [training_comment] [gpu_id]

Custom dataset

To facilitate training on custom datasets, a demo dataset is provided in the train_sh/customdata/ directory; prepare your own data files following the format of its *.jsonl files. Then run the pre-processing steps:

./bpe.sh
./binarize.sh
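For reference, the JSON Lines layout consumed by such preprocessing is one JSON object per line, with no surrounding list. A minimal sketch of writing and reading such a file follows; the field names `dialogue` and `summary` are illustrative assumptions only, so mirror the actual keys used in the demo *.jsonl files under train_sh/customdata/:

```python
import json

# Hypothetical field names for illustration only -- copy the keys
# from the demo files in train_sh/customdata/ for real training data.
examples = [
    {"dialogue": "A: Lunch at noon? B: Sure, see you then.",
     "summary": "A and B agree to meet for lunch at noon."},
]

# JSON Lines: one JSON object per line.
with open("customdata_demo.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# Reading it back, line by line.
with open("customdata_demo.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```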

Testing ConDigSum model

Downloading pretrained ConDigSum models

Pretrained models and predictions are provided on Google Drive: SAMSum, MediaSum. After downloading, you will have train_sh/SAMSum.condigsum/checkpoint_best.pt and train_sh/MediaSum.condigsum/checkpoint_best.pt.

Evaluating models

# dataname=SAMSumInd or dataname=mediasum or dataname=customdata
# checkpoint_dir=SAMSum.condigsum or checkpoint_dir=MediaSum.condigsum

# generate predictions
cd train_sh
CUDA_VISIBLE_DEVICES=${GPU} python ./test.py --log_dir ${checkpoint_dir} --dataset ${dataname}

# compute ROUGE scores with files2rouge
files2rouge ${dataname}/test.target ${checkpoint_dir}/test.hypo

# compute BERTScore
CUDA_VISIBLE_DEVICES=${GPU} bert-score -r ${dataname}/test.target -c ${checkpoint_dir}/test.hypo --lang en --rescale_with_baseline
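files2rouge and bert-score produce the official numbers. For a rough sanity check on a single hypothesis/reference pair, ROUGE-1 F1 reduces to unigram overlap; the following simplified stdlib sketch (whitespace tokens, no stemming, no bootstrap resampling, so values will differ slightly from files2rouge) shows the computation:

```python
from collections import Counter

def rouge1_f1(hypothesis, reference):
    """Unigram-overlap ROUGE-1 F1 (simplified: lowercased whitespace
    tokens, no stemming, no stopword handling)."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge1_f1("the cat sat", "the cat")` matches two of three hypothesis tokens and both reference tokens, giving F1 = 0.8.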

Citation

@inproceedings{liu-etal-2021-topic-aware,
    title = "Topic-Aware Contrastive Learning for Abstractive Dialogue Summarization",
    author = "Liu, Junpeng  and
      Zou, Yanyan  and
      Zhang, Hainan  and
      Chen, Hongshen  and
      Ding, Zhuoye  and
      Yuan, Caixia  and
      Wang, Xiaojie",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.106",
    doi = "10.18653/v1/2021.findings-emnlp.106",
    pages = "1229--1243",
    abstract = "Unlike well-structured text, such as news reports and encyclopedia articles, dialogue content often comes from two or more interlocutors, exchanging information with each other. In such a scenario, the topic of a conversation can vary upon progression and the key information for a certain topic is often scattered across multiple utterances of different speakers, which poses challenges to abstractly summarize dialogues. To capture the various topic information of a conversation and outline salient facts for the captured topics, this work proposes two topic-aware contrastive learning objectives, namely coherence detection and sub-summary generation objectives, which are expected to implicitly model the topic change and handle information scattering challenges for the dialogue summarization task. The proposed contrastive objectives are framed as auxiliary tasks for the primary dialogue summarization task, united via an alternative parameter updating strategy. Extensive experiments on benchmark datasets demonstrate that the proposed simple method significantly outperforms strong baselines and achieves new state-of-the-art performance. The code and trained models are publicly available via .",
}

MISC

  1. To install Files2ROUGE on CentOS, you may need to install the following Perl dependencies first:
yum install -y "perl(XML::Parser)"
yum install -y "perl(XML::LibXML)"
yum install -y "perl(DB_File)"