
💃DiscoBERT: Discourse-Aware Neural Extractive Text Summarization

Code repository for the ACL 2020 paper Discourse-Aware Neural Extractive Text Summarization.

Authors: Jiacheng Xu (University of Texas at Austin), Zhe Gan, Yu Cheng, and Jingjing Liu (Microsoft Dynamics 365 AI Research).

Contact: jcxu at cs dot utexas dot edu

Illustration

EDU Segmentation & Parsing

Here is an example of discourse segmentation and RST tree conversion.

Construction of Graphs: An Example

The proposed discourse-aware model selects EDUs 1-1, 2-1, 5-2, 20-1, 20-3, 22-1. The right side of the figure illustrates the two discourse graphs we use: (1) Coref(erence) Graph (with the mentions of 'Pulitzer prizes' highlighted as examples); and (2) RST Graph (induced by RST discourse trees).

Prerequisites

The code is based on AllenNLP (v0.9) and is developed with Python 3 and PyTorch >= 1.0. For the full list of requirements, please check requirements.txt.
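
As a quick sanity check of the environment, the following is a minimal sketch; the authoritative version pins live in requirements.txt.

```python
# Minimal environment sanity check; the authoritative pins are in requirements.txt.
import pkg_resources
import torch

print("allennlp:", pkg_resources.get_distribution("allennlp").version)  # expected 0.9.x
print("torch:", torch.__version__)                                      # expected >= 1.0
print("CUDA available:", torch.cuda.is_available())
```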

Preprocessed Dataset & Model Archive

The preprocessed CNNDM dataset, the pre-trained CNNDM model with discourse and coreference graphs, and the pre-trained NYT model with discourse and coreference graphs are provided at https://utexas.box.com/v/DiscoBERT-ACL2020.

The split of NYT is provided at data_preparation/urls_nyt/mapping_{train, valid, test}.txt.
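
The split files can be loaded straightforwardly; here is a minimal sketch that only assumes one entry per line and nothing else about the file format.

```python
# Hypothetical loader for the NYT split files; assumes one entry per line.
from pathlib import Path

split_dir = Path("data_preparation/urls_nyt")
splits = {}
for name in ("train", "valid", "test"):
    lines = (split_dir / f"mapping_{name}.txt").read_text().splitlines()
    splits[name] = [line.strip() for line in lines if line.strip()]
    print(name, len(splits[name]), "entries")
```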

Training

The model framework (training, evaluation, etc.) is based on AllenNLP (v0.9). For the usage of framework-related hyper-parameters, such as batch size, CUDA device, and number of samples per epoch, please refer to the AllenNLP documentation.

Here are some model-related hyper-parameters:

| Hyper-parameter | Type | Usage |
| :------------- | :---: | :----- |
| use_disco | bool | Use EDUs as the selection unit; if false, use sentences instead. |
| trigram_block | bool | Whether to apply trigram blocking during inference (see the sketch below the table). |
| min_pred_unit & max_pred_unit | int | The minimal and maximal number of units (EDUs or sentences) to select during inference. The typical range for selecting EDUs is [5, 8) on both CNNDM and NYT. |
| use_disco_graph | bool | Use the RST discourse graph for graph encoding. |
| use_coref | bool | Use the coreference mention graph for graph encoding. |
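
As a rough illustration of what trigram_block does, here is a sketch of standard trigram blocking (not the exact code in model/disco_bert.py): a candidate unit is skipped if it shares any trigram with the units already selected.

```python
# Sketch of trigram blocking; not the repo's exact implementation.
def _trigrams(tokens):
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def select_units(ranked_units, max_units):
    """ranked_units: lists of tokens, sorted by model score (best first)."""
    selected, seen = [], set()
    for unit in ranked_units:
        tri = _trigrams(unit)
        if tri & seen:        # shares a trigram with the summary so far -> skip
            continue
        selected.append(unit)
        seen |= tri
        if len(selected) >= max_units:
            break
    return selected
```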

Comments:

  • The hyper-parameters for the BERT encoder are almost the same as the configuration from PreSumm.
  • We inflate the number of units predicted for EDU-based models because EDUs are generally shorter than sentences. For CNNDM, we found that selecting 5 EDUs yields the best ROUGE F-1 score, whereas the sentence-based model selects 4 sentences.
  • Some vector dimension sizes are hardcoded to 768 because we use the bert-base-uncased model.
  • We also tried roberta-base instead of the bert-base-uncased model used in this repo and the paper, but empirically it did not perform better in our preliminary experiments.
  • The maximum document length is set to 768 BPEs, although we found that max_len=768 does not bring a significant gain over max_len=512 (a truncation sketch follows this list).
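
For intuition, the 768-BPE cap corresponds roughly to the following sketch, which uses the Hugging Face tokenizer for bert-base-uncased; the repo's own preprocessing pipeline may differ in detail.

```python
# Sketch of capping a document at 768 BPE (WordPiece) tokens; the repo's own
# preprocessing may differ.
from transformers import BertTokenizer

MAX_LEN = 768
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

document_text = "An example news article goes here."   # placeholder input
tokens = tokenizer.tokenize(document_text)[:MAX_LEN]
input_ids = tokenizer.convert_tokens_to_ids(tokens)
```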

To train or modify a model, there are several files to start with.

  • model/disco_bert.py is the model file. Some unused conditions and hyper-parameters whose names start with "semantic_red" remain in the code; please ignore them.
  • configs/DiscoBERT.jsonnet is the configuration file read by the AllenNLP framework. The pre-trained model section of https://utexas.box.com/v/DiscoBERT-ACL2020 also provides the configuration files we used, for reference. We adopted most of the hyper-parameters from PreSumm. A training sketch follows this list.
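
Assuming a standard AllenNLP 0.9 workflow, training can be launched programmatically roughly as follows. This is a sketch, not the authors' exact command; it assumes the local model/ directory is importable as the package model, and the serialization directory is arbitrary.

```python
# Sketch of launching training through the AllenNLP 0.9 API; hyper-parameters
# come from configs/DiscoBERT.jsonnet.
from allennlp.common.util import import_submodules
from allennlp.commands.train import train_model_from_file

import_submodules("model")  # register the custom Model / DatasetReader classes
train_model_from_file(
    parameter_filename="configs/DiscoBERT.jsonnet",
    serialization_dir="tmp/discobert_run",  # arbitrary output directory
)
```

The equivalent CLI call would be along the lines of allennlp train configs/DiscoBERT.jsonnet -s tmp/discobert_run --include-package model.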

Here is a quick reference for our model performance based on bert-base-uncased.

CNNDM:

| Model | R1/R2/RL |
| :-------------: | :-------------: |
| DiscoBERT | 43.38/20.44/40.21 |
| DiscoBERT w. RST and Coref Graphs | 43.77/20.85/40.67 |

NYT:

| Model | R1/R2/RL |
| :-------------: | :-------------: |
| DiscoBERT | 49.78/30.30/42.44 |
| DiscoBERT w. RST and Coref Graphs | 50.00/30.38/42.70 |

Citing

@inproceedings{xu-etal-2020-discourse,
    title = "Discourse-Aware Neural Extractive Text Summarization",
    author = "Xu, Jiacheng and Gan, Zhe and Cheng, Yu and Liu, Jingjing",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    year = "2020",
    publisher = "Association for Computational Linguistics"
}

Acknowledgements

  • The data preprocessing (dataset handler, oracle creation, etc.) is partially based on PreSumm by Yang Liu and Mirella Lapata.
  • Data preprocessing (tokenization, sentence split, coreference resolution etc.) used CoreNLP.
  • RST discourse segmentation is generated with NeuEDUSeg. I slightly modified the code to run on GPU. Please check my modification here.
  • RST discourse parsing is generated with DPLP. My customized version is here, featuring a batched implementation and detection of remaining unprocessed files. Empirically, I found that NeuEDUSeg provided better segmentation output than DPLP, so we use NeuEDUSeg for segmentation and DPLP for parsing.
  • The implementation of the graph module is based on DGL.