
💃DiscoBERT: Discourse-Aware Neural Extractive Text Summarization

Code repository for the ACL 2020 paper Discourse-Aware Neural Extractive Text Summarization.

Authors: Jiacheng Xu (University of Texas at Austin), Zhe Gan, Yu Cheng, and Jingjing Liu (Microsoft Dynamics 365 AI Research).

Contact: jcxu at cs dot utexas dot edu

Illustration

EDU Segmentation & Parsing

Here is an example of discourse segmentation and RST tree conversion.

Construction of Graphs: An Example

The proposed discourse-aware model selects EDUs 1-1, 2-1, 5-2, 20-1, 20-3, 22-1. The right side of the figure illustrates the two discourse graphs we use: (1) Coref(erence) Graph (with the mentions of 'Pulitzer prizes' highlighted as examples); and (2) RST Graph (induced by RST discourse trees).

Prerequisites

The code is based on AllenNLP (v0.9) and is developed with Python 3 and PyTorch >= 1.0. For the full list of requirements, please check requirements.txt.
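
As a quick sanity check of the environment, the following is a minimal sketch; the authoritative version pins live in requirements.txt.

```python
# Minimal environment sanity check; the authoritative pins are in requirements.txt.
import pkg_resources
import torch

print("allennlp:", pkg_resources.get_distribution("allennlp").version)  # expected 0.9.x
print("torch:", torch.__version__)                                      # expected >= 1.0
print("CUDA available:", torch.cuda.is_available())
```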

Preprocessed Dataset & Model Archive

The preprocessed CNNDM dataset, the pre-trained CNNDM model with discourse and coreference graphs, and the pre-trained NYT model with discourse and coreference graphs are provided at https://utexas.box.com/v/DiscoBERT-ACL2020.

The split of NYT is provided at data_preparation/urls_nyt/mapping_{train, valid, test}.txt.
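
The split files can be loaded straightforwardly; here is a minimal sketch that only assumes one entry per line and nothing else about the file format.

```python
# Hypothetical loader for the NYT split files; assumes one entry per line.
from pathlib import Path

split_dir = Path("data_preparation/urls_nyt")
splits = {}
for name in ("train", "valid", "test"):
    lines = (split_dir / f"mapping_{name}.txt").read_text().splitlines()
    splits[name] = [line.strip() for line in lines if line.strip()]
    print(name, len(splits[name]), "entries")
```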

Training

The model framework (training, evaluation, etc.) is based on AllenNLP (v0.9). For the usage of framework-related hyper-parameters, such as batch size, CUDA device, and number of samples per epoch, please refer to the AllenNLP documentation.

Here are some model-related hyper-parameters:

| Hyper-parameter | Type | Usage |
| :------------- | :---: | :----- |
| use_disco | bool | Use EDUs as the selection unit; if false, use sentences instead. |
| trigram_block | bool | Whether to apply trigram blocking during inference (see the sketch below the table). |
| min_pred_unit & max_pred_unit | int | The minimal and maximal number of units (EDUs or sentences) to select during inference. The typical range for selecting EDUs is [5, 8) on both CNNDM and NYT. |
| use_disco_graph | bool | Use the RST discourse graph for graph encoding. |
| use_coref | bool | Use the coreference mention graph for graph encoding. |
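
As a rough illustration of what trigram_block does, here is a sketch of standard trigram blocking (not the exact code in model/disco_bert.py): a candidate unit is skipped if it shares any trigram with the units already selected.

```python
# Sketch of trigram blocking; not the repo's exact implementation.
def _trigrams(tokens):
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def select_units(ranked_units, max_units):
    """ranked_units: lists of tokens, sorted by model score (best first)."""
    selected, seen = [], set()
    for unit in ranked_units:
        tri = _trigrams(unit)
        if tri & seen:        # shares a trigram with the summary so far -> skip
            continue
        selected.append(unit)
        seen |= tri
        if len(selected) >= max_units:
            break
    return selected
```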

Comments:

  • The hyper-parameters for the BERT encoder are almost the same as the configuration from PreSumm.
  • We inflate the number of units predicted for EDU-based models because EDUs are generally shorter than sentences. For CNNDM, we found that selecting 5 EDUs yields the best ROUGE F-1 score, whereas the sentence-based model selects 4 sentences.
  • Some vector dimension sizes are hardcoded to 768 because we use the bert-base-uncased model.
  • We also tried roberta-base instead of the bert-base-uncased model used in this repo and the paper, but empirically it did not perform better in our preliminary experiments.
  • The maximum document length is set to 768 BPEs, although we found that max_len=768 does not bring a significant gain over max_len=512 (a truncation sketch follows this list).
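
For intuition, the 768-BPE cap corresponds roughly to the following sketch, which uses the Hugging Face tokenizer for bert-base-uncased; the repo's own preprocessing pipeline may differ in detail.

```python
# Sketch of capping a document at 768 BPE (WordPiece) tokens; the repo's own
# preprocessing may differ.
from transformers import BertTokenizer

MAX_LEN = 768
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

document_text = "An example news article goes here."   # placeholder input
tokens = tokenizer.tokenize(document_text)[:MAX_LEN]
input_ids = tokenizer.convert_tokens_to_ids(tokens)
```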

To train or modify a model, there are several files to start with.

  • model/disco_bert.py is the model file. Some unused conditions and hyper-parameters whose names start with "semantic_red" remain in the code; please ignore them.
  • configs/DiscoBERT.jsonnet is the configuration file read by the AllenNLP framework. The pre-trained model section of https://utexas.box.com/v/DiscoBERT-ACL2020 also provides the configuration files we used, for reference. We adopted most of the hyper-parameters from PreSumm. A training sketch follows this list.
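
Assuming a standard AllenNLP 0.9 workflow, training can be launched programmatically roughly as follows. This is a sketch, not the authors' exact command; it assumes the local model/ directory is importable as the package model, and the serialization directory is arbitrary.

```python
# Sketch of launching training through the AllenNLP 0.9 API; hyper-parameters
# come from configs/DiscoBERT.jsonnet.
from allennlp.common.util import import_submodules
from allennlp.commands.train import train_model_from_file

import_submodules("model")  # register the custom Model / DatasetReader classes
train_model_from_file(
    parameter_filename="configs/DiscoBERT.jsonnet",
    serialization_dir="tmp/discobert_run",  # arbitrary output directory
)
```

The equivalent CLI call would be along the lines of allennlp train configs/DiscoBERT.jsonnet -s tmp/discobert_run --include-package model.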

Here is a quick reference for our model performance based on bert-base-uncased.

CNNDM:

| Model | R1/R2/RL |
| :-------------: | :-------------: |
| DiscoBERT | 43.38/20.44/40.21 |
| DiscoBERT w. RST and Coref Graphs | 43.77/20.85/40.67 |

NYT:

| Model | R1/R2/RL |
| :-------------: | :-------------: |
| DiscoBERT | 49.78/30.30/42.44 |
| DiscoBERT w. RST and Coref Graphs | 50.00/30.38/42.70 |

Citing

@inproceedings{xu-etal-2020-discourse,
    title = "Discourse-Aware Neural Extractive Text Summarization",
    author = "Xu, Jiacheng and Gan, Zhe and Cheng, Yu and Liu, Jingjing",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    year = "2020",
    publisher = "Association for Computational Linguistics"
}

Acknowledgements

  • The data preprocessing (dataset handler, oracle creation, etc.) is partially based on PreSumm by Yang Liu and Mirella Lapata.
  • Data preprocessing (tokenization, sentence split, coreference resolution etc.) used CoreNLP.
  • RST discourse segmentation is generated with NeuEDUSeg. I slightly modified the code to run on GPU. Please check my modification here.
  • RST discourse parsing is generated with DPLP. My customized version is here, featuring a batched implementation and detection of remaining unprocessed files. Empirically, I found that NeuEDUSeg provided better segmentation output than DPLP, so we use NeuEDUSeg for segmentation and DPLP for parsing.
  • The implementation of the graph module is based on DGL.