HHousen / DocSum

License: GPL-3.0
A tool to automatically summarize documents abstractively using the BART or PreSumm Machine Learning Model.

Programming Languages

Python, Jupyter Notebook

Projects that are alternatives of or similar to DocSum

PlanSum
[AAAI2021] Unsupervised Opinion Summarization with Content Planning
Stars: ✭ 25 (-56.9%)
Mutual labels:  text-summarization, summarization, abstractive-text-summarization, abstractive-summarization
gazeta
Gazeta: Dataset for automatic summarization of Russian news
Stars: ✭ 25 (-56.9%)
Mutual labels:  text-summarization, summarization, abstractive-text-summarization, abstractive-summarization
Text-Summarization
Abstractive and Extractive Text summarization using Transformers.
Stars: ✭ 38 (-34.48%)
Mutual labels:  transformers, bart, text-summarization, abstractive-summarization
Entity2Topic
[NAACL2018] Entity Commonsense Representation for Neural Abstractive Summarization
Stars: ✭ 20 (-65.52%)
Mutual labels:  text-summarization, summarization, abstractive-summarization
xl-sum
This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
Stars: ✭ 160 (+175.86%)
Mutual labels:  text-summarization, abstractive-text-summarization, abstractive-summarization
Copycat-abstractive-opinion-summarizer
ACL 2020 Unsupervised Opinion Summarization as Copycat-Review Generation
Stars: ✭ 76 (+31.03%)
Mutual labels:  summarization, abstractive-text-summarization, abstractive-summarization
TextRank-node
No description or website provided.
Stars: ✭ 21 (-63.79%)
Mutual labels:  text-summarization, summarization
Transformersum
Models to perform neural summarization (extractive and abstractive) using machine learning transformers and a tool to convert abstractive summarization datasets to the extractive task.
Stars: ✭ 107 (+84.48%)
Mutual labels:  text-summarization, summarization
data-summ-cnn dailymail
non-anonymized cnn/dailymail dataset for text summarization
Stars: ✭ 12 (-79.31%)
Mutual labels:  summarization, abstractive-text-summarization
awesome-text-summarization
Text summarization starting from scratch.
Stars: ✭ 86 (+48.28%)
Mutual labels:  text-summarization, abstractive-summarization
seq3
Source code for the NAACL 2019 paper "SEQ^3: Differentiable Sequence-to-Sequence-to-Sequence Autoencoder for Unsupervised Abstractive Sentence Compression"
Stars: ✭ 121 (+108.62%)
Mutual labels:  summarization, abstractive-summarization
text2text
Text2Text: Cross-lingual natural language processing and generation toolkit
Stars: ✭ 188 (+224.14%)
Mutual labels:  transformers, summarization
Text summarization with tensorflow
Implementation of a seq2seq model for summarization of textual data. Demonstrated on amazon reviews, github issues and news articles.
Stars: ✭ 226 (+289.66%)
Mutual labels:  text-summarization, summarization
nlp-akash
Natural Language Processing notes and implementations.
Stars: ✭ 66 (+13.79%)
Mutual labels:  text-summarization, summarization
ConDigSum
Code for EMNLP 2021 paper "Topic-Aware Contrastive Learning for Abstractive Dialogue Summarization"
Stars: ✭ 62 (+6.9%)
Mutual labels:  bart, summarization
Textrank
TextRank implementation for Python 3.
Stars: ✭ 1,008 (+1637.93%)
Mutual labels:  text-summarization, summarization
Pythonrouge
Python wrapper for evaluating summarization quality by ROUGE package
Stars: ✭ 155 (+167.24%)
Mutual labels:  text-summarization, summarization
Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+5777.59%)
Mutual labels:  transformers, summarization
factsumm
FactSumm: Factual Consistency Scorer for Abstractive Summarization
Stars: ✭ 83 (+43.1%)
Mutual labels:  summarization, abstractive-summarization
long-short-transformer
Implementation of Long-Short Transformer, combining local and global inductive biases for attention over long sequences, in Pytorch
Stars: ✭ 103 (+77.59%)
Mutual labels:  transformers

DocSum Logo

DocSum

A tool to automatically summarize documents (or plain text) using either the BART or PreSumm Machine Learning Model.

Open In Colab

BART (BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension) is the state-of-the-art in text summarization as of 02/02/2020. It is a "sequence-to-sequence model trained with denoising as pretraining objective" (Documentation & Examples).
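
As a point of reference, here is a minimal sketch of abstractive summarization with a pretrained BART checkpoint via the Hugging Face transformers library. The checkpoint name and generation settings below are illustrative assumptions, not DocSum's exact defaults (DocSum's own wrapper lives in bart_sum.py):

from transformers import BartForConditionalGeneration, BartTokenizer

# Illustrative sketch: summarize a string with a CNN/DailyMail-finetuned
# BART checkpoint. Checkpoint and generation parameters are assumptions,
# not DocSum's defaults.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

inputs = tokenizer("Long document text to summarize ...", return_tensors="pt",
                   max_length=1024, truncation=True)
summary_ids = model.generate(inputs["input_ids"], num_beams=4,
                             min_length=56, max_length=142, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))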

PreSumm (Text Summarization with Pretrained Encoders) applies BERT (Bidirectional Encoder Representations from Transformers) to text summarization by using "a novel document-level encoder based on BERT which is able to express the semantics of a document and obtain representations for its sentences." BERT represented "the latest incarnation of pretrained language models which have recently advanced a wide range of natural language processing tasks" at the time of writing (Documentation & Examples).

Tasks

  1. Convert a PDF to XML and then interpret that XML file using the font property of each text element with main.py. This uses the xml.etree.ElementTree Python library (see the sketch after this list).
  2. Summarize raw text input using cmd_summarizer.py. You can run this in Google Colaboratory by clicking this button: Open In Colab
  3. Summarize multiple text files using presumm/run_summarization.py.
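
To illustrate task 1, here is a minimal sketch of the font-based filtering idea, assuming the XML layout that pdftohtml emits (one <text> element per line, each carrying a font attribute). The font IDs below are hypothetical placeholders:

import xml.etree.ElementTree as ET

# Collect the text of every element whose "font" attribute matches one of
# the body-text font IDs (the values passed to -bf in main.py).
tree = ET.parse("output.xml")
body_font_ids = {"11", "132"}  # hypothetical font IDs

body_lines = []
for element in tree.getroot().iter("text"):
    if element.get("font") in body_font_ids:
        # itertext() flattens inline children such as <b> and <i>
        body_lines.append("".join(element.itertext()).strip())

print(" ".join(body_lines))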

Getting Started

These instructions will get you a copy of the project up and running on your local machine.

Prerequisites

Conda, to create the project environment from environment.yml, and poppler-utils if you want to convert PDFs to XML (Linux only; see Notes).

Installation

sudo apt install poppler-utils
git clone https://github.com/HHousen/docsum.git
cd docsum
conda env create --file environment.yml
conda activate docsum

To convert a PDF to XML:

pdftohtml input.pdf -i -s -c -xml output.xml
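
If you prefer to run the conversion from Python, a small wrapper such as the following works (an illustrative sketch, not part of DocSum; it assumes pdftohtml is on your PATH):

import subprocess

# Run the same pdftohtml conversion as above; raise if it fails.
subprocess.run(
    ["pdftohtml", "input.pdf", "-i", "-s", "-c", "-xml", "output.xml"],
    check=True,
)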

Project Structure

DocSum
├── bart_sum.py
├── cmd_summarizer.py
├── docsum.png
├── environment.yml
├── LICENSE
├── main.py
├── presumm
│   ├── configuration_bertabs.py
│   ├── __init__.py
│   ├── modeling_bertabs.py
│   ├── presumm.py
│   ├── run_summarization.py
│   └── utils_summarization.py
├── README.md
└── xml_processor.py

Usage

Output of python main.py --help:

usage: main.py [-h] [-t {pdf,xml}] [-m {bart,presumm}] [--bart_checkpoint PATH] [--bart_state_dict_key PATH] [--bart_fairseq] -cf N [N ...]
               -bhf N [N ...] -bf N [N ...] [-ns] [--output_xml_path PATH] [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
               PATH

Summarization of PDFs using BART

positional arguments:
  PATH                  path to input file

optional arguments:
  -h, --help            show this help message and exit
  -t {pdf,xml}, --file_type {pdf,xml}
                        type of file to summarize
  -m {bart,presumm}, --model {bart,presumm}
                        machine learning model choice
  --bart_checkpoint PATH
                        [BART Only] Path to optional checkpoint. Semsim is a better model but will use more memory and is an additional 5GB
                        download. (default: none, recommended: semsim)
  --bart_state_dict_key PATH
                        [BART Only] model state_dict key to load from pickle file specified with --bart_checkpoint (default: "model")
  --bart_fairseq        [BART Only] Use fairseq model from torch hub instead of huggingface transformers library models. Cannot use
                        --bart_checkpoint if this option is supplied.
  -cf N [N ...], --chapter_heading_font N [N ...]
                        font of chapter titles
  -bhf N [N ...], --body_heading_font N [N ...]
                        font of headings within chapter
  -bf N [N ...], --body_font N [N ...]
                        font of body (the text you want to summarize)
  -ns, --no_summarize   do not run the summarization step
  --output_xml_path PATH
                        path to output XML file if `file_type` is `pdf`
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level (default: 'Info').

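For example, a typical invocation that converts a PDF and summarizes the body text with BART might look like the following (the font IDs here are hypothetical; see the PDF Structure section for how to find yours):

python main.py book.pdf --file_type pdf -m bart -cf 5 -bhf 23 -bf 11
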
Output of python cmd_summarizer.py --help

usage: cmd_summarizer.py [-h] -m {bart,presumm} [--bart_checkpoint PATH] [--bart_state_dict_key PATH] [--bart_fairseq]
                         [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Summarization of text using CMD prompt

optional arguments:
  -h, --help            show this help message and exit
  -m {bart,presumm}, --model {bart,presumm}
                        machine learning model choice
  --bart_checkpoint PATH
                        [BART Only] Path to optional checkpoint. Semsim is a better model but will use more memory and is an additional 5GB
                        download. (default: none, recommended: semsim)
  --bart_state_dict_key PATH
                        [BART Only] model state_dict key to load from pickle file specified with --bart_checkpoint (default: "model")
  --bart_fairseq        [BART Only] Use fairseq model from torch hub instead of huggingface transformers library models. Cannot use
                        --bart_checkpoint if this option is supplied.
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level (default: 'Info').
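
For example, to summarize text entered at the command prompt using BART (the model choice is the only required option):

python cmd_summarizer.py -m bart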

Output of python -m presumm.run_summarization --help

usage: run_summarization.py [-h] --documents_dir DOCUMENTS_DIR [--summaries_output_dir SUMMARIES_OUTPUT_DIR] [--compute_rouge COMPUTE_ROUGE]
                            [--no_cuda NO_CUDA] [--batch_size BATCH_SIZE] [--min_length MIN_LENGTH] [--max_length MAX_LENGTH]
                            [--beam_size BEAM_SIZE] [--alpha ALPHA] [--block_trigram BLOCK_TRIGRAM]

optional arguments:
  -h, --help            show this help message and exit
  --documents_dir DOCUMENTS_DIR
                        The folder where the documents to summarize are located.
  --summaries_output_dir SUMMARIES_OUTPUT_DIR
                        The folder in which the summaries should be written. Defaults to the folder where the documents are.
  --compute_rouge COMPUTE_ROUGE
                        Compute the ROUGE metrics during evaluation. Only available for the CNN/DailyMail dataset.
  --no_cuda NO_CUDA     Whether to force the execution on CPU.
  --batch_size BATCH_SIZE
                        Batch size per GPU/CPU for training.
  --min_length MIN_LENGTH
                        Minimum number of tokens for the summaries.
  --max_length MAX_LENGTH
                        Maximum number of tokens for the summaries.
  --beam_size BEAM_SIZE
                        The number of beams to start with for each example.
  --alpha ALPHA         The value of alpha for the length penalty in the beam search.
  --block_trigram BLOCK_TRIGRAM
                        Whether to block the existence of repeating trigrams in the text generated by beam search.
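
For example (docs/ and summaries/ are placeholder directory names; every text file in docs/ is summarized and the results written to summaries/):

python -m presumm.run_summarization --documents_dir docs/ --summaries_output_dir summaries/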

Notes

  • --file_type pdf is only available on Linux and requires poppler-utils to be installed.

PDF Structure

PDFs must be formatted in a specific way for this program to function. This program works with two levels of headings: chapter headings and body headings. Chapter headings contain many body headings and each body heading contains many lines of body text. If your PDF file is organized in this way and you can find unique font styles in the XML representation, then this program should work.

Sometimes italics or other stylistic fonts may be represented by separate font numbers. If this is the case, simply run the command and pass in multiple font styles: python main.py book.xml -cf 5 50 -bhf 23 34 60 -bf 11 132.
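
One way to find the font numbers is to inspect the fontspec declarations in the converted XML. A minimal sketch, assuming pdftohtml's XML output format:

import xml.etree.ElementTree as ET

# List every font pdftohtml declared, so you can match IDs to the
# chapter-heading, body-heading, and body fonts by size and family.
tree = ET.parse("output.xml")
for spec in tree.getroot().iter("fontspec"):
    print(spec.get("id"), spec.get("size"), spec.get("family"))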

Meta

Hayden Housen – haydenhousen.com

Distributed under the GPLv3 license. See the LICENSE file for more information.

https://github.com/HHousen

The PreSumm code is extensively borrowed from the Hugging Face Transformers library.

Contributing

All pull requests are welcome.

Questions? Comments? Issues? Don't hesitate to open an issue and briefly describe what you are experiencing (include any error logs if necessary). Thanks.

  1. Fork it (https://github.com/HHousen/docsum/fork)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Create a new Pull Request

To Do

  • Make DocSum more robust to different PDF types (multi-layered headings)
  • Implement other summarization techniques
  • Implement automatic header detection (Possibly this paper)