
pkouris / abtextsum

Licence: other
Abstractive text summarization based on deep learning and semantic content generalization

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to, or similar to, abtextsum

NLP-Extractive-NEWS-summarization-using-MMR
A simple python implementation of the Maximal Marginal Relevance (MMR) baseline system for text summarization.
Stars: ✭ 59 (+321.43%)
Mutual labels:  text-summarization
DocSum
A tool to automatically summarize documents abstractively using the BART or PreSumm Machine Learning Model.
Stars: ✭ 58 (+314.29%)
Mutual labels:  text-summarization
NLP Toolkit
Library of state-of-the-art models (PyTorch) for NLP tasks
Stars: ✭ 92 (+557.14%)
Mutual labels:  text-summarization
Persian-Summarization
Statistical and Semantical Text Summarizer in Persian Language
Stars: ✭ 38 (+171.43%)
Mutual labels:  text-summarization
allsummarizer
Multilingual automatic text summarizer using statistical approach and extraction
Stars: ✭ 28 (+100%)
Mutual labels:  text-summarization
Text-Summarization-Repo
A repository that organizes the main research topics in text summarization, must-read papers, and available models and data, together with recommended resources.
Stars: ✭ 213 (+1421.43%)
Mutual labels:  text-summarization
email-summarization
A module for E-mail Summarization which uses clustering of skip-thought sentence embeddings.
Stars: ✭ 81 (+478.57%)
Mutual labels:  text-summarization
TextSummarizer
TextRank implementation for C#
Stars: ✭ 29 (+107.14%)
Mutual labels:  text-summarization
nlp-akash
Natural Language Processing notes and implementations.
Stars: ✭ 66 (+371.43%)
Mutual labels:  text-summarization
Entity2Topic
[NAACL2018] Entity Commonsense Representation for Neural Abstractive Summarization
Stars: ✭ 20 (+42.86%)
Mutual labels:  text-summarization
Bidirectiona-LSTM-for-text-summarization-
A bidirectional encoder-decoder LSTM neural network is trained for text summarization on the cnn/dailymail dataset. (MIT808 project)
Stars: ✭ 73 (+421.43%)
Mutual labels:  text-summarization
Text-Summarization
Abstractive and Extractive Text summarization using Transformers.
Stars: ✭ 38 (+171.43%)
Mutual labels:  text-summarization
PlanSum
[AAAI2021] Unsupervised Opinion Summarization with Content Planning
Stars: ✭ 25 (+78.57%)
Mutual labels:  text-summarization
Sumrized
Automatic Text Summarization (English/Arabic).
Stars: ✭ 37 (+164.29%)
Mutual labels:  text-summarization
gazeta
Gazeta: Dataset for automatic summarization of Russian news
Stars: ✭ 25 (+78.57%)
Mutual labels:  text-summarization
Scripts-for-extractive-summarization
Scripts for an upcoming blog "Extractive vs. Abstractive Summarization" for RaRe Technologies.
Stars: ✭ 12 (-14.29%)
Mutual labels:  text-summarization
Intelligent Document Finder
Document Search Engine Tool
Stars: ✭ 45 (+221.43%)
Mutual labels:  text-summarization
summarize-webpage
A small NLP SaaS project that summarizes a webpage
Stars: ✭ 34 (+142.86%)
Mutual labels:  text-summarization
Brief
In a nutshell, this is a Text Summarizer
Stars: ✭ 29 (+107.14%)
Mutual labels:  text-summarization
xl-sum
This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
Stars: ✭ 160 (+1042.86%)
Mutual labels:  text-summarization

abtextsum

This source code has been used in the experimental procedure of the following paper:

Panagiotis Kouris, Georgios Alexandridis, Andreas Stafylopatis. 2019. Abstractive text summarization based on deep learning and semantic content generalization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5082-5092.

The paper is available in the Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019) or directly from here.


For citation, the following BibTeX entry may be used:

@inproceedings{kouris2019abstractive,
  title = {Abstractive text summarization based on deep learning and semantic content generalization},
  author = {Kouris, Panagiotis and Alexandridis, Georgios and Stafylopatis, Andreas},
  booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
  month = jul,
  year = {2019},
  address = {Florence, Italy},
  publisher = {Association for Computational Linguistics},
  url = {https://www.aclweb.org/anthology/P19-1501},
  pages = {5082--5092},
}



Code Description

The code described below follows the methodology and the assumptions that are described in detail in the aforementioned paper. The experimental procedure requires the Gigaword dataset, as described by Rush et al. (2015) (see the references in the paper), for training, validation and testing. The DUC 2004 dataset is additionally used for testing, as also described in the paper.
According to the paper, the initial dataset is further preprocessed and generalized according to one of the proposed text generalization strategies (e.g. NEG100 or LG200d5). The generalized dataset is then used for training, where the deep learning model learns to predict a generalized summary.
In the testing phase, a generalized article (e.g. an article of the test set) is given as input to the deep learning model, which predicts the respective generalized summary. Then, in the post-processing phase, the generalized concepts of the generalized summary are replaced by the specific concepts of the original (preprocessed) article, producing the final summary.
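
To make this data flow concrete, the following minimal Python sketch is purely illustrative (the sentence, the entity tags and the variable names are invented and are not taken from the paper or the dataset); it shows how a NEG-style generalization and the subsequent post-processing act on a toy example:

  # Toy illustration of the pipeline's data flow (illustrative values only).
  original_article    = "george visited paris on monday"
  generalized_article = "person visited location on monday"   # named entities replaced by their type
  predicted_summary   = "person visited location"             # output of the deep learning model
  # Post-processing maps each generalized concept back to the specific
  # concept of the original article, yielding the final summary:
  final_summary       = "george visited paris"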

The workflow of this framework is as follows:

  1. Preprocessing of the dataset
    The preprocessing of the dataset is performed by the DataPreprocessing class (preprocessing.py file). The clean_dataset() method is used for preprocessing the Gigaword dataset, while the clean_duc_dataset_from_original_to_cleaned() method is used for the DUC dataset.

  2. Text generalization
    Both text generalization tasks, NEG and LG, are performed by the DataPreprocessing class (preprocessing.py file).
    Firstly, part-of-speech tagging is required, which is performed by the pos_tagging_of_dataset_and_vocabulary_of_words_pos_frequent() method for the Gigaword dataset and by the pos_tagging_of_duc_dataset_and_vocab_pos_frequent() method for the DUC dataset. Then the NEG and LG strategies can be applied as follows:

    1. NEG Strategy
      The annotation of named entities is performed by the ner_of_dataset_and_vocabulary_of_ner_words() method for the Gigaword dataset and by the ner_of_duc_dataset_and_vocab_of_ne() method for the DUC dataset. Then the conver_dataset_with_ner_from_stanford_and_wordnet() method (for the Gigaword dataset) and the conver_duc_dataset_with_ner_from_stanford_and_wordnet() method (for the DUC dataset) generalize these datasets according to the NEG strategy, provided that their parameters have been set accordingly.

    2. LG Strategy
      The word_freq_hypernym_paths() method produces a file that contains a vocabulary with the frequency and the hypernym path of each word. This file is then used by the vocab_based_on_hypernyms() method in order to produce a file containing a vocabulary of the words that are candidates for generalization. Finally, for the Gigaword dataset, the convert_dataset_to_general() method produces the files with the summary-article pairs which constitute the generalized dataset, while for the DUC dataset the convert_duc_dataset_based_on_level_of_generalizetion() method is used. The hyperparameters of these methods should be set accordingly.

  3. Building the dataset for training, validation and testing
    The BuildDataset class (build_dataset.py file) creates the files which are given as input to the deep learning model for training, validation or testing.
    To build the dataset, the appropriate file paths should be set in the __init__() of the BuildDataset class before executing the following commands, where the -model argument specifies the employed generalization strategy (e.g. lg100d5, neg100 etc.):

    1. Building the training dataset: python build_dataset.py -mode train -model lg100d5g
    2. Building the validation dataset: python build_dataset.py -mode validation -model lg100d5g
    3. Building the testing dataset: python build_dataset.py -mode test -model lg100d5g

  4. Training
    Training is performed by the Train class (file train_v2.py), having set the hyperparameters accordingly. The files produced in the previous step (building the dataset) are used as input in this phase. Training is started with the command python train.py -model neg100, where the -model argument specifies the employed generalization strategy (e.g. lg100d5, neg100 etc.).

  5. Post-processing of generalized summaries
    In the testing phase, post-processing of the generalized summaries produced by the deep learning model is required in order to replace the generalized concepts of each generalized summary with the specific concepts of the corresponding original article. This task is performed by the PostProcessing class, by setting the parameters in its __init__() method accordingly. More specifically, the mode should be set to "lg" or "neg" according to the employed text generalization strategy. Also, the parameters of the neg_postprocessing() and lg_postprocessing() methods for the file paths, the text similarity function and the context window should be set accordingly (a simplified sketch of this idea is given after this list).

  6. Testing
    The Testing class (file testing.py) performs the testing of this framework. For the Gigaword dataset, a subset of its test set (e.g. 4000 instances) should be used in order to evaluate the framework, while for the DUC dataset the whole set of instances is used. The Testing class requires the official ROUGE package for measuring the performance of the proposed framework.
    In order to perform testing, the appropriate file paths should be set in the __init__() of the Testing class, running one of the following modes:

    1. Testing for Gigaword: python testing.py -mode gigaword
    2. Testing for DUC: python testing.py -mode duc
    3. Testing for DUC capped to 75 bytes: python testing.py -mode duc75b
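
As referenced in step 5, the following is a minimal sketch of the post-processing idea only, assuming a generic text similarity function, a set of generalized tags and a token-level alignment between the generalized and the original article; the function and parameter names are hypothetical and do not correspond to the actual PostProcessing implementation:

  # Hypothetical sketch; the real logic lives in the PostProcessing class.
  def postprocess(generalized_summary, generalized_article, original_article,
                  generalized_tags, similarity, window=2):
      """Replace each generalized concept of the predicted summary with a
      specific word taken from the original (preprocessed) article."""
      summary = generalized_summary.split()
      gen_art = generalized_article.split()
      orig_art = original_article.split()
      output = []
      for i, token in enumerate(summary):
          if token not in generalized_tags:
              output.append(token)                   # ordinary word: keep as-is
              continue
          # candidate positions of this generalized concept in the article
          positions = [j for j, w in enumerate(gen_art) if w == token]
          if not positions:
              output.append(token)
              continue
          # pick the occurrence whose context window is most similar to the
          # context of the concept inside the summary
          summary_ctx = summary[max(0, i - window):i + window + 1]
          best = max(positions, key=lambda j: similarity(
              summary_ctx, gen_art[max(0, j - window):j + window + 1]))
          output.append(orig_art[best])              # the specific, original word
      return " ".join(output)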
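
Putting the commands of steps 3, 4 and 6 together, a complete run for, e.g., the neg100 strategy would roughly look as follows (the arguments must match the chosen generalization strategy and the configured paths):

  python build_dataset.py -mode train -model neg100
  python build_dataset.py -mode validation -model neg100
  python build_dataset.py -mode test -model neg100
  python train.py -model neg100
  python testing.py -mode gigaword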

Setting parameters and paths
The values of the hyperparameters should be specified in the file parameters.py, while the paths of the corresponding files should be set in the file paths.py.
Additionally, a file with word embeddings (e.g. word2vec) is required; its file path and the dimension of the vectors (e.g. 300) should be specified in the files paths.py and parameters.py, respectively.
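
For illustration, such entries might look like the following (the variable names are assumptions; the actual names used in paths.py and parameters.py may differ):

  # paths.py (hypothetical entry)
  word_embeddings_path = "embeddings/word2vec_300d.txt"   # pretrained word embeddings file
  # parameters.py (hypothetical entry)
  embedding_dim = 300                                     # must match the dimension of the embeddings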

The project was developed in Python 3.5, and the required Python packages are listed in the file requirements.txt.
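For example, the dependencies can be installed in a Python 3.5 environment with pip install -r requirements.txt.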

The code described above includes the functionality that was used in the experimental procedure of the corresponding paper. However, the proposed framework is not limited to the current implementation, as it is based on a well-defined theoretical model that allows its performance to be enhanced by extending or improving this implementation (e.g. using a better taxonomy of concepts, a different machine learning model or an alternative similarity method for the post-processing task).
