IlyaGusev / Summarus

Licence: apache-2.0
Models for automatic abstractive summarization

summarus

Abstractive and extractive summarization models, mostly for the Russian language. Built on top of AllenNLP.

Based on the following papers:

Contacts

Prerequisites

pip install -r requirements.txt

Commands

train.sh

Script for training a model, based on the AllenNLP 'train' command.

| Argument | Required | Description |
|:---------|:---------|:------------|
| -c | true | path to file with configuration |
| -s | true | path to directory where model will be saved |
| -t | true | path to train dataset |
| -v | true | path to val dataset |
| -r | false | recover from checkpoint |
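
For scripted experiments it can be convenient to assemble the train.sh arguments from Python. The following is a minimal sketch, not part of summarus: the function name `build_train_command` and all paths are hypothetical, and it assumes that `-r` is a bare switch that takes no value.

```python
import shlex
from typing import List


def build_train_command(config_path: str, model_dir: str, train_path: str,
                        val_path: str, recover: bool = False) -> List[str]:
    """Assemble an argument list for train.sh from the documented flags."""
    cmd = ["./train.sh",
           "-c", config_path,   # file with configuration
           "-s", model_dir,     # directory where the model will be saved
           "-t", train_path,    # train dataset
           "-v", val_path]      # val dataset
    if recover:
        # Assumption: -r is a switch without a value.
        cmd.append("-r")
    return cmd


if __name__ == "__main__":
    # All paths here are placeholders for illustration only.
    cmd = build_train_command("config.json", "models/run1",
                              "train.jsonl", "val.jsonl")
    print(shlex.join(cmd))
    # To actually launch training: subprocess.run(cmd, check=True)
```

Passing the command as a list (rather than one shell string) avoids quoting problems when paths contain spaces.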

predict.sh

Script for model evaluation. The test dataset should have the same format as the train dataset.

| Argument | Required | Default | Description |
|:---------|:---------|:--------|:------------|
| -t | true | | path to test dataset |
| -m | true | | path to tar.gz archive with model |
| -p | true | | name of Predictor |
| -c | false | 0 | CUDA device |
| -L | true | | Language ("ru" or "en") |
| -b | false | 32 | size of a batch with test examples to run simultaneously |
| -M | false | | path to meteor.jar for Meteor metric |
| -T | false | | tokenize gold and predicted summaries before metrics calculation |
| -D | false | | save temporary files with gold and predicted summaries |

summarus.util.train_subword_model

Script for subword model training.

| Argument | Default | Description |
|:---------|:--------|:------------|
| --train-path | | path to train dataset |
| --model-path | | path to directory where generated subword model will be saved |
| --model-type | bpe | type of subword model, see sentencepiece |
| --vocab-size | 50000 | size of the resulting subword model vocabulary |
| --config-path | | path to file with configuration for DatasetReader (with parse_set) |
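
To illustrate what the `--model-type` and `--vocab-size` options correspond to, here is a minimal sketch that calls sentencepiece directly. It is not the summarus script: the real script reads the dataset through a DatasetReader configuration, which is not reproduced here. The sketch assumes a plain text file with one sentence per line, and the helper names `subword_trainer_args` and `train_subword_model` are hypothetical.

```python
from typing import Dict, Union


def subword_trainer_args(train_text_path: str, model_prefix: str,
                         model_type: str = "bpe",
                         vocab_size: int = 50000) -> Dict[str, Union[str, int]]:
    """Keyword arguments for sentencepiece's SentencePieceTrainer.train.

    The defaults mirror the documented defaults of the script above.
    """
    return {
        "input": train_text_path,      # plain text, one sentence per line (assumption)
        "model_prefix": model_prefix,  # sentencepiece writes <prefix>.model and <prefix>.vocab
        "model_type": model_type,
        "vocab_size": vocab_size,
    }


def train_subword_model(train_text_path: str, model_prefix: str, **overrides) -> None:
    """Train a subword model with sentencepiece using the arguments above."""
    # Imported lazily so the argument helper works without sentencepiece installed.
    import sentencepiece as spm

    args = subword_trainer_args(train_text_path, model_prefix, **overrides)
    spm.SentencePieceTrainer.train(**args)
```

A small corpus may not support a 50000-entry vocabulary, in which case sentencepiece reports an error; passing a smaller `vocab_size` override is the usual workaround.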

Headline generation

Dataset splits:

Models:

Prediction script:

./predict.sh -t <path_to_test_dataset> -m ria_pgn_24kk.tar.gz -p subwords_summary -L ru 

Results:

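Train dataset: RIA, test dataset: RIA

| Model | R-1-f | R-2-f | R-L-f | BLEU |
|:------|:------|:------|:------|:-----|
| ria_copynet_10kk | 40.0 | 23.3 | 37.5 | 52.6 |
| ria_pgn_24kk | 42.3 | 25.1 | 39.6 | 54.2 |
| ria_mbart | 42.8 | 25.5 | 39.9 | 55.1 |
| First Sentence | 24.1 | 10.6 | 16.7 | - |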

Train dataset: RIA, eval dataset: Lenta

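| Model | R-1-f | R-2-f | R-L-f | BLEU |
|:------|:------|:------|:------|:-----|
| ria_copynet_10kk | 25.6 | 12.3 | 23.0 | 36.1 |
| ria_pgn_24kk | 26.4 | 12.3 | 24.0 | 39.8 |
| ria_mbart | 30.3 | 14.5 | 27.1 | 43.2 |
| First Sentence | 25.5 | 11.2 | 19.2 | 25.5 |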
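
For readers unfamiliar with the column names, the sketch below shows what an R-1-f score measures: the F1 of unigram overlap between a gold and a predicted summary. It also includes a naive version of a first-sentence baseline, assuming that the "First Sentence" row means using the first sentence of the article as the prediction. This is an illustration only, not the evaluation code behind the tables: the whitespace tokenizer and the `". "` sentence split are simplifying assumptions, and the tables appear to report scores scaled by 100.

```python
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase whitespace tokenization; real evaluators tokenize differently.
    return text.lower().split()


def rouge_1_f(reference: str, prediction: str) -> float:
    """Unigram-overlap F1 between a reference and a predicted summary (0.0 to 1.0)."""
    ref_counts = Counter(tokenize(reference))
    pred_counts = Counter(tokenize(prediction))
    # Counter intersection clips each token's count to the smaller of the two sides.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def first_sentence(text: str) -> str:
    # Assumption: sentences are separated by ". "; a real splitter handles
    # abbreviations, quotes and other punctuation.
    return text.split(". ")[0].rstrip(".") + "."


if __name__ == "__main__":
    article = "Prices rose in March. Analysts expect a slowdown."
    gold = "Prices rose in March"
    print(rouge_1_f(gold, first_sentence(article)))
```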

Summarization - CNN/DailyMail

Dataset splits:

Models:

Prediction script:

./predict.sh -t <path_to_test_dataset> -m cnndm_pgn_25kk.tar.gz -p words_summary -L en -R

Results:

| Model | R-1-f | R-2-f | R-L-f | METEOR | BLEU |
|:------|:------|:------|:------|:-------|:-----|
| cnndm_pgn_25kk | 38.5 | 16.5 | 33.4 | 17.6 | 47.7 |

Summarization - Gazeta, Russian news dataset

Models:

Prediction scripts:

./predict.sh -t <path_to_test_dataset> -m gazeta_pgn_7kk.tar.gz -p subwords_summary -L ru -T
./predict.sh -t <path_to_test_dataset> -m gazeta_summarunner_3kk.tar.gz -p subwords_summary_sentences -L ru -T

External models:

Results:

| Model | R-1-f | R-2-f | R-L-f | METEOR | BLEU |
|:------|:------|:------|:------|:-------|:-----|
| gazeta_pgn_7kk | 29.4 | 12.7 | 24.6 | 21.2 | 38.8 |
| gazeta_pgn_7kk_cov | 29.8 | 12.8 | 25.4 | 22.1 | 40.8 |
| gazeta_pgn_25kk | 29.6 | 12.8 | 24.6 | 21.5 | 39 |
| gazeta_pgn_words_13kk | 29.4 | 12.6 | 24.4 | 20.9 | 35.9 |
| gazeta_summarunner_3kk | 31.6 | 13.7 | 27.1 | 26.0 | 46.3 |
| gazeta_mbart | 32.6 | 14.6 | 28.2 | 25.7 | 49.8 |
| gazeta_mbart_lower | 32.7 | 14.7 | 28.3 | 25.8 | 48.7 |