
IlyaGusev / Summarus

Licence: apache-2.0

Models for automatic abstractive summarization

Programming language: Python

Labels: deep-learning, nlp-machine-learning, summarization

Projects that are alternatives to or similar to Summarus

Paribhasha

paribhasha.herokuapp.com/

Stars: ✭ 21 (-74.7%)

Mutual labels: summarization, nlp-machine-learning

Onnxt5

Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.

Stars: ✭ 143 (+72.29%)

Mutual labels: nlp-machine-learning, summarization

Textrank

TextRank implementation for Python 3.

Stars: ✭ 1,008 (+1114.46%)

Mutual labels: summarization

Potara

Multi-document summarization tool relying on ILP and sentence fusion

Stars: ✭ 72 (-13.25%)

Mutual labels: summarization

Argument Reasoning Comprehension Task

The Argument Reasoning Comprehension Task: Source codes & Datasets

Stars: ✭ 57 (-31.33%)

Mutual labels: nlp-machine-learning

Mitie chinese wikipedia corpus

Pre-trained Wikipedia corpus by MITIE

Stars: ✭ 43 (-48.19%)

Mutual labels: nlp-machine-learning

Aiops platform

An Artificial Intelligence Platform for IT Operations.

Stars: ✭ 63 (-24.1%)

Mutual labels: nlp-machine-learning

Coursera Natural Language Processing Specialization

Programming assignments from all courses in the Coursera Natural Language Processing Specialization offered by deeplearning.ai.

Stars: ✭ 39 (-53.01%)

Mutual labels: nlp-machine-learning

Russian news corpus

Russian mass media stemmed texts corpus (a corpus of lemmatized, i.e. morphologically normalized, texts from Russian mass media)

Stars: ✭ 76 (-8.43%)

Mutual labels: nlp-machine-learning

Wongnai Corpus

Collection of Wongnai's datasets

Stars: ✭ 57 (-31.33%)

Mutual labels: nlp-machine-learning

Nlp Paper

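Dialogue and speech within the field of natural language processing: a collection of related papers (with reading notes), model reproductions, data processing, and more (code provided in both TensorFlow and PyTorch versions)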

Stars: ✭ 67 (-19.28%)

Mutual labels: nlp-machine-learning

Text Classification Keras

📚 Text classification library with Keras

Stars: ✭ 53 (-36.14%)

Mutual labels: nlp-machine-learning

News push project

Real Time News Scraping and Recommendation System - React | Tensorflow | NLP | News Scrapers

Stars: ✭ 44 (-46.99%)

Mutual labels: nlp-machine-learning

Awesome machine learning solutions

A curated list of repositories for my book Machine Learning Solutions.

Stars: ✭ 65 (-21.69%)

Mutual labels: summarization

Predicting Myers Briggs Type Indicator With Recurrent Neural Networks

Stars: ✭ 43 (-48.19%)

Mutual labels: nlp-machine-learning

Sotawhat

Returns latest research results by crawling arxiv papers and summarizing abstracts. Helps you stay afloat with so many new papers everyday.

Stars: ✭ 1,181 (+1322.89%)

Mutual labels: summarization

Tika Python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Stars: ✭ 997 (+1101.2%)

Mutual labels: nlp-machine-learning

Lexrankr

LexRank for Korean.

Stars: ✭ 50 (-39.76%)

Mutual labels: summarization

How To Mine Newsfeed Data And Extract Interactive Insights In Python

A practical guide to topic mining and interactive visualizations

Stars: ✭ 61 (-26.51%)

Mutual labels: nlp-machine-learning

Crd3

The repo containing the Critical Role Dungeons and Dragons Dataset.

Stars: ✭ 83 (+0%)

Mutual labels: summarization


summarus

Abstractive and extractive summarization models, mostly for the Russian language. Built on top of AllenNLP.

Based on the following papers:

Contacts

Gitter chat: summarus/community
Telegram: @YallenGusev

Prerequisites

pip install -r requirements.txt

Commands

train.sh

Script for training a model, based on the AllenNLP 'train' command.

Argument	Required	Description
-c	true	path to file with configuration
-s	true	path to directory where model will be saved
-t	true	path to train dataset
-v	true	path to val dataset
-r	false	recover from checkpoint
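
The flags in the table above can be assembled programmatically, for example when launching several training runs from a Python driver. The following is a minimal sketch, not part of the repository: the helper name `build_train_command`, the `./train.sh` location, and the example paths are hypothetical, and `-r` is assumed to be a bare switch that takes no value.

```python
import shlex
from typing import List


def build_train_command(config: str, save_dir: str, train_path: str,
                        val_path: str, recover: bool = False,
                        script: str = "./train.sh") -> List[str]:
    """Assemble an argv list for train.sh from the documented flags."""
    command = [script,
               "-c", config,       # path to file with configuration
               "-s", save_dir,     # directory where the model will be saved
               "-t", train_path,   # path to train dataset
               "-v", val_path]     # path to val dataset
    if recover:
        # Assumption: -r is a switch without a value.
        command.append("-r")
    return command


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    argv = build_train_command("config.json", "models/run1",
                               "train.jsonl", "val.jsonl")
    print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run` unchanged; building a list rather than a string avoids shell-quoting problems with paths that contain spaces.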

predict.sh

Script for model evaluation. The test dataset should have the same format as the train dataset.

Argument	Required	Default	Description
-t	true		path to test dataset
-m	true		path to tar.gz archive with model
-p	true		name of Predictor
-c	false	0	CUDA device
-L	true		Language ("ru" or "en")
-b	false	32	size of a batch with test examples to run simultaneously
-M	false		path to meteor.jar for Meteor metric
-T	false		tokenize gold and predicted summaries before metrics calculation
-D	false		save temporary files with gold and predicted summaries
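
As a rough illustration of how the required and optional arguments above fit together, here is a small Python sketch that builds a predict.sh command line. The helper name `build_predict_command` is hypothetical, and treating `-T` and `-D` as value-less switches is an assumption (the prediction commands later in this document use `-T` without a value). Only flags listed in the table are emitted.

```python
from typing import List, Optional


def build_predict_command(test_path: str, model_archive: str, predictor: str,
                          language: str, cuda_device: Optional[int] = None,
                          batch_size: Optional[int] = None,
                          meteor_jar: Optional[str] = None,
                          tokenize: bool = False, save_temp: bool = False,
                          script: str = "./predict.sh") -> List[str]:
    """Assemble an argv list for predict.sh from the documented flags."""
    if language not in ("ru", "en"):
        raise ValueError('language must be "ru" or "en"')
    command = [script, "-t", test_path, "-m", model_archive,
               "-p", predictor, "-L", language]
    if cuda_device is not None:
        command += ["-c", str(cuda_device)]   # script default is 0
    if batch_size is not None:
        command += ["-b", str(batch_size)]    # script default is 32
    if meteor_jar is not None:
        command += ["-M", meteor_jar]         # enables the Meteor metric
    if tokenize:
        command.append("-T")                  # assumption: bare switch
    if save_temp:
        command.append("-D")                  # assumption: bare switch
    return command
```

Optional values are left out of the command unless set explicitly, so the script's own defaults apply.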

summarus.util.train_subword_model

Script for subword model training.

Argument	Default	Description
--train-path		path to train dataset
--model-path		path to directory where generated subword model will be saved
--model-type	bpe	type of subword model, see sentencepiece
--vocab-size	50000	size of the resulting subword model vocabulary
--config-path		path to file with configuration for DatasetReader (with parse_set)
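
For orientation, the options above map naturally onto the training call of the sentencepiece Python package. The sketch below is an assumption about what such a call could look like, not the script's actual implementation: the helper names and the `subword` file prefix are hypothetical, and it expects a plain-text corpus, whereas the real script reads the dataset through the configured DatasetReader.

```python
import os
from typing import Dict, Union


def sentencepiece_train_kwargs(train_text_path: str, model_dir: str,
                               model_type: str = "bpe",
                               vocab_size: int = 50000,
                               prefix_name: str = "subword"
                               ) -> Dict[str, Union[str, int]]:
    """Translate the documented options into sentencepiece keyword arguments."""
    return {
        "input": train_text_path,                             # plain-text corpus
        "model_prefix": os.path.join(model_dir, prefix_name), # output file prefix
        "model_type": model_type,                             # e.g. "bpe" or "unigram"
        "vocab_size": vocab_size,
    }


def train_subword_model(**kwargs: Union[str, int]) -> None:
    """Train the model; requires the sentencepiece package to be installed."""
    import sentencepiece as spm  # imported lazily so the helper above works without it
    spm.SentencePieceTrainer.train(**kwargs)
```

Calling `train_subword_model(**sentencepiece_train_kwargs("corpus.txt", "models"))` would write `subword.model` and `subword.vocab` into the `models` directory.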

Headline generation

First paper: Importance of Copying Mechanism for News Headline Generation
Slides: Importance of Copying Mechanism for News Headline Generation
Second paper: Advances of Transformer-Based Models for News Headline Generation

Dataset splits:

RIA original dataset: https://github.com/RossiyaSegodnya/ria_news_dataset
RIA train/val/test: https://www.dropbox.com/s/rermx1r8lx9u7nl/ria.tar.gz
RIA dataset preprocessed for mBART: https://www.dropbox.com/s/iq2ih8sztygvz0m/ria_data_mbart_512_200.tar.gz
Lenta original dataset: https://github.com/yutkin/Lenta.Ru-News-Dataset
Lenta train/val/test: https://www.dropbox.com/s/v9i2nh12a4deuqj/lenta.tar.gz
Lenta dataset preprocessed for mBART: https://www.dropbox.com/s/4oo8jazmw3izqvr/lenta_mbart_data_512_200.tar.gz

Models:

Prediction script:

./predict.sh -t <path_to_test_dataset> -m ria_pgn_24kk.tar.gz -p subwords_summary -L ru

Results:

Train dataset: RIA, test dataset: RIA

Model	R-1-f	R-2-f	R-L-f	BLEU
ria_copynet_10kk	40.0	23.3	37.5	52.6
ria_pgn_24kk	42.3	25.1	39.6	54.2
ria_mbart	42.8	25.5	39.9	55.1
First Sentence	24.1	10.6	16.7	-
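
To make the R-1-f and R-2-f columns concrete: they are ROUGE-N F-scores, that is, the harmonic mean of n-gram precision and recall between a predicted and a gold text. The following is a simplified sketch that assumes lowercasing and whitespace tokenization; it is not the metric implementation used by the repository, and the table values appear to be scaled by 100.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f(predicted: str, gold: str, n: int = 1) -> float:
    """ROUGE-N F1 with clipped n-gram overlap (simplified)."""
    pred = ngrams(predicted.lower().split(), n)
    ref = ngrams(gold.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For instance, a prediction whose unigrams all appear in a gold text twice as long has precision 1.0 and recall 0.5, giving an F-score of about 0.667.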

Train dataset: RIA, eval dataset: Lenta

Model	R-1-f	R-2-f	R-L-f	BLEU
ria_copynet_10kk	25.6	12.3	23.0	36.1
ria_pgn_24kk	26.4	12.3	24.0	39.8
ria_mbart	30.3	14.5	27.1	43.2
First Sentence	25.5	11.2	19.2	25.5
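
The "First Sentence" rows presumably correspond to a baseline that uses the first sentence of the article as the headline. A minimal sketch of such a baseline is shown below; the punctuation-based sentence splitting is an assumption, and the repository may split sentences differently.

```python
import re


def first_sentence(text: str) -> str:
    """Return the text up to and including the first sentence terminator."""
    # Split once at whitespace that follows '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)
    return parts[0]
```

A naive splitter like this breaks on abbreviations and initials, so a proper sentence tokenizer would be a safer choice for real evaluation.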

Summarization - CNN/DailyMail

Dataset splits:

CNN/DailyMail jsonl dataset: https://www.dropbox.com/s/35ezpg78rtukkgh/cnn_dm_jsonl.tar.gz

Models:

cnndm_pgn_25kk

Prediction script:

./predict.sh -t <path_to_test_dataset> -m cnndm_pgn_25kk.tar.gz -p words_summary -L en -R

Results:

Model	R-1-f	R-2-f	R-L-f	METEOR	BLEU
cnndm_pgn_25kk	38.5	16.5	33.4	17.6	47.7
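
The R-L-f column is a ROUGE-L F-score, which is based on the longest common subsequence (LCS) of tokens rather than on fixed-length n-grams. The sketch below computes a sentence-level variant with whitespace tokenization; this is a simplification, and the repository's metric code may differ (for example, by using summary-level ROUGE-L).

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f(predicted: str, gold: str) -> float:
    """Sentence-level ROUGE-L F1 (simplified)."""
    pred = predicted.lower().split()
    ref = gold.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Unlike ROUGE-2, this score rewards tokens that appear in the right order even when other words intervene.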

Summarization - Gazeta, Russian news dataset

Paper: Dataset for Automatic Summarization of Russian News
Gazeta dataset: https://github.com/IlyaGusev/gazeta
Usage examples:

Models:

Prediction scripts:

./predict.sh -t <path_to_test_dataset> -m gazeta_pgn_7kk.tar.gz -p subwords_summary -L ru -T
./predict.sh -t <path_to_test_dataset> -m gazeta_summarunner_3kk.tar.gz -p subwords_summary_sentences -L ru -T

External models:

Results:

Model	R-1-f	R-2-f	R-L-f	METEOR	BLEU
gazeta_pgn_7kk	29.4	12.7	24.6	21.2	38.8
gazeta_pgn_7kk_cov	29.8	12.8	25.4	22.1	40.8
gazeta_pgn_25kk	29.6	12.8	24.6	21.5	39
gazeta_pgn_words_13kk	29.4	12.6	24.4	20.9	35.9
gazeta_summarunner_3kk	31.6	13.7	27.1	26.0	46.3
gazeta_mbart	32.6	14.6	28.2	25.7	49.8
gazeta_mbart_lower	32.7	14.7	28.3	25.8	48.7


Stars: ✭ 83
