nakhunchumpolsathien / TR-TPBS

Licence: MIT license
A Dataset for Thai Text Summarization with over 310K articles.

Programming Languages

Jupyter Notebook

Projects that are alternatives to or similar to TR-TPBS

Adversarial video summary
Unofficial PyTorch Implementation of SUM-GAN from "Unsupervised Video Summarization with Adversarial LSTM Networks" (CVPR 2017)
Stars: ✭ 187 (+648%)
Mutual labels:  summarization
pn-summary
A well-structured summarization dataset for the Persian language!
Stars: ✭ 29 (+16%)
Mutual labels:  summarization
BillSum
US Bill Summarization Corpus
Stars: ✭ 31 (+24%)
Mutual labels:  summarization
Pyrouge
A Python wrapper for the ROUGE summarization evaluation package
Stars: ✭ 192 (+668%)
Mutual labels:  summarization
SRB
Code for "Improving Semantic Relevance for Sequence-to-Sequence Learning of Chinese Social Media Text Summarization"
Stars: ✭ 41 (+64%)
Mutual labels:  summarization
FocusSeq2Seq
[EMNLP 2019] Mixture Content Selection for Diverse Sequence Generation (Question Generation / Abstractive Summarization)
Stars: ✭ 109 (+336%)
Mutual labels:  summarization
Multi News
Large-scale multi-document summarization dataset and code
Stars: ✭ 158 (+532%)
Mutual labels:  summarization
DynamicEntitySummarization-DynES
Dynamic Entity Summarization (DynES)
Stars: ✭ 21 (-16%)
Mutual labels:  summarization
ConDigSum
Code for EMNLP 2021 paper "Topic-Aware Contrastive Learning for Abstractive Dialogue Summarization"
Stars: ✭ 62 (+148%)
Mutual labels:  summarization
DeepChannel
The pytorch implementation of paper "DeepChannel: Salience Estimation by Contrastive Learning for Extractive Document Summarization"
Stars: ✭ 24 (-4%)
Mutual labels:  summarization
Text summarization with tensorflow
Implementation of a seq2seq model for summarization of textual data. Demonstrated on amazon reviews, github issues and news articles.
Stars: ✭ 226 (+804%)
Mutual labels:  summarization
Summarization Papers
Summarization Papers
Stars: ✭ 238 (+852%)
Mutual labels:  summarization
Machine-Learning-Notes
Lecture Notes of Andrew Ng's Machine Learning Course
Stars: ✭ 60 (+140%)
Mutual labels:  summarization
Textrank
🌀 ⚡️ 🌍 TextRank (automatic text summarization) for PHP8
Stars: ✭ 193 (+672%)
Mutual labels:  summarization
TitleStylist
Source code for our "TitleStylist" paper at ACL 2020
Stars: ✭ 72 (+188%)
Mutual labels:  summarization
Cx db8
a contextual, biasable, word-or-sentence-or-paragraph extractive summarizer powered by the latest in text embeddings (Bert, Universal Sentence Encoder, Flair)
Stars: ✭ 164 (+556%)
Mutual labels:  summarization
Paribhasha
paribhasha.herokuapp.com/
Stars: ✭ 21 (-16%)
Mutual labels:  summarization
hf-experiments
Experiments with Hugging Face 🔬 🤗
Stars: ✭ 37 (+48%)
Mutual labels:  summarization
teanaps
An open-source Python library for natural language processing and text analysis.
Stars: ✭ 91 (+264%)
Mutual labels:  summarization
factsumm
FactSumm: Factual Consistency Scorer for Abstractive Summarization
Stars: ✭ 83 (+232%)
Mutual labels:  summarization

TR-TPBS

A dataset for Thai text summarization.

Update 23 Nov. 2020

The official and larger version of this dataset, called ThaiSum, can be found in this repo. It also comes with several trained models available to download.

Download TR-TPBS Dataset

File Remark Size
TR-TPBS contains title, body, summary, labels, tags and url columns. 2.05 GB
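
A minimal loading sketch, assuming the downloaded file extracts to a single CSV (the file name below is hypothetical; use the actual path of the extracted file):

```python
import pandas as pd

# Hypothetical path; the extracted TR-TPBS file provides the columns
# title, body, summary, labels, tags and url as documented above.
df = pd.read_csv("TR-TPBS.csv")

print(len(df), "articles")
print(df.columns.tolist())
print(df[["title", "summary"]].head())
```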

Additional Datasets

These two files are the previous versions of TR-TPBS, before they were combined. Note that the articles in these files were preprocessed with slightly different filtering conditions from those of TR-TPBS. The number at the end of each dataset’s name indicates the approximate number of articles it contains. The newest articles in these two files were published online up to December 2019.

File Remark Size
Thairath-222k contains title, body, summary, labels, tags, url and date columns. 1.72 GB
ThaiPBS-111k contains similar columns as Thairath-222k’s except date. 0.51 GB

Need Pretrained Models for Research Purposes?

If you would like to obtain pretrained summarization models for research purposes, please contact nakhun.chum(at sign)gmail.com. The following pretrained models are available upon request:

Model Source code Size
ARedSum-base ARedSumSentRank 2.2 GB
ARedSum-CTX 738 MB
BertSumExt BertSum 2.2 GB
BertSumAbs 3.7 GB
BertSumExtAbs 3.7 GB

Introduction

TR-TPBS is a medium-sized, multi-purpose NLP benchmark for the Thai language. The dataset was crawled from the Thairath (TR) and ThaiPBS (TPBS) news websites. Its main objective is Thai text summarization.

To the best of our knowledge, this is the largest news dataset for Thai text summarization: previous studies on this topic used small datasets of up to 500 documents. This is understandable because those studies were based on statistical methods rather than sequence-to-sequence ones, and therefore did not require large amounts of text for training. Our experiment is thus the first to study Thai text summarization with deep learning methods on a dataset of this size. We explored both extractive and abstractive methods on this dataset.

Apart from text summarization, TR-TPBS can be used for several other NLP tasks, e.g. headline generation, news classification and keyphrase extraction (which may require additional pre-processing).
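
As an illustration of these auxiliary uses, here is a sketch assuming the dataset has been loaded into a pandas DataFrame `df` with the documented columns (the tag delimiter is an assumption and should be verified against the data):

```python
# Headline generation: article body as the source, title as the target.
headline_pairs = df[["body", "title"]].dropna()

# News classification: labels column as the class target (its exact format
# should be inspected before use).
classification_pairs = df[["body", "labels"]].dropna()

# Keyphrase extraction: the tags column may need additional pre-processing,
# e.g. splitting a delimited string into individual tags.
tags = df["tags"].dropna().str.split(",")
```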

Dataset Properties

See exploration.ipynb
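
The notebook in the repository contains the full analysis; the kind of basic statistics it covers can be sketched as follows (character counts are used here because Thai text is not whitespace-delimited; a Thai word tokenizer such as PyThaiNLP's could be substituted for word-level figures; both choices are assumptions, not the notebook's exact code):

```python
import pandas as pd

df = pd.read_csv("TR-TPBS.csv")  # hypothetical path, as above

lengths = pd.DataFrame({
    "body_chars": df["body"].str.len(),
    "summary_chars": df["summary"].str.len(),
})
print(lengths.describe())
print("mean summary/body length ratio:",
      (lengths["summary_chars"] / lengths["body_chars"]).mean())
```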

Experiment Settings and Results

We evaluate the performance of the TR-TPBS dataset using existing extractive and abstractive baselines. Please refer to PreSum, BertSum, and ARedSum for more technical information and their implementation code.

Experiment Settings

Both abstractive and extractive Bert-based summarization models (including ARedSum) were trained on a single GPU (NVIDIA TITAN RTX).

Extractive settings

Both BertSumExt and ARedSum models were trained for 100,000 steps with a batch size of 6,000. The rest of the training settings are identical to BertSum’s. It took approximately 80 hours to train each extractive model.

Abstractive Settings

All abstractive models were trained for 300,000 steps with a batch size of 1,120 for Bert-based models and 1,200 for Transformer-based models. The rest of the training settings are identical to PreSum’s. It took approximately 150 hours to train each abstractive model.

We used the ‘bert-base-multilingual-cased’ version of BERT in this experiment. We strongly suggest training all Bert-based models on multiple GPUs to shorten the training time and obtain better results.
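
For reference, the named checkpoint can be loaded with the Hugging Face transformers library; this is only a sketch of the bare encoder, not the BertSum/PreSum training setup, which wraps it with additional summarization layers:

```python
from transformers import BertModel, BertTokenizer

# The multilingual checkpoint used in this experiment.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")

inputs = tokenizer("ตัวอย่างข้อความภาษาไทย", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```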

Results

ROUGE F1 scores (R1, R2 and RL) are used to report the experimental results.

TR-TPBS (test set):
Models R1 R2 RL
Extractive
Oracle 50.89 22.10 50.74
Lead-2 42.98 22.71 42.94
ARedSum 40.35 20.38 40.30
BertSumExt 44.58 20.26 44.51
Abstractive
BertSumAbs 51.09 26.92 51.04
BertSumExtAbs 53.19 28.19 53.13
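
ROUGE operates on token sequences, so Thai summaries are assumed to be word-tokenized and space-joined before scoring. Below is a minimal, self-contained sketch of how the R1, R2 and RL F1 scores are defined; it is not the exact evaluation script used to produce the table above.

```python
from collections import Counter

def rouge_n_f1(reference_tokens, candidate_tokens, n):
    """ROUGE-N F1: n-gram overlap between candidate and reference."""
    ref_ngrams = Counter(zip(*[reference_tokens[i:] for i in range(n)]))
    cand_ngrams = Counter(zip(*[candidate_tokens[i:] for i in range(n)]))
    overlap = sum((ref_ngrams & cand_ngrams).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_ngrams.values())
    recall = overlap / sum(ref_ngrams.values())
    return 2 * precision * recall / (precision + recall)

def rouge_l_f1(reference_tokens, candidate_tokens):
    """ROUGE-L F1 based on the longest common subsequence."""
    m, n = len(reference_tokens), len(candidate_tokens)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if reference_tokens[i] == candidate_tokens[j]:
                lcs[i + 1][j + 1] = lcs[i][j] + 1
            else:
                lcs[i + 1][j + 1] = max(lcs[i][j + 1], lcs[i + 1][j])
    overlap = lcs[m][n]
    if overlap == 0:
        return 0.0
    precision = overlap / n
    recall = overlap / m
    return 2 * precision * recall / (precision + recall)

# Illustrative placeholder summaries, pre-tokenized into Thai words.
reference = "สรุป ข่าว ตัวอย่าง สำหรับ การ ประเมิน".split()
candidate = "สรุป ข่าว สำหรับ การ ประเมิน".split()

print("R1:", rouge_n_f1(reference, candidate, 1))
print("R2:", rouge_n_f1(reference, candidate, 2))
print("RL:", rouge_l_f1(reference, candidate))
```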

Collected and Preprocessed by

License

TR-TPBS, Thairath-222k and ThaiPBS-111k datasets are licensed under MIT License.

Cite this work

@mastersthesis{chumpolsathien_2020, 
    title={Using Knowledge Distillation from Keyword Extraction to Improve the Informativeness of Neural Cross-lingual Summarization},
    author={Chumpolsathien, Nakhun}, 
    year={2020}, 
    school={Beijing Institute of Technology}
}