Licence: GPL-3.0 License
Open-source benchmark datasets and pretrained transformer models in the Filipino language.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Filipino-Text-Benchmarks

Kashgari
Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.
Stars: ✭ 2,235 (+10059.09%)
Mutual labels:  text-classification, transfer-learning, bert
SIGIR2021 Conure
One Person, One Model, One World: Learning Continual User Representation without Forgetting
Stars: ✭ 23 (+4.55%)
Mutual labels:  transformer, transfer-learning, bert
Nlp chinese corpus
大规模中文自然语言处理语料 Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+30154.55%)
Mutual labels:  text-classification, corpus, bert
Clue
中文语言理解测评基准 Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+10922.73%)
Mutual labels:  benchmark, corpus, bert
backprop
Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
Stars: ✭ 229 (+940.91%)
Mutual labels:  text-classification, transfer-learning, bert
Kevinpro-NLP-demo
All NLP you Need Here. 个人实现了一些好玩的NLP demo,目前包含13个NLP应用的pytorch实现
Stars: ✭ 117 (+431.82%)
Mutual labels:  text-classification, transformer, bert
tensorflow-ml-nlp-tf2
텐서플로2와 머신러닝으로 시작하는 자연어처리 (로지스틱회귀부터 BERT와 GPT3까지) 실습자료
Stars: ✭ 245 (+1013.64%)
Mutual labels:  transformer, bert, nli
COVID-19-Tweet-Classification-using-Roberta-and-Bert-Simple-Transformers
Rank 1 / 216
Stars: ✭ 24 (+9.09%)
Mutual labels:  text-classification, transformer, bert
MinTL
MinTL: Minimalist Transfer Learning for Task-Oriented Dialogue Systems
Stars: ✭ 61 (+177.27%)
Mutual labels:  transformer, transfer-learning
text-style-transfer-benchmark
Text style transfer benchmark
Stars: ✭ 56 (+154.55%)
Mutual labels:  benchmark, transformer
golgotha
Contextualised Embeddings and Language Modelling using BERT and Friends using R
Stars: ✭ 39 (+77.27%)
Mutual labels:  transformer, bert
ParsBigBird
Persian Bert For Long-Range Sequences
Stars: ✭ 58 (+163.64%)
Mutual labels:  transfer-learning, bert
ganbert-pytorch
Enhancing the BERT training with Semi-supervised Generative Adversarial Networks in Pytorch/HuggingFace
Stars: ✭ 60 (+172.73%)
Mutual labels:  text-classification, bert
transformer-models
Deep Learning Transformer models in MATLAB
Stars: ✭ 90 (+309.09%)
Mutual labels:  transformer, bert
Pytorch-NLU
Pytorch-NLU,一个中文文本分类、序列标注工具包,支持中文长文本、短文本的多类、多标签分类任务,支持中文命名实体识别、词性标注、分词等序列标注任务。 Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech ta…
Stars: ✭ 151 (+586.36%)
Mutual labels:  text-classification, bert
KAREN
KAREN: Unifying Hatespeech Detection and Benchmarking
Stars: ✭ 18 (-18.18%)
Mutual labels:  benchmark, bert
semantic-document-relations
Implementation, trained models and result data for the paper "Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles"
Stars: ✭ 21 (-4.55%)
Mutual labels:  transformer, bert
FNet-pytorch
Unofficial implementation of Google's FNet: Mixing Tokens with Fourier Transforms
Stars: ✭ 204 (+827.27%)
Mutual labels:  text-classification, transformer
WSDM-Cup-2019
[ACM-WSDM] 3rd place solution at WSDM Cup 2019, Fake News Classification on Kaggle.
Stars: ✭ 62 (+181.82%)
Mutual labels:  text-classification, bert
textgo
Text preprocessing, representation, similarity calculation, text search and classification. Let's go and play with text!
Stars: ✭ 33 (+50%)
Mutual labels:  text-classification, bert

Filipino-Text-Benchmarks

This repository contains open-source benchmark datasets and pretrained transformer models in the low-resource Filipino language.

Resources and code released in this repository come from the following papers, with more to be added as they are released:

  1. Investigating the True Performance of Transformers in Low-Resource Languages: A Case Study in Automatic Corpus Creation (Cruz et al., 2020)
  2. Establishing Baselines for Text Classification in Low-Resource Languages (Cruz & Cheng, 2020)
  3. Evaluating Language Model Finetuning Techniques for Low-resource Languages (Cruz & Cheng, 2019)

This repository is a continuous work in progress!

Table of Contents

Requirements

Reproducing Results

First, download the data and put it in the cloned repository:

mkdir Filipino-Text-Benchmarks/data

# Hatespeech Dataset
wget https://s3.us-east-2.amazonaws.com/blaisecruz.com/datasets/hatenonhate/hatespeech_raw.zip
unzip hatespeech_raw.zip -d Filipino-Text-Benchmarks/data && rm hatespeech_raw.zip

# Dengue Dataset
wget https://s3.us-east-2.amazonaws.com/blaisecruz.com/datasets/dengue/dengue_raw.zip
unzip dengue_raw.zip -d Filipino-Text-Benchmarks/data && rm dengue_raw.zip

# NewsPH-NLI Dataset
wget https://s3.us-east-2.amazonaws.com/blaisecruz.com/datasets/newsph/newsph-nli.zip
unzip newsph-nli.zip -d Filipino-Text-Benchmarks/data && rm newsph-nli.zip

Sentence Classification Tasks

To finetune for sentence classification tasks, use the train.py script provided in this repository. Here's an example finetuning a Tagalog ELECTRA model on the Hatespeech dataset:

export DATA_DIR='Filipino-Text-Benchmarks/data/hatespeech'

python Filipino-Text-Benchmarks/train.py \
    --pretrained jcblaise/electra-tagalog-small-cased-discriminator \
    --train_data ${DATA_DIR}/train.csv \
    --valid_data ${DATA_DIR}/valid.csv \
    --test_data ${DATA_DIR}/test.csv \
    --data_pct 1.0 \
    --checkpoint finetuned_model \
    --do_train true \
    --do_eval true \
    --msl 128 \
    --optimizer adam \
    --batch_size 32 \
    --add_token '[LINK],[MENTION],[HASHTAG]' \
    --weight_decay 1e-8 \
    --learning_rate 2e-4 \
    --adam_epsilon 1e-6 \
    --warmup_pct 0.1 \
    --epochs 3 \
    --seed 42

This should give you the following results:

Valid Loss 0.5272
Valid Acc 0.7568
Test Loss 0.3366
Test Accuracy 0.8649
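As a rough illustration of the schedule arguments above, --warmup_pct 0.1 conventionally means the learning rate ramps up linearly over the first 10% of training steps and then decays linearly to zero. The sketch below shows that convention; it is an assumption about typical warmup behavior, not the actual train.py internals:

```python
# Sketch of a linear warmup + linear decay schedule, as --warmup_pct is
# commonly implemented. Illustrative only; train.py's scheduler may differ.
def lr_at_step(step, total_steps, base_lr=2e-4, warmup_pct=0.1):
    warmup_steps = int(total_steps * warmup_pct)
    if step < warmup_steps:
        # ramp up linearly from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # decay linearly from base_lr down to 0 after warmup
    return base_lr * (total_steps - step) / max(1, total_steps - warmup_steps)

print(lr_at_step(50, 1000))   # mid-warmup: half of base_lr
print(lr_at_step(100, 1000))  # end of warmup: full base_lr
```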

The script will also output checkpoints of the finetuned model at the end of every epoch. These checkpoints can be used directly in a HuggingFace pipeline or loaded via the Transformers package for testing.

To perform multiclass classification, specify the label column names with the --label_columns option. Here's an example finetuning a Tagalog ELECTRA model on the Dengue dataset:

export DATA_DIR='Filipino-Text-Benchmarks/data/dengue'

python Filipino-Text-Benchmarks/train.py \
    --pretrained jcblaise/electra-tagalog-small-uncased-discriminator \
    --train_data ${DATA_DIR}/train.csv \
    --valid_data ${DATA_DIR}/valid.csv \
    --test_data ${DATA_DIR}/test.csv \
    --label_columns absent,dengue,health,mosquito,sick \
    --data_pct 1.0 \
    --checkpoint finetuned_model \
    --do_train true \
    --do_eval true \
    --msl 128 \
    --optimizer adam \
    --batch_size 32 \
    --add_token '[LINK],[MENTION],[HASHTAG]' \
    --weight_decay 1e-8 \
    --learning_rate 2e-4 \
    --adam_epsilon 1e-6 \
    --warmup_pct 0.1 \
    --epochs 3 \
    --seed 42

This should give you the following results:

Valid Loss 0.1586
Valid Acc 0.9414
Test Loss 0.1662
Test Accuracy 0.9375
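For context on the multi-label setup: predictions over label columns like the Dengue task's are typically derived by applying a sigmoid to each label's logit and thresholding at 0.5. The snippet below is a generic sketch under that assumption, not code from this repository:

```python
import math

# The five Dengue label columns from the command above.
LABELS = ['absent', 'dengue', 'health', 'mosquito', 'sick']

def predict_labels(logits, threshold=0.5):
    """Map one sample's per-label logits to label names via sigmoid + threshold."""
    probs = [1 / (1 + math.exp(-z)) for z in logits]
    return [name for name, p in zip(LABELS, probs) if p >= threshold]

# A positive logit maps to a probability above 0.5, so those labels are kept.
print(predict_labels([2.1, -1.3, 0.4, -2.0, 3.3]))  # ['absent', 'health', 'sick']
```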

Run train.py --help for details on each command-line argument.

Sentence-Pair Classification Tasks

To finetune for sentence-pair classification (entailment datasets), specify the text column names using the --text_columns option. Here's an example finetuning an uncased Tagalog ELECTRA model on the NewsPH-NLI dataset:

export DATA_DIR='Filipino-Text-Benchmarks/data/newsph-nli'

python Filipino-Text-Benchmarks/train.py \
    --pretrained jcblaise/electra-tagalog-small-uncased-discriminator \
    --train_data ${DATA_DIR}/train.csv \
    --valid_data ${DATA_DIR}/valid.csv \
    --test_data ${DATA_DIR}/test.csv \
    --text_columns s1,s2 \
    --data_pct 1.0 \
    --checkpoint finetuned_model \
    --do_train true \
    --do_eval true \
    --msl 128 \
    --optimizer adam \
    --batch_size 32 \
    --weight_decay 1e-8 \
    --learning_rate 2e-4 \
    --adam_epsilon 1e-6 \
    --warmup_pct 0.1 \
    --epochs 3 \
    --seed 45

This should give you the following results:

Valid Loss 0.1846
Valid Acc 0.9299
Test Loss 0.1874
Test Accuracy 0.9292

Logging Results

We use Weights & Biases to log experiment results. To enable logging, set three command line arguments:

export DATA_DIR='Filipino-Text-Benchmarks/data/hatespeech'

python Filipino-Text-Benchmarks/train.py \
    --pretrained jcblaise/electra-tagalog-small-uncased-discriminator \
    ...
    --use_wandb \
    --wandb_username USERNAME \
    --wandb_project_name PROJECT_NAME

Replace USERNAME with your Weights & Biases username, and PROJECT_NAME with the name of the project you're logging to.

Hyperparameter Search

To reproduce the hyperparameter search results, use the sweep function of Weights & Biases. A sample_sweep.yaml file is included as an example; it sweeps for good random seeds on the Hatespeech dataset. Edit the file to your specifications as needed. For more information, see the Weights & Biases sweep documentation.

To start a sweep, make sure to login first via the terminal, then run:

wandb sweep -p PROJECT_NAME Filipino-Text-Benchmarks/sample_sweep.yaml

This creates a sweep. Run an agent for it with:

cd Filipino-Text-Benchmarks && wandb agent USERNAME/PROJECT_NAME/SWEEP_ID

where SWEEP_ID is the id generated by the wandb sweep command, and USERNAME is your W&B username. Make sure that you have the necessary data files in the data/ folder inside the repository when running this example sweep.
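For reference, a W&B seed-sweep configuration generally has the shape below, shown here as the equivalent Python dict. This is a hypothetical sketch; the actual keys and values in sample_sweep.yaml may differ:

```python
# Hypothetical W&B sweep configuration for a random-seed grid search.
# The real sample_sweep.yaml in the repository may use different values.
sweep_config = {
    'program': 'train.py',
    'method': 'grid',
    'metric': {'name': 'valid_acc', 'goal': 'maximize'},
    'parameters': {
        'seed': {'values': [1, 2, 3, 42, 45]},
    },
}
```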

Demos

Finetuned Entailment Model Demo

We provide a finetuned version of the small-uncased ELECTRA model for the NewsPH-NLI sentence entailment task for demo purposes. Here's how to load it to make predictions:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the finetuned model
pretrained = 'jcblaise/electra-tagalog-small-uncased-discriminator-newsphnli'
tokenizer = AutoTokenizer.from_pretrained(pretrained)
model = AutoModelForSequenceClassification.from_pretrained(pretrained)

# Entailment Example
s1 = "Ayon sa mga respondents, nahihirapan daw ang pag-rescue sa mga biktima dahil sa flash flood."
s2 = "Kumuha ng tulong ang respondents sa Philippine Red Cross para sa mga lifeboat."

tokens = tokenizer([(s1, s2)], padding='max_length', truncation='longest_first', max_length=128, return_tensors='pt')
with torch.no_grad():
    out = model(**tokens)[0]
pred = out.argmax(1).item() # Outputs "0" which means "entailment"

# Contradiction example
s1 = "Ayon sa mga respondents, nahihirapan daw ang pag-rescue sa mga biktima dahil sa flash flood."
s2 = "Nagpulong ang mga guro sa National High School para sa papadating na Brigada Eskwela"

tokens = tokenizer([(s1, s2)], padding='max_length', truncation='longest_first', max_length=128, return_tensors='pt')
with torch.no_grad():
    out = model(**tokens)[0]
pred = out.argmax(1).item() # Outputs "1" which means "contradiction"
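The comments above imply the label mapping 0 → entailment, 1 → contradiction. A small helper (hypothetical, not part of this repository) makes that mapping explicit:

```python
# Label mapping implied by the demo comments above (an assumption,
# not an official API of the released model).
NLI_LABELS = {0: 'entailment', 1: 'contradiction'}

def decode_prediction(pred_ix):
    """Turn an argmax index from the model into a human-readable label."""
    return NLI_LABELS[pred_ix]

print(decode_prediction(0))  # entailment
```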

Using HuggingFace Pipelines

You can directly use the BERT and ELECTRA models in pipelines. Here's an example for mask filling:

from transformers import pipeline

pipe = pipeline('fill-mask', model='jcblaise/electra-tagalog-base-cased-generator')
s = "Ito ang Pilipinas, ang aking [MASK] Hinirang"
pipe(s)

# [{'score': 0.9990490078926086,
#   'sequence': '[CLS] Ito ang Pilipinas, ang aking Lupang Hinirang [SEP]',
#   'token': 21327,
#   'token_str': 'Lupang'},
#  ...

Here's the same example but without using pipelines:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the pretrained model
pretrained = 'jcblaise/electra-tagalog-base-cased-generator'
tokenizer = AutoTokenizer.from_pretrained(pretrained)
model = AutoModelForMaskedLM.from_pretrained(pretrained)

s = "Ito ang Pilipinas, ang aking [MASK] Hinirang"
inputs = tokenizer(s, return_tensors='pt')
with torch.no_grad():
    out = model(**inputs)

# Locate the [MASK] token dynamically instead of hard-coding its position
mask_ix = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero().item()
pred_ix = out['logits'][0, mask_ix].argmax()
pred = tokenizer.convert_ids_to_tokens([pred_ix])[0] # Outputs "Lupang"

Datasets

  • NewsPH-NLI Dataset download
    Sentence Entailment Dataset in Filipino
    First benchmark dataset for sentence entailment in the low-resource Filipino language, constructed by exploiting the structure of news articles. Contains 600,000 premise-hypothesis pairs in a 70-15-15 split for training, validation, and testing. Originally published in Cruz et al. (2020).

  • WikiText-TL-39 download
    Large Scale Unlabeled Corpora in Filipino
    Large scale, unlabeled text dataset with 39 Million tokens in the training set. Inspired by the original WikiText Long Term Dependency dataset (Merity et al., 2016). TL means "Tagalog." Originally published in Cruz & Cheng (2019).

  • Hate Speech Dataset download
    Text Classification Dataset in Filipino
    Contains 10,000 training tweets labeled as hate speech or non-hate speech, released with 4,232 validation and 4,232 testing samples. Collected during the 2016 Philippine Presidential Elections and originally used in Cabasag et al. (2019).

  • Dengue Dataset download
    Low-Resource Multiclass Text Classification Dataset in Filipino
    Benchmark dataset for low-resource multiclass classification, with 4,015 training, 500 validation, and 500 testing examples, each labeled with one or more of five classes (a sample can belong to multiple classes). Collected as tweets and originally used in Livelo & Cheng (2018).
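As a quick check of the NewsPH-NLI figures above, the 70-15-15 split over 600,000 pairs works out as follows (assuming the split is exact):

```python
# Split sizes for 600,000 premise-hypothesis pairs at 70-15-15.
total = 600_000
splits = {'train': 0.70, 'valid': 0.15, 'test': 0.15}
sizes = {name: round(total * frac) for name, frac in splits.items()}
print(sizes)  # {'train': 420000, 'valid': 90000, 'test': 90000}
```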

Pretrained ELECTRA Models

We release new ELECTRA models in small and base configurations, with both discriminator and generator models available. All models follow the same setups and were trained with the same hyperparameters as the English ELECTRA models. Our models are available on HuggingFace Transformers and can be used with both PyTorch and TensorFlow. These models were released as part of Cruz et al. (2020).

Discriminator Models

Generator Models

The models can be loaded using the code below:

from transformers import TFAutoModel, AutoModel, AutoTokenizer

# TensorFlow
model = TFAutoModel.from_pretrained('jcblaise/electra-tagalog-small-cased-generator', from_pt=True)
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-small-cased-generator')

# PyTorch
model = AutoModel.from_pretrained('jcblaise/electra-tagalog-small-cased-generator')
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-small-cased-generator')

Pretrained BERT Models

We release four Tagalog BERT Base models and one Tagalog DistilBERT Base model. All models use the same configurations as the original English BERT models. Our models are available on HuggingFace Transformers and can be used with both PyTorch and TensorFlow. These models were released as part of Cruz & Cheng (2019).

The models can be loaded using the code below:

from transformers import TFAutoModel, AutoModel, AutoTokenizer

# TensorFlow
model = TFAutoModel.from_pretrained('jcblaise/bert-tagalog-base-cased', from_pt=True)
tokenizer = AutoTokenizer.from_pretrained('jcblaise/bert-tagalog-base-cased')

# PyTorch
model = AutoModel.from_pretrained('jcblaise/bert-tagalog-base-cased')
tokenizer = AutoTokenizer.from_pretrained('jcblaise/bert-tagalog-base-cased')

Other Pretrained Models

  • ULMFiT-Tagalog download
    Tagalog pretrained AWD-LSTM compatible with v2 of the FastAI library. Originally published in Velasco (2020).

  • ULMFiT-Tagalog (Old) download
    Tagalog pretrained AWD-LSTM compatible with the FastAI library. Originally published in Cruz & Cheng (2019).

Citations

If you found our work useful, please make sure to cite!

@article{cruz2020investigating,
  title={Investigating the True Performance of Transformers in Low-Resource Languages: A Case Study in Automatic Corpus Creation}, 
  author={Jan Christian Blaise Cruz and Jose Kristian Resabal and James Lin and Dan John Velasco and Charibeth Cheng},
  journal={arXiv preprint arXiv:2010.11574},
  year={2020}
}

@article{cruz2020establishing,
  title={Establishing Baselines for Text Classification in Low-Resource Languages},
  author={Cruz, Jan Christian Blaise and Cheng, Charibeth},
  journal={arXiv preprint arXiv:2005.02068},
  year={2020}
}

@article{cruz2019evaluating,
  title={Evaluating Language Model Finetuning Techniques for Low-resource Languages},
  author={Cruz, Jan Christian Blaise and Cheng, Charibeth},
  journal={arXiv preprint arXiv:1907.00409},
  year={2019}
}

Related Repositories

Repositories for adjacent projects and papers made by our team that use the models found here:

Contributions and Acknowledgements

Should you find any bugs or have any suggestions, feel free to drop by the Issues tab! We'll get back to you as soon as we can.

This repository is managed by the De La Salle University Machine Learning Group.
