
Ermlab / PoLitBert

Licence: MIT license
Polish RoBERTa model trained on Polish literature, Wikipedia and Oscar. The major assumption is that good-quality text will give a good model.

Programming Languages

python
139335 projects - #7 most used programming language
Jupyter Notebook
11667 projects

Projects that are alternatives to or similar to PoLitBert

erc
Emotion recognition in conversation
Stars: ✭ 34 (+36%)
Mutual labels:  roberta
krnnt
Polish morphological tagger.
Stars: ✭ 33 (+32%)
Mutual labels:  polish
japanese-pretrained-models
Code for producing Japanese pretrained models provided by rinna Co., Ltd.
Stars: ✭ 484 (+1836%)
Mutual labels:  roberta
Albert zh
A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS; large-scale Chinese pre-trained ALBERT models
Stars: ✭ 3,500 (+13900%)
Mutual labels:  roberta
Bertviz
Tool for visualizing attention in the Transformer model (BERT, GPT-2, Albert, XLNet, RoBERTa, CTRL, etc.)
Stars: ✭ 3,443 (+13672%)
Mutual labels:  roberta
RoBERTaABSA
Implementation of paper Does syntax matter? A strong baseline for Aspect-based Sentiment Analysis with RoBERTa.
Stars: ✭ 112 (+348%)
Mutual labels:  roberta
koclip
KoCLIP: Korean port of OpenAI CLIP, in Flax
Stars: ✭ 80 (+220%)
Mutual labels:  roberta
roberta-wwm-base-distill
A RoBERTa-wwm-base model distilled from RoBERTa-wwm-large
Stars: ✭ 61 (+144%)
Mutual labels:  roberta
pl.javascript.info
Modern JavaScript Tutorial in Polish
Stars: ✭ 30 (+20%)
Mutual labels:  polish
Transformer-QG-on-SQuAD
Implement Question Generator with SOTA pre-trained Language Models (RoBERTa, BERT, GPT, BART, T5, etc.)
Stars: ✭ 28 (+12%)
Mutual labels:  roberta
Chinese Bert Wwm
Pre-Training with Whole Word Masking for Chinese BERT (Chinese BERT-wwm model series)
Stars: ✭ 6,357 (+25328%)
Mutual labels:  roberta
Clue
Chinese Language Understanding Evaluation Benchmark (CLUE): datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+9600%)
Mutual labels:  roberta
vietnamese-roberta
A Robustly Optimized BERT Pretraining Approach for Vietnamese
Stars: ✭ 22 (-12%)
Mutual labels:  roberta
CLUE pytorch
PyTorch version of the CLUE baselines
Stars: ✭ 72 (+188%)
Mutual labels:  roberta
RECCON
This repository contains the dataset and the PyTorch implementations of the models from the paper Recognizing Emotion Cause in Conversations.
Stars: ✭ 126 (+404%)
Mutual labels:  roberta
KLUE
📖 Korean NLU Benchmark
Stars: ✭ 420 (+1580%)
Mutual labels:  roberta
validate-polish
Utility library for validation of PESEL, NIP, REGON, identity card numbers etc. Aimed mostly at the Polish environment. [Polish] Validation of PESEL, NIP, REGON and identity card numbers.
Stars: ✭ 31 (+24%)
Mutual labels:  polish
Text-Summarization
Abstractive and Extractive Text summarization using Transformers.
Stars: ✭ 38 (+52%)
Mutual labels:  roberta
openroberta-lab
The programming environment »Open Roberta Lab« by Fraunhofer IAIS enables children and adolescents to program robots. A variety of different programming blocks are provided to program motors and sensors of the robot. Open Roberta Lab uses an approach of graphical programming so that beginners can seamlessly start coding. As a cloud-based applica…
Stars: ✭ 98 (+292%)
Mutual labels:  roberta
les-military-mrc-rank7
LES Cup (莱斯杯): Rank 7 solution for the 2nd national "Military Intelligence Machine Reading" challenge
Stars: ✭ 37 (+48%)
Mutual labels:  roberta

PoLitBert - Polish RoBERTa model

Polish RoBERTa model trained on Polish Wikipedia, Polish literature and Oscar. The major assumption is that good-quality text will give a good model.

We believe in open science and knowledge sharing, so we decided to share the complete code, parameters, experiment details and TensorBoards.


Experiments setup and goals

During the experiments, we want to examine:

  • the impact of different learning rate schedulers on training speed and accuracy (a minimal sketch of the tested schedule shapes follows this list); tested:
    • linear schedule with warmup
    • cyclic schedules: cosine, triangular
  • the impact of training time on final accuracy
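
For illustration only, the three schedule families could be sketched with plain PyTorch schedulers as below. The step counts and learning rates are made-up values, not the fairseq settings actually used (those are listed in the training table further down).

```python
# Minimal sketch (not the project's fairseq config) of the compared schedule families.
import torch
from torch.optim.lr_scheduler import LambdaLR, CyclicLR, CosineAnnealingWarmRestarts

WARMUP_STEPS = 10_000   # illustrative values only
TOTAL_STEPS = 50_000
PEAK_LR = 5e-4

def make_scheduler(optimizer, kind: str):
    """Return one of the scheduler types examined in the experiments."""
    if kind == "linear_warmup":
        # linear ramp-up for WARMUP_STEPS, then linear decay to zero
        def lr_lambda(step):
            if step < WARMUP_STEPS:
                return step / max(1, WARMUP_STEPS)
            return max(0.0, (TOTAL_STEPS - step) / max(1, TOTAL_STEPS - WARMUP_STEPS))
        return LambdaLR(optimizer, lr_lambda)
    if kind == "triangular":
        return CyclicLR(optimizer, base_lr=1e-7, max_lr=PEAK_LR,
                        step_size_up=WARMUP_STEPS, mode="triangular",
                        cycle_momentum=False)   # required when using Adam
    if kind == "cosine_restarts":
        # T_mult corresponds to the "mul" value in the training details table
        return CosineAnnealingWarmRestarts(optimizer, T_0=WARMUP_STEPS, T_mult=2)
    raise ValueError(kind)

model = torch.nn.Linear(8, 8)   # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=PEAK_LR)
scheduler = make_scheduler(optimizer, "linear_warmup")
# in the training loop: optimizer.step(); scheduler.step()  # once per update
```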

Data

Data processing for training

Our main assumption is that good-quality text should produce a good language model. So far the most popular Polish dataset has been the Polish Wikipedia dump; however, this text is characterized by formal language. The second source of text is the Polish part of the Oscar corpus, i.e. text crawled from the Polish internet. When we investigated this corpus in more detail, it turned out to contain many foreign sentences (in Russian, English, German, etc.), sentences that are too short, and ungrammatical sentences (such as bare word enumerations).

We prepared a few cleaning heuristics:

  • remove sentences shorter than a minimum length
  • remove non-Polish sentences
  • remove ungrammatical sentences (without verbs or with too many nouns)
  • perform sentence tokenization and save each sentence on a new line; an empty line was added after each document

Data was cleaned with the process_sentences.py script; the whole process is presented in the polish_process_data.ipynb notebook.
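
As an illustration, the kind of filter described above could look like the sketch below. This is not the actual process_sentences.py: the length threshold and the use of langdetect (listed under "Used libraries") as the language detector are assumptions, and the grammar check is only indicated as a placeholder.

```python
# Minimal sketch of the cleaning heuristics (not the actual process_sentences.py).
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

MIN_CHARS = 20  # assumed threshold - the real value is defined in the script

def is_valid_sentence(sentence: str) -> bool:
    sentence = sentence.strip()
    if len(sentence) < MIN_CHARS:          # drop too-short sentences
        return False
    try:
        if detect(sentence) != "pl":       # drop non-Polish sentences
            return False
    except LangDetectException:            # undetectable text (e.g. digits only)
        return False
    # the real pipeline also drops ungrammatical sentences
    # (no verb, too many nouns) - that check is omitted here
    return True

with open("corpus_raw.txt", encoding="utf-8") as src, \
     open("corpus_clean.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if not line.strip():               # keep empty lines as document separators
            dst.write("\n")
        elif is_valid_sentence(line):
            dst.write(line.strip() + "\n")
```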

Summary of Cleaned Polish Oscar corpus

File | All lines | All sentences | Invalid length sent. | Non-Polish sent. | Ungrammatical sent. | Valid sentences
corpus_oscar_2020-04-10_32M_lines.txt | 32 000 506 | 94 332 394 | 1 796 371 | 296 093 | 8 100 750 | 84 139 180
corpus_oscar_2020-04-10_64M_lines.txt | 32 000 560 | 96 614 563 | 1 777 586 | 491 789 | 7 869 507 | 86 475 681
corpus_oscar_2020-04-10_96M_lines.txt | 32 001 738 | 96 457 553 | 1 796 083 | 302 598 | 7 908 090 | 86 450 782
corpus_oscar_2020-04-10_128M_lines.txt | 32 002 212 | 97 761 040 | 1 919 071 | 305 924 | 7 891 846 | 87 644 199
corpus_oscar_2020-04-10_128M_above_lines.txt | 17 519 467 | 53 446 884 | 1 090 714 | 212 657 | 4 343 296 | 47 800 217

Training, testing dataset stats

Train corpus | Lines | Words | Characters
Polish Wikipedia (2020-03) | 11 748 343 | 181 560 313 | 1 309 416 493
Books | 81 140 395 | 829 404 801 | 5 386 053 287
Oscar (32M part, cleaned) | 112 466 497 | 1 198 735 834 | 8 454 177 161
Total | 205 355 235 | 2 209 700 948 | 15 149 646 941

For testing, we take ~10% of each corpus.

Test corpus | Lines | Words | Characters
Polish Wikipedia (2020-03) | 1 305 207 | 21 333 280 | 155 403 453
Books | 9 007 716 | 93 141 853 | 610 111 989
Oscar (32M part, cleaned) | 14 515 735 | 157 303 490 | 1 104 855 397
Total | 24 828 658 | 271 778 623 | 1 870 370 839

Training protocol for Polish RoBERTa with Fairseq

General recipe of the final data preparation and model training process:

  1. Prepare a huge text file data.txt (e.g. Wikipedia text) where each sentence is on a new line and each article is separated by two new lines.
  2. Take 10-15M lines and prepare another file for sentencepiece (the vocabulary builder) - again, with each sentence on its own line.
  3. Train the sentencepiece vocabulary and save it in the fairseq format as vocab.fairseq.txt.
  4. Encode data.txt with the trained sentencepiece model into data.sp.txt.
  5. Preprocess data.sp.txt with fairseq-preprocess.
  6. Run training.

Detailed data preparation steps for fairseq (vocab generation and binarization) are available in a separate notebook, polish_roberta_vocab.ipynb.

Commands needed to reproduce fairseq models with various training protocols may be found in polish_roberta_training.ipynb.
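
For illustration, a minimal sketch of steps 2-4 (training the sentencepiece vocabulary and encoding the corpus) is shown below. The file names mirror the recipe above, while the vocabulary size and other parameters are only examples; the exact settings are in the notebooks.

```python
# Minimal sketch of steps 3-4 (assumed parameters; see polish_roberta_vocab.ipynb
# for the settings actually used).
import sentencepiece as spm

# step 3: train the sentencepiece vocabulary on the sampled file from step 2
spm.SentencePieceTrainer.train(
    input="data_vocab_sample.txt",   # the 10-15M line sample
    model_prefix="polish_sp",
    vocab_size=32000,                # 32k and 50k variants were trained
    model_type="bpe",                # assumed model type
)
# (the notebook additionally converts the vocabulary to vocab.fairseq.txt)

# step 4: encode the whole corpus with the trained model
sp = spm.SentencePieceProcessor(model_file="polish_sp.model")
with open("data.txt", encoding="utf-8") as src, \
     open("data.sp.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if line.strip():
            dst.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")
        else:
            dst.write("\n")          # keep empty lines separating articles

# steps 5-6 are run with the fairseq CLI (fairseq-preprocess on data.sp.txt,
# then fairseq-train with the masked_lm task) - see polish_roberta_training.ipynb.
```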

Pretrained models and vocabs
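
The checkpoints are fairseq RoBERTa checkpoints, so loading them should follow the usual fairseq pattern. The sketch below is an assumed usage example, not code from this repository; the folder and file names are placeholders for the downloaded checkpoint, dictionary and sentencepiece files.

```python
# Assumed usage sketch for a downloaded checkpoint (paths are placeholders).
import torch
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained(
    "PoLitBert_v32k_linear_50k",       # folder containing the checkpoint and dict.txt
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path=".",
    bpe="sentencepiece",               # the models use a sentencepiece vocabulary;
                                       # recent fairseq versions pick up a file named
                                       # sentencepiece.bpe.model from the model folder
)
roberta.eval()

tokens = roberta.encode("Polska literatura jest piękna.")
with torch.no_grad():
    features = roberta.extract_features(tokens)   # shape: (1, seq_len, hidden_dim)
print(features.shape)
```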

KLEJ evaluation

All models were evaluated on 26.07.2020 on the 9 KLEJ benchmark tasks. The results below were achieved with the fine-tuning scripts from Polish RoBERTa without any further tweaks, which suggests that the potential of the models may not have been fully utilized yet.

Model | NKJP-NER | CDSC-E | CDSC-R | CBD | PolEmo2.0-IN | PolEmo2.0-OUT | DYK | PSC | AR | Avg
PoLitBert_v32k_linear_50k | 92.3 | 91.5 | 92.2 | 64.0 | 89.8 | 76.1 | 60.2 | 97.9 | 87.6 | 83.51
PoLitBert_v32k_linear_50k_2ep | 91.9 | 91.8 | 90.9 | 64.6 | 89.1 | 75.9 | 59.8 | 97.9 | 87.9 | 83.31
PoLitBert_v32k_tri_125k | 93.6 | 91.7 | 91.8 | 62.4 | 90.3 | 75.7 | 59.0 | 97.4 | 87.2 | 83.23
PoLitBert_v32k_linear_125k_2ep | 94.3 | 92.1 | 92.8 | 64.0 | 90.6 | 79.1 | 51.7 | 94.1 | 88.7 | 83.04
PoLitBert_v32k_tri_50k | 93.9 | 91.7 | 92.1 | 57.6 | 88.8 | 77.9 | 56.6 | 96.5 | 87.7 | 82.53
PoLitBert_v32k_linear_125k | 94.0 | 91.3 | 91.8 | 61.1 | 90.4 | 78.1 | 50.8 | 95.8 | 88.2 | 82.39
PoLitBert_v50k_linear_50k | 92.8 | 92.3 | 91.7 | 57.7 | 90.3 | 80.6 | 42.2 | 97.4 | 88.5 | 81.50
PoLitBert_v32k_cos1_2_50k | 92.5 | 91.6 | 90.7 | 60.1 | 89.5 | 73.5 | 49.1 | 95.2 | 87.5 | 81.08
PoLitBert_v32k_cos1_5_50k | 93.2 | 90.7 | 89.5 | 51.7 | 89.5 | 74.3 | 49.1 | 97.1 | 87.5 | 80.29

A comparison with other developed models is available in the continuously updated leaderboard of evaluation tasks.

Details of model training


Link to PoLitBert research log (same as below).

Experiment | Model name | Vocab size | Scheduler | BSZ | WPB | Steps | Train tokens | Train loss | Valid loss | Best (test) loss
#1 | PoLitBert_v32k_linear_50k (tensorboard) | 32k | linear decay | 8 192 | 4,07E+06 | 50 000 | 2,03E+11 | 1,502 | 1,460 | 1,422
#2 | PoLitBert_v32k_tri_50k (tensorboard) | 32k | triangular | 8 192 | 4,07E+06 | 50 000 | 2,03E+11 | 1,473 | 1,436 | 1,402
#3 | PoLitBert_v32k_cos1_50k (tensorboard) | 32k | cosine mul=1 | 8 192 | 4,07E+06 | 23 030 | 9,37E+10 | 10,930 | 11,000 | 1,832
#4 | PoLitBert_v32k_cos1_2_50k (tensorboard) | 32k | cosine mul=1 peak=0.0005 | 8 192 | 4,07E+06 | 50 000 | 2,03E+11 | 1,684 | 1,633 | 1,595
#5 | PoLitBert_v32k_cos1_3_50k (tensorboard) | 32k | cosine mul=2 | 8 192 | 4,07E+06 | 3 735 | 1,52E+10 | 10,930 | |
#6 | PoLitBert_v32k_cos1_4_50k (tensorboard) | 32k | cosine mul=2 grad-clip=0.9 | 8 192 | 4,07E+06 | 4 954 | 2,02E+10 | 10,910 | 10,940 | 2,470
#8 | PoLitBert_v32k_tri_125k (tensorboard) | 32k | triangular | 8 192 | 4,07E+06 | 125 000 | 5,09E+11 | 1,435 | 1,313 | 1,363
#9 | PoLitBert_v32k_cos1_5_50k (tensorboard) | 32k | cosine, mul=2, grad-clip=0.9 | 8 192 | 4,07E+06 | 125 000 | 5,09E+11 | 1,502 | 1,358 | 1,426
#10 | PoLitBert_v32k_linear_125k (tensorboard) | 32k | linear decay | 8 192 | 4,07E+06 | 125 000 | 5,09E+11 | 1,322 | 1,218 | 1,268
#11 | PoLitBert_v50k_linear_50k (tensorboard) | 50k | linear decay | 8 192 | 4,07E+06 | 50 000 | 2,04E+11 | 1,546 | 1,439 | 1,480

(BSZ - batch size in sequences, WPB - words per batch, as reported in the fairseq logs; decimal values use a comma as the decimal separator.)

Used libraries

Installation dependencies and problems

  • langdetect needs an additional system package:
    • install it with sudo apt-get install libicu-dev
  • sentencepiece was installed from source

Acknowledgements

This is the joint work of the companies Ermlab Software and Literacka.

Part of the work was financed by grant no. POIR.01.01.01-00-1213/19 from The Polish National Centre for Research and Development, whose beneficiary was Literacka. Project title: "Asystent wydawniczy" ("Publishing Assistant") - content analysis software that uses artificial intelligence algorithms to automate the publishing process and predict the market success of publications.

We would like to express our gratitude to the NVIDIA Inception Programme and Amazon AWS for providing free GPU credits - thank you!

Authors:

Also appreciate the help from

About Ermlab Software

Ermlab - Polish machine learning company

🦉 Website | :octocat: Repository

