dumitrescustefan / Romanian-Transformers

License: MIT
This repo is the home of Romanian Transformers.

Projects that are alternatives to or similar to Romanian-Transformers

bert-movie-reviews-sentiment-classifier
Build a Movie Reviews Sentiment Classifier with Google's BERT Language Model
Stars: ✭ 12 (-80%)
Mutual labels:  language-model, bert
Tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Stars: ✭ 5,077 (+8361.67%)
Mutual labels:  language-model, bert
wechsel
Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.
Stars: ✭ 39 (-35%)
Mutual labels:  language-model, bert
Nlp chinese corpus
Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+10993.33%)
Mutual labels:  language-model, bert
Clue
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+3941.67%)
Mutual labels:  language-model, bert
Keras Bert
Implementation of BERT that could load official pre-trained models for feature extraction and prediction
Stars: ✭ 2,264 (+3673.33%)
Mutual labels:  language-model, bert
Bert Pytorch
Google AI 2018 BERT pytorch implementation
Stars: ✭ 4,642 (+7636.67%)
Mutual labels:  language-model, bert
Transformers
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Stars: ✭ 55,742 (+92803.33%)
Mutual labels:  language-model, bert
Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+5581.67%)
Mutual labels:  language-model, bert
Awesome Sentence Embedding
A curated list of pretrained sentence and word embedding models
Stars: ✭ 1,973 (+3188.33%)
Mutual labels:  language-model, bert
backprop
Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
Stars: ✭ 229 (+281.67%)
Mutual labels:  language-model, bert
FasterTransformer
Transformer related optimization, including BERT, GPT
Stars: ✭ 1,571 (+2518.33%)
Mutual labels:  bert
verbecc
Complete Conjugation of any Verb using Machine Learning for French, Spanish, Portuguese, Italian and Romanian
Stars: ✭ 45 (-25%)
Mutual labels:  romanian-language
oreilly-bert-nlp
This repository contains code for the O'Reilly Live Online Training for BERT
Stars: ✭ 19 (-68.33%)
Mutual labels:  bert
bert for corrector
Chinese text error correction based on BERT
Stars: ✭ 199 (+231.67%)
Mutual labels:  bert
banglabert
This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla" accepted in Findings of the Annual Conference of the North American Chap…
Stars: ✭ 186 (+210%)
Mutual labels:  bert
ml
machine learning
Stars: ✭ 29 (-51.67%)
Mutual labels:  language-model
mlp-gpt-jax
A GPT, made only of MLPs, in Jax
Stars: ✭ 53 (-11.67%)
Mutual labels:  language-model
question generator
An NLP system for generating reading comprehension questions
Stars: ✭ 188 (+213.33%)
Mutual labels:  bert
Transformer-QG-on-SQuAD
Implement Question Generator with SOTA pre-trained Language Models (RoBERTa, BERT, GPT, BART, T5, etc.)
Stars: ✭ 28 (-53.33%)
Mutual labels:  bert

Romanian Transformers

This repo is meant as a space to centralize Romanian Transformers and to provide a uniform evaluation. Contributions are welcome.

We're using HuggingFace's Transformers lib, an awesome tool for NLP. What's BERT you ask? Here's a clear and condensed article about what BERT is and what it can do. Also check out this summary of different transformer models.

What follows is the list of Romanian transformer models, covering both masked and causal (generative) language models.

Feel free to open an issue and add your model/eval here!

Masked Language Models (MLMs)

| Model | Type | Size | Article/Citation/Source | Pre-trained / Fine-tuned | Release Date |
|---|---|---|---|---|---|
| dumitrescustefan/bert-base-romanian-cased-v1 | BERT | 124M | PDF / Cite | Pre-trained | Apr 2020 |
| dumitrescustefan/bert-base-romanian-uncased-v1 | BERT | 124M | PDF / Cite | Pre-trained | Apr 2020 |
| racai/distilbert-base-romanian-cased | DistilBERT | 81M | - | Pre-trained | Apr 2021 |
| readerbench/RoBERT-small | BERT | 19M | PDF | Pre-trained | May 2021 |
| readerbench/RoBERT-base | BERT | 114M | PDF | Pre-trained | May 2021 |
| readerbench/RoBERT-large | BERT | 341M | PDF | Pre-trained | May 2021 |
| dumitrescustefan/bert-base-romanian-ner | BERT | 124M | HF Space | Fine-tuned: Named Entity Recognition on RONECv2 | Jan 2022 |
| snisioi/bert-legal-romanian-cased-v1 | BERT | 124M | - | Fine-tuned: legal documents (MARCELLv2) | Jan 2022 |
| readerbench/jurBERT-base | BERT | 111M | PDF | Fine-tuned: legal documents | Oct 2021 |
| readerbench/jurBERT-large | BERT | 337M | PDF | Fine-tuned: legal documents | Oct 2021 |
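
Any of the pre-trained masked models above can be tried directly with a fill-mask pipeline. A minimal sketch (the example sentence is arbitrary; the [MASK] token matches these models' BERT tokenizers):

from transformers import pipeline

# fill-mask sketch using one of the pre-trained MLMs listed above
fill_mask = pipeline("fill-mask", model="dumitrescustefan/bert-base-romanian-cased-v1")
print(fill_mask("Capitala României este [MASK]."))  # "The capital of Romania is [MASK]."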

Generative Language Models (CLMs)

| Model | Type | Size | Article/Citation/Source | Pre-trained / Fine-tuned | Release Date |
|---|---|---|---|---|---|
| dumitrescustefan/gpt-neo-romanian-780m | GPT-Neo | 780M | not yet / HF Space | Pre-trained | Sep 2022 |
| readerbench/RoGPT2-base | GPT2 | 124M | PDF | Pre-trained | Jul 2021 |
| readerbench/RoGPT2-medium | GPT2 | 354M | PDF | Pre-trained | Jul 2021 |
| readerbench/RoGPT2-large | GPT2 | 774M | PDF | Pre-trained | Jul 2021 |

NEW: Check out this HF Space to play with Romanian generative models: https://huggingface.co/spaces/dumitrescustefan/romanian-text-generation

Model evaluation

Models are evaluated using the public Colab script available here. All reported results are the average of 5 runs using the same parameters. For larger models, where possible, a larger batch size was simulated by accumulating gradients, so that all models have the same effective batch size. Only standard models (not fine-tuned for a particular task) that fit in 16GB of RAM are evaluated.
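
For reference, simulating a larger batch via gradient accumulation works roughly as follows (a minimal PyTorch sketch, not the actual Colab code; model, optimizer and train_loader are assumed to exist, and accumulation_steps is illustrative):

accumulation_steps = 4  # effective batch size = loader batch size x 4

optimizer.zero_grad()
for step, batch in enumerate(train_loader):
    loss = model(**batch).loss / accumulation_steps  # scale so the summed gradients average out
    loss.backward()                                  # gradients accumulate in .grad across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # one optimizer update per 4 micro-batches
        optimizer.zero_grad()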

The tests cover the following fields, and, for brevity, we select a single metric from each field:

  • Named Entity Recognition: on RONECv2 we measure the strict-match score on the test set. A model must correctly detect whether a word belongs to an entity and tag it with the correct class.
  • Part of Speech Tagging: on ro-pos-tagger we measure the UPOS F1 score on the test set. This test should reveal how well a model understands the language's structure.
  • Semantic Textual Similarity: on RO-STS we measure the Pearson correlation coefficient on the test set. Given two sentences, the model must predict a continuous score for how similar in meaning they are, which is then correlated with human judgments. This test should highlight how well a model can embed the meaning of a sentence.
  • Emotion Detection: on REDv2, emotion detection in Romanian tweets, we measure the Hamming loss on the test set in the multi-label classification setting (lower is better). This test should show how well a model can "understand" emotions in short texts.
  • Perplexity: on wiki-ro's test split, we measure the perplexity of CLM models only, with a stride of 512 and a batch size of 4 (see the sketch after this list).
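
Strided perplexity follows the usual sliding-window recipe; a sketch along the lines of the standard HuggingFace approach (texts stands in for the wiki-ro test documents, which you would load yourself):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoGPT2-base")
model = AutoModelForCausalLM.from_pretrained("readerbench/RoGPT2-base")
model.eval()

encodings = tokenizer("\n\n".join(texts), return_tensors="pt")  # texts: wiki-ro test split (assumed loaded)
max_length = model.config.n_positions  # GPT2 context window (1024)
stride = 512
seq_len = encodings.input_ids.size(1)

nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    target_len = end - prev_end                 # only score tokens not already scored
    input_ids = encodings.input_ids[:, begin:end]
    target_ids = input_ids.clone()
    target_ids[:, :-target_len] = -100          # -100 = ignore context-only tokens in the loss
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nlls.append(loss * target_len)
    prev_end = end
    if end == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).sum() / prev_end)  # exponentiated average negative log-likelihood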

MLM model evaluation

| Model | Type | Size | NER / EM_strict | RO-STS / Pearson | ro-pos-tagger / UPOS F1 | REDv2 / Hamming loss (↓) |
|---|---|---|---|---|---|---|
| dumitrescustefan/bert-base-romanian-cased-v1 | BERT | 124M | 0.8815 | 0.7966 | 0.982 | 0.1039 |
| dumitrescustefan/bert-base-romanian-uncased-v1 | BERT | 124M | 0.8572 | 0.8149 | 0.9826 | 0.1038 |
| racai/distilbert-base-romanian-cased | DistilBERT | 81M | 0.8573 | 0.7285 | 0.9637 | 0.1119 |
| readerbench/RoBERT-small | BERT | 19M | 0.8512 | 0.7827 | 0.9794 | 0.1085 |
| readerbench/RoBERT-base | BERT | 114M | 0.8768 | 0.8102 | 0.9819 | 0.1041 |

CLM model evaluation

| Model | Type | Size | NER / EM_strict | RO-STS / Pearson | ro-pos-tagger / UPOS F1 | REDv2 / Hamming loss (↓) | Perplexity (↓) |
|---|---|---|---|---|---|---|---|
| readerbench/RoGPT2-base | GPT2 | 124M | 0.6865 | 0.7963 | 0.9009 | 0.1068 | 52.34 |
| readerbench/RoGPT2-medium | GPT2 | 354M | 0.7123 | 0.7979 | 0.9098 | 0.114 | 31.26 |

What you can do with these models

Using HuggingFace's Transformers library, instantiate a model and replace the model name as necessary, then use an appropriate model head for your task. Here are a few examples:

Get token embeddings
from transformers import AutoTokenizer, AutoModel
import torch

# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

# tokenize a sentence and run through the model
input_ids = tokenizer.encode("Acesta este un test.", add_special_tokens=True, return_tensors="pt")
outputs = model(input_ids)

# get the token embeddings
last_hidden_states = outputs.last_hidden_state  # shape: (batch_size, seq_len, hidden_size)
  • For dumitrescustefan/* models, remember to correct the ș/ț diacritics in your text before feeding it to the model. These models were trained only with the correct comma-below diacritics, so the cedilla variants ş and ţ will be seen as UNKs and will degrade overall performance:
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
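
If you need one vector per sentence rather than per token, a common (if simple) approach is mean pooling over the last hidden states, using the attention mask to ignore padding. A sketch, reusing the tokenizer and model loaded above:

# mean-pool token embeddings into a single sentence vector (sketch)
inputs = tokenizer("Acesta este un test.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state             # (1, seq_len, hidden_size)
mask = inputs["attention_mask"].unsqueeze(-1)              # (1, seq_len, 1)
sentence_embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # average over real tokens only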

Write text with generative models

Give a prompt to a generative model and let it write:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/gpt-neo-romanian-125m")
model = AutoModelForCausalLM.from_pretrained("dumitrescustefan/gpt-neo-romanian-125m")

# encode the prompt
input_ids = tokenizer.encode("Cine a fost Mihai Eminescu? A fost", return_tensors="pt")

# sample a continuation with top-k/top-p sampling, avoiding repeated bigrams
text = model.generate(input_ids, max_length=128, do_sample=True, no_repeat_ngram_size=2, top_k=50, top_p=0.9, early_stopping=True)

print(tokenizer.decode(text[0], skip_special_tokens=True))
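
The same pattern works for the fine-tuned task models listed above. For instance, the NER model can be tried with a token-classification pipeline (a sketch; see the model card for the exact label scheme):

from transformers import pipeline

ner = pipeline("token-classification",
               model="dumitrescustefan/bert-base-romanian-ner",
               aggregation_strategy="simple")  # merge subword pieces into whole entities
print(ner("George Enescu s-a născut la Liveni."))  # "George Enescu was born in Liveni."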

P.S. You can test all generative models here: https://huggingface.co/spaces/dumitrescustefan/romanian-text-generation

Final note

  • While this repo started back in 2020 as an in-depth look at a single transformer model, in the express hope that more models would quickly follow, it turned out that training a good model is not that easy: it takes a lot of effort to curate the data and to get access to sufficient compute. So I feel it is no longer useful to list just a couple of models; it makes more impact to list all the Romanian-only models I could find that have a minimal level of performance and documentation. Here you go :)
  • This repo used to contain code to download and clean a Romanian corpus. I have removed that part, as OSCAR is now offered directly on HuggingFace (in a new version) and OPUS's API no longer works as it should (some manual filtering is now required, not to mention that new resources are constantly being added), so maintaining this code is not really feasible.
  • Please contribute to this repo with new Romanian models you might find, or with citations or updates to existing models.