neuralmind-ai / Portuguese Bert

Licence: other
Portuguese pre-trained BERT models

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Portuguese Bert

Nlp Python Deep Learning
NLP in Python with Deep Learning
Stars: ✭ 374 (-8.56%)
Mutual labels:  natural-language-processing
Tf Seq2seq
Sequence to sequence learning using TensorFlow.
Stars: ✭ 387 (-5.38%)
Mutual labels:  natural-language-processing
Anlp19
Course repo for Applied Natural Language Processing (Spring 2019)
Stars: ✭ 402 (-1.71%)
Mutual labels:  natural-language-processing
Usc Ds Relationextraction
Distantly Supervised Relation Extraction
Stars: ✭ 378 (-7.58%)
Mutual labels:  natural-language-processing
Multiwoz
Source code for end-to-end dialogue model from the MultiWOZ paper (Budzianowski et al. 2018, EMNLP)
Stars: ✭ 384 (-6.11%)
Mutual labels:  natural-language-processing
Dl topics
List of DL topics and resources essential for cracking interviews
Stars: ✭ 392 (-4.16%)
Mutual labels:  natural-language-processing
Awesome Text Generation
A curated list of recent models of text generation and application
Stars: ✭ 370 (-9.54%)
Mutual labels:  natural-language-processing
Gnn4nlp Papers
A list of recent papers about Graph Neural Network methods applied in NLP areas.
Stars: ✭ 405 (-0.98%)
Mutual labels:  natural-language-processing
Transformers Tutorials
Github repo with tutorials to fine tune transformers for diff NLP tasks
Stars: ✭ 384 (-6.11%)
Mutual labels:  natural-language-processing
Projects
🪐 End-to-end NLP workflows from prototype to production
Stars: ✭ 397 (-2.93%)
Mutual labels:  natural-language-processing
Natural Language Processing
Programming Assignments and Lectures for Stanford's CS 224: Natural Language Processing with Deep Learning
Stars: ✭ 377 (-7.82%)
Mutual labels:  natural-language-processing
Nlpnet
A neural network architecture for NLP tasks, using cython for fast performance. Currently, it can perform POS tagging, SRL and dependency parsing.
Stars: ✭ 379 (-7.33%)
Mutual labels:  natural-language-processing
Sherlock
Natural-language event parser for Javascript
Stars: ✭ 393 (-3.91%)
Mutual labels:  natural-language-processing
Beginner nlp
A curated list of beginner resources in Natural Language Processing
Stars: ✭ 376 (-8.07%)
Mutual labels:  natural-language-processing
D2l Vn
An interactive deep learning book with source code, math, and discussion. Covers several popular frameworks (TensorFlow, PyTorch & MXNet) and is used at 175 universities.
Stars: ✭ 402 (-1.71%)
Mutual labels:  natural-language-processing
Data Science
Collection of useful data science topics along with code and articles
Stars: ✭ 315 (-22.98%)
Mutual labels:  natural-language-processing
My Cs Degree
A CS degree with a focus on full-stack ML engineering, 2020
Stars: ✭ 391 (-4.4%)
Mutual labels:  natural-language-processing
Reductio
Automatic text summarizer in Swift
Stars: ✭ 406 (-0.73%)
Mutual labels:  natural-language-processing
Ln2sql
A tool to query a database in natural language
Stars: ✭ 403 (-1.47%)
Mutual labels:  natural-language-processing
Neuronlp2
Deep neural models for core NLP tasks (Pytorch version)
Stars: ✭ 397 (-2.93%)
Mutual labels:  natural-language-processing

BERTimbau - Portuguese BERT

This repository contains pre-trained BERT models trained on the Portuguese language. The BERT-Base and BERT-Large Cased variants were trained on BrWaC (Brazilian Web as Corpus), a large Portuguese corpus, for 1,000,000 steps, using whole-word masking. Model artifacts for TensorFlow and PyTorch can be found below.
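Whole-word masking ties the masking decision to whole words rather than to individual WordPiece subtokens: whenever one piece of a word is selected, every piece of that word is masked. A minimal sketch of the idea (the token list and masking rate below are illustrative only, not the actual pre-training code):

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    """Mask whole words: a WordPiece continuation token ('##...')
    always shares the fate of the word it belongs to."""
    rng = random.Random(seed)
    # Group token indices into words: a '##' token continues the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = "[MASK]"  # mask every subtoken of the chosen word
    return masked

tokens = ["Bra", "##sil", "é", "um", "país", "tropical"]
print(whole_word_mask(tokens, mask_prob=0.5, seed=1))
```

Note that the two subtokens "Bra" and "##sil" are always masked together, which is the difference from the original per-token masking.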

The models are the result of an ongoing Master's program. The qualifying-exam text is also included in the repository in PDF format; it contains more details about the pre-training procedure, vocabulary generation, and downstream usage for the task of Named Entity Recognition.

Download

Model TensorFlow checkpoint PyTorch checkpoint Vocabulary
BERTimbau Base (aka bert-base-portuguese-cased) Download Download Download
BERTimbau Large (aka bert-large-portuguese-cased) Download Download Download

Evaluation benchmarks

The models were benchmarked on three tasks (Sentence Textual Similarity, Recognizing Textual Entailment and Named Entity Recognition) and compared to previously published results and to Multilingual BERT (mBERT). The metrics are Pearson's correlation for STS and F1-score for RTE and NER.

Task Test Dataset BERTimbau-Large BERTimbau-Base mBERT Previous SOTA
STS ASSIN2 0.852 0.836 0.809 0.83 [1]
RTE ASSIN2 90.0 89.2 86.8 88.3 [1]
NER MiniHAREM (5 classes) 83.7 83.1 79.2 82.3 [2]
NER MiniHAREM (10 classes) 78.5 77.6 73.1 74.6 [2]
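For reference, the two metrics in the table can be sketched in plain Python as below. The toy scores and labels are made up for illustration; note also that the RTE and NER numbers above are macro-averaged and entity-level F1 respectively, while this sketch shows only the basic binary definition:

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation, the STS metric above."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def f1(gold, pred, positive=1):
    """Binary F1-score: harmonic mean of precision and recall."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example: gold vs. predicted similarity scores, and binary entailment labels.
print(pearson([1.0, 2.5, 4.0, 5.0], [1.2, 2.0, 4.5, 4.8]))
print(f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```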

NER experiments code

Code and instructions to reproduce the Named Entity Recognition experiments are in the ner_evaluation/ directory.

PyTorch usage example

Our PyTorch artifacts are compatible with the 🤗 Hugging Face Transformers library and are also available as community models on the Model Hub:

from transformers import AutoModel, AutoTokenizer

# Using the community model
# BERT Base
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')
model = AutoModel.from_pretrained('neuralmind/bert-base-portuguese-cased')

# BERT Large
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-large-portuguese-cased')
model = AutoModel.from_pretrained('neuralmind/bert-large-portuguese-cased')

# or, using BertModel and BertTokenizer directly
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('path/to/vocab.txt', do_lower_case=False)
model = BertModel.from_pretrained('path/to/bert_dir')  # Or other BERT model class
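Once loaded, either model can be used as a sentence encoder. A minimal sketch of a forward pass (the example sentence is arbitrary, and running it downloads the Base weights from the Hub):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')
model = AutoModel.from_pretrained('neuralmind/bert-base-portuguese-cased')
model.eval()

inputs = tokenizer("Tinha uma pedra no meio do caminho.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape [batch, seq_len, hidden]; hidden size is 768 for Base.
cls_embedding = outputs.last_hidden_state[:, 0]
```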

Acknowledgement

We would like to thank Google for Cloud credits under a research grant that allowed us to train these models.

References

[1] Multilingual Transformer Ensembles for Portuguese Natural Language Tasks

[2] Assessing the Impact of Contextual Embeddings for Portuguese Named Entity Recognition

How to cite this work

@inproceedings{souza2020bertimbau,
    author    = {Souza, F{\'a}bio and Nogueira, Rodrigo and Lotufo, Roberto},
    title     = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
    booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
    year      = {2020}
}

@article{souza2019portuguese,
    title={Portuguese Named Entity Recognition using BERT-CRF},
    author={Souza, F{\'a}bio and Nogueira, Rodrigo and Lotufo, Roberto},
    journal={arXiv preprint arXiv:1909.10649},
    url={http://arxiv.org/abs/1909.10649},
    year={2019}
}