sagorbrur / bangla-bert

Licence: MIT License
Bangla-Bert is a pretrained BERT model for the Bengali language.


Bangla BERT Base

It has been a long journey, but here is our Bangla-Bert! It is now available on the Hugging Face model hub.

Bangla-Bert-Base is a pretrained language model for Bengali, trained with the masked language modeling objective described in BERT and its GitHub repository.

NB: If you use this model for any NLP task, please share your evaluation results with us; we will add them here.

Download Model

Model             TF Version  PyTorch Version  Vocab
Bangla BERT Base  -----       Huggingface Hub  Vocab

Pretrain Corpus Details

The corpus was downloaded from two main sources:

After downloading these corpora, we preprocessed them into BERT format: one sentence per line, with an extra newline separating documents, as in the example below (a conversion sketch follows the example).

sentence 1
sentence 2

sentence 1
sentence 2
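
As a minimal sketch (not part of the original repository; file and variable names are illustrative), the conversion into this format can be written as:

documents = [
    ["sentence 1", "sentence 2"],
    ["sentence 1", "sentence 2"],
]

# Write one sentence per line; a blank line marks a document boundary,
# which is the input format Google's BERT pretraining code expects.
with open("pretrain_corpus.txt", "w", encoding="utf-8") as f:
    for doc in documents:
        for sentence in doc:
            f.write(sentence + "\n")
        f.write("\n")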

Building Vocab

We used the BNLP package to train a Bengali SentencePiece model with a vocabulary size of 102025, then converted the output vocab file to BERT format. Our final vocab file is available at https://github.com/sagorbrur/bangla-bert and also on the Hugging Face model hub.
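
For illustration, the underlying SentencePiece training step looks roughly like this; BNLP wraps the sentencepiece library, and the file names here are assumptions rather than the ones actually used:

import sentencepiece as spm

# Train a SentencePiece model on the pretraining corpus with the
# 102025-token vocabulary mentioned above.
spm.SentencePieceTrainer.train(
    input="pretrain_corpus.txt",   # hypothetical corpus file
    model_prefix="bangla_sp",      # hypothetical output prefix
    vocab_size=102025,
)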

Training Details

  • Bangla-Bert was trained with the code provided in Google BERT's GitHub repository (https://github.com/google-research/bert)
  • The currently released model follows the bert-base-uncased architecture (12-layer, 768-hidden, 12-heads, 110M parameters); see the config sketch after this list
  • Total training steps: 1 million
  • The model was trained on a single Google Cloud TPU
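
For reference, and assuming the Hugging Face transformers API rather than the original BERT code, the architecture above corresponds to a config along these lines (a sketch, not the shipped config file):

from transformers import BertConfig

# bert-base shape with the 102025-token SentencePiece vocab built above;
# all other hyperparameters keep the transformers defaults.
config = BertConfig(
    vocab_size=102025,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
)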

Evaluation Results

LM Evaluation Results

After training for 1 million steps, here are the evaluation results.

global_step = 1000000
loss = 2.2406516
masked_lm_accuracy = 0.60641736
masked_lm_loss = 2.201459
next_sentence_accuracy = 0.98625
next_sentence_loss = 0.040997364
perplexity = numpy.exp(2.2406516) ≈ 9.3995
Loss for final step: 2.426227

Downstream Task Evaluation Results

  • Evaluation on Bengali Classification Benchmark Datasets

Huge thanks to Nick Doiron for providing the evaluation results for the classification task. He used the Bengali Classification Benchmark datasets. Compared to Nick's Bengali ELECTRA and multilingual BERT, Bangla BERT Base achieves state-of-the-art results. Here is the evaluation script. Check the comparison between Bangla-BERT and other recent Bengali BERT models here.

Model             Sentiment Analysis  Hate Speech Task  News Topic Task  Average
mBERT             68.15               52.32             72.27            64.25
Bengali Electra   69.19               44.84             82.33            65.45
Bangla BERT Base  70.37               71.83             89.19            77.13
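
As a hedged sketch of how such a classification setup can be loaded with transformers (this is not Nick's evaluation script; num_labels is illustrative and depends on the dataset):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load Bangla-BERT with a freshly initialized classification head.
tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "sagorsarker/bangla-bert-base",
    num_labels=3,  # set to the class count of the target dataset
)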

We also evaluated Bangla-BERT-Base on the WikiANN Bengali NER dataset alongside three other benchmark models (mBERT, XLM-R, and Indic-BERT).
After training each model for 5 epochs, Bangla-BERT-Base placed third, behind mBERT in first and XLM-R in second.

Base Pre-trained Model  F1 Score  Accuracy
mBERT-uncased           97.11     97.68
XLM-R                   96.22     97.03
Indic-BERT              92.66     94.74
Bangla-BERT-Base        95.57     97.49

All four models were trained with the transformers token-classification notebook. You can find all model evaluation results here.
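
A minimal loading sketch for this kind of token-classification setup, assuming the standard transformers API (this is not the notebook itself; num_labels=7 assumes WikiANN's IOB2 tag set):

from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load Bangla-BERT with a token-classification (NER) head.
tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
model = AutoModelForTokenClassification.from_pretrained(
    "sagorsarker/bangla-bert-base",
    num_labels=7,  # O plus B-/I- tags for PER, ORG, LOC in WikiANN
)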

You can also check the paper list below; these papers evaluated this model on their own datasets.

NB: If you use this model for any NLP task, please share your evaluation results with us; we will add them here.

Visualize Bangla BERT

You can explore Bangla BERT's attention patterns with bertviz.

How to Use

Bangla BERT Tokenizer

from transformers import AutoTokenizer

bnbert_tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
text = "আমি বাংলায় গান গাই।"
bnbert_tokenizer.tokenize(text)
# ['আমি', 'বাংলা', '##য', 'গান', 'গাই', '।']
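
Beyond tokenization, here is a minimal sketch (not from the original README) of loading the model itself to obtain contextual embeddings:

import torch
from transformers import AutoModel, AutoTokenizer

bnbert_tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
model = AutoModel.from_pretrained("sagorsarker/bangla-bert-base")

# Encode a sentence and read off the last-layer hidden states.
inputs = bnbert_tokenizer("আমি বাংলায় গান গাই।", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])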

MASK Generation

You can use this model directly with a pipeline for masked language modeling:

from transformers import BertForMaskedLM, BertTokenizer, pipeline

model = BertForMaskedLM.from_pretrained("sagorsarker/bangla-bert-base")
tokenizer = BertTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
for pred in nlp(f"আমি বাংলায় {nlp.tokenizer.mask_token} গাই।"):
  print(pred)

# {'sequence': '[CLS] আমি বাংলায গান গাই । [SEP]', 'score': 0.13404667377471924, 'token': 2552, 'token_str': 'গান'}

Author

Sagor Sarker

Acknowledgements

  • Thanks to the Google TensorFlow Research Cloud (TFRC) for providing free TPU credits
  • Thanks to everyone around us who is always helping us build something for Bengali

Reference

Citation

If you find this model helpful, please cite it as follows.

@misc{Sagor_2020,
  title  = {BanglaBERT: Bengali Mask Language Model for Bengali Language Understanding},
  author = {Sagor Sarker},
  year   = {2020},
  url    = {https://github.com/sagorbrur/bangla-bert}
}
