BanglaBERT

This repository contains the official release of the model "BanglaBERT" and the associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla", accepted in Findings of the Association for Computational Linguistics: NAACL 2022.

Models

The pretrained model checkpoints are available at the Hugging Face model hub.

To use these models for the supported downstream tasks in this repository, see Training & Evaluation.

Note: These models were pretrained using a specific normalization pipeline available here. All finetuning scripts in this repository use this normalization by default. If you need to adapt the pretrained model for a different task, make sure the text units are normalized using this pipeline before tokenizing to get the best results. A basic example is available at the model page.
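The ordering described in the note (normalize first, then tokenize) can be sketched as follows. This is a minimal illustration, not the example from the model page: `prepare` and `banglabert_tokens` are hypothetical helper names, and the `normalizer` package import and the `csebuetnlp/banglabert` model identifier are assumptions about how the released normalizer and checkpoint are exposed.

```python
from typing import Callable, List


def prepare(
    text: str,
    normalize_fn: Callable[[str], str],
    tokenize_fn: Callable[[str], List[str]],
) -> List[str]:
    """Apply normalization first, then tokenize the normalized text."""
    return tokenize_fn(normalize_fn(text))


def banglabert_tokens(text: str, model_name: str = "csebuetnlp/banglabert") -> List[str]:
    """Tokenize `text` after normalizing it (hypothetical helper).

    Imports are done lazily so that `prepare` stays usable without these
    packages. Both the `normalizer` module and `model_name` are assumptions.
    """
    from normalizer import normalize  # assumed: the csebuetnlp normalizer package
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return prepare(text, normalize, tokenizer.tokenize)
```

Keeping the normalizer and tokenizer as arguments to `prepare` makes it easy to swap in a different tokenizer while still guaranteeing that normalization runs first.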

Datasets

We are also releasing the Bangla Natural Language Inference (NLI) and Bangla Question Answering (QA) datasets introduced in the paper.

Setup

To install the necessary requirements, use the following bash snippet:

$ git clone https://github.com/csebuetnlp/banglabert
$ cd banglabert/
$ conda create python==3.7.9 pytorch==1.8.1 torchvision==0.9.1 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch -p ./env
$ conda activate ./env # or source activate ./env (for older versions of anaconda)
$ bash setup.sh 
  • Use the newly created environment for running the scripts in this repository.
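After activating the environment, a quick sanity check of the pinned versions may help catch a wrong interpreter or a mismatched PyTorch build. The sketch below is not part of the repository; `EXPECTED`, `installed_versions`, and `mismatches` are hypothetical names, and the pins are copied from the `conda create` command above.

```python
import sys
from typing import Dict, List

# Version pins copied from the conda command in the setup snippet.
EXPECTED = {
    "python": "3.7.9",
    "torch": "1.8.1",
    "torchvision": "0.9.1",
    "torchaudio": "0.8.0",
}


def installed_versions() -> Dict[str, str]:
    """Collect the interpreter version and the versions of the pinned packages."""
    found = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in ("torch", "torchvision", "torchaudio"):
        try:
            module = __import__(name)
            # Drop local build tags such as "+cu102" before comparing.
            found[name] = getattr(module, "__version__", "unknown").split("+")[0]
        except ImportError:
            found[name] = "missing"
    return found


def mismatches(expected: Dict[str, str], found: Dict[str, str]) -> List[str]:
    """Return one message per package whose version differs from the pin."""
    return [
        "{}: expected {}, found {}".format(name, version, found.get(name, "missing"))
        for name, version in expected.items()
        if found.get(name, "missing") != version
    ]
```

Calling `mismatches(EXPECTED, installed_versions())` inside the activated environment should return an empty list when everything matches the pins.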

Training & Evaluation

To use the pretrained model for finetuning / inference on different downstream tasks, see the following sections:

  • Sequence Classification.
    • For single sequence classification such as
      • Document classification
      • Sentiment classification
      • Emotion classification etc.
    • For double sequence classification such as
      • Natural Language Inference (NLI)
      • Paraphrase detection etc.
  • Token Classification.
    • For token tagging / classification tasks such as
      • Named Entity Recognition (NER)
      • Parts of Speech Tagging (PoS) etc.
  • Question Answering.
    • For tasks such as
      • Extractive Question Answering
      • Open-domain Question Answering
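For readers who want to load the checkpoint outside the repository's own scripts, the three task families above correspond roughly to three `Auto*` model classes in the Hugging Face `transformers` library. The sketch below is an assumption-laden illustration rather than the repository's finetuning code: `TASK_TO_AUTO_CLASS`, `load_model`, and `encode` are hypothetical names, and `csebuetnlp/banglabert` is the assumed model identifier.

```python
from typing import Any, Optional

# Task family -> name of the corresponding `transformers` Auto class.
TASK_TO_AUTO_CLASS = {
    "sequence_classification": "AutoModelForSequenceClassification",
    "token_classification": "AutoModelForTokenClassification",
    "question_answering": "AutoModelForQuestionAnswering",  # extractive (span) head
}


def load_model(task: str, model_name: str = "csebuetnlp/banglabert", **kwargs: Any):
    """Load the checkpoint with a task-specific head (hypothetical helper)."""
    import transformers  # imported lazily; requires the `transformers` package

    auto_class = getattr(transformers, TASK_TO_AUTO_CLASS[task])
    return auto_class.from_pretrained(model_name, **kwargs)


def encode(tokenizer: Any, first: str, second: Optional[str] = None):
    """Encode a single sequence, or a pair such as (premise, hypothesis) for NLI."""
    # A Hugging Face tokenizer accepts an optional second text as `text_pair`.
    return tokenizer(first, second, truncation=True)
```

For example, `load_model("sequence_classification", num_labels=3)` would attach a three-way classification head, which is the usual shape for NLI; the label count here is illustrative, not taken from the repository.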

Benchmarks

  • Zero-shot cross-lingual transfer-learning

| Model | Params | SC (macro-F1) | NLI (accuracy) | NER (micro-F1) | QA (EM/F1) | BangLUE score |
|---|---|---|---|---|---|---|
| mBERT | 180M | 27.05 | 62.22 | 39.27 | 59.01/64.18 | 50.35 |
| XLM-R (base) | 270M | 42.03 | 72.18 | 45.37 | 55.03/61.83 | 55.29 |
| XLM-R (large) | 550M | 49.49 | 78.13 | 56.48 | 71.13/77.70 | 66.59 |
| BanglishBERT | 110M | 48.39 | 75.26 | 55.56 | 72.87/78.63 | 66.14 |

  • Supervised fine-tuning

| Model | Params | SC (macro-F1) | NLI (accuracy) | NER (micro-F1) | QA (EM/F1) | BangLUE score |
|---|---|---|---|---|---|---|
| mBERT | 180M | 67.59 | 75.13 | 68.97 | 67.12/72.64 | 70.29 |
| XLM-R (base) | 270M | 69.54 | 78.46 | 73.32 | 68.09/74.27 | 72.82 |
| XLM-R (large) | 550M | 70.97 | 82.40 | 78.39 | 73.15/79.06 | 76.79 |
| sahajBERT | 18M | 71.12 | 76.92 | 70.94 | 65.48/70.69 | 71.03 |
| BanglishBERT | 110M | 70.61 | 80.95 | 76.28 | 72.43/78.40 | 75.73 |
| BanglaBERT | 110M | 72.89 | 82.80 | 77.78 | 72.63/79.34 | 77.09 |
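This README does not define how the BangLUE score is computed. As an observation from the numbers above, the reported score for most rows matches the plain average of the five figures (SC, NLI, NER, QA EM, QA F1), rounded to two decimals. The sketch below checks this for five supervised rows; `ROWS`, `REPORTED`, and `five_way_mean` are hypothetical names, and the averaging rule is an inference from the table, not a definition given by the authors. The XLM-R (base) supervised row is left out because the same calculation gives 72.74 rather than the reported 72.82.

```python
from statistics import mean
from typing import Sequence

# Supervised fine-tuning rows copied from the table above,
# in the order: SC, NLI, NER, QA exact match, QA F1.
ROWS = {
    "mBERT": (67.59, 75.13, 68.97, 67.12, 72.64),
    "XLM-R (large)": (70.97, 82.40, 78.39, 73.15, 79.06),
    "sahajBERT": (71.12, 76.92, 70.94, 65.48, 70.69),
    "BanglishBERT": (70.61, 80.95, 76.28, 72.43, 78.40),
    "BanglaBERT": (72.89, 82.80, 77.78, 72.63, 79.34),
}

# BangLUE scores as reported in the same table.
REPORTED = {
    "mBERT": 70.29,
    "XLM-R (large)": 76.79,
    "sahajBERT": 71.03,
    "BanglishBERT": 75.73,
    "BanglaBERT": 77.09,
}


def five_way_mean(scores: Sequence[float]) -> float:
    """Unweighted mean of the five per-task figures, rounded to two decimals."""
    return round(mean(scores), 2)


for model, scores in ROWS.items():
    print(model, five_way_mean(scores), REPORTED[model])
```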

The benchmarking datasets are as follows:

Acknowledgements

We would like to thank Intelligent Machines and Google TFRC Program for providing cloud support for pretraining the models.

License

Contents of this repository are restricted to non-commercial research purposes only under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

Citation

If you use any of the datasets, models or code modules, please cite the following paper:

@inproceedings{bhattacharjee-etal-2022-banglabert,
    title = "{B}angla{BERT}: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in {B}angla",
    author = "Bhattacharjee, Abhik  and
      Hasan, Tahmid  and
      Ahmad, Wasi  and
      Mubasshir, Kazi Samin  and
      Islam, Md Saiful  and
      Iqbal, Anindya  and
      Rahman, M. Sohel  and
      Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-naacl.98",
    pages = "1318--1327",
    abstract = "In this work, we introduce BanglaBERT, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed {`}Bangla2B+{'}) by crawling 110 popular Bangla sites. We introduce two downstream task datasets on natural language inference and question answering and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction. In the process, we bring them under the first-ever Bangla Language Understanding Benchmark (BLUB). BanglaBERT achieves state-of-the-art results outperforming multilingual and monolingual models. We are making the models, datasets, and a leaderboard publicly available at \url{https://github.com/csebuetnlp/banglabert} to advance Bangla NLP.",
}