sajjjadayobi / ParsBigBird

License: AGPL-3.0
Persian Bert For Long-Range Sequences

Programming Languages

Jupyter Notebook

Projects that are alternatives of or similar to ParsBigBird

wechsel
Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.
Stars: ✭ 39 (-32.76%)
Mutual labels:  transformers, transfer-learning, bert
oreilly-bert-nlp
This repository contains code for the O'Reilly Live Online Training for BERT
Stars: ✭ 19 (-67.24%)
Mutual labels:  transformers, transfer-learning, bert
backprop
Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
Stars: ✭ 229 (+294.83%)
Mutual labels:  transformers, transfer-learning, bert
Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+5777.59%)
Mutual labels:  transformers, transfer-learning, bert
nlp workshop odsc europe20
Extensive tutorials for the Advanced NLP Workshop in Open Data Science Conference Europe 2020. We will leverage machine learning, deep learning and deep transfer learning to learn and solve popular tasks using NLP including NER, Classification, Recommendation / Information Retrieval, Summarization, Language Translation, Q&A and T…
Stars: ✭ 127 (+118.97%)
Mutual labels:  transformers, transfer-learning
gpl
Powerful unsupervised domain adaptation method for dense retrieval. Requires only unlabeled corpus and yields massive improvement: "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval" https://arxiv.org/abs/2112.07577
Stars: ✭ 216 (+272.41%)
Mutual labels:  transformers, bert
Pytorch-NLU
Pytorch-NLU, a Chinese text classification and sequence annotation toolkit. It supports multi-class and multi-label classification of Chinese long and short texts, and sequence annotation tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.
Stars: ✭ 151 (+160.34%)
Mutual labels:  transformers, bert
Clue
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+4081.03%)
Mutual labels:  transformers, bert
question generator
An NLP system for generating reading comprehension questions
Stars: ✭ 188 (+224.14%)
Mutual labels:  transformers, bert
Transformers-Tutorials
This repository contains demos I made with the Transformers library by HuggingFace.
Stars: ✭ 2,828 (+4775.86%)
Mutual labels:  transformers, bert
label-studio-transformers
Label data using HuggingFace's transformers and automatically get a prediction service
Stars: ✭ 117 (+101.72%)
Mutual labels:  transformers, bert
Nlp Architect
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks
Stars: ✭ 2,768 (+4672.41%)
Mutual labels:  transformers, bert
Pytorch Sentiment Analysis
Tutorials on getting started with PyTorch and TorchText for sentiment analysis.
Stars: ✭ 3,209 (+5432.76%)
Mutual labels:  transformers, bert
Spark Nlp
State of the Art Natural Language Processing
Stars: ✭ 2,518 (+4241.38%)
Mutual labels:  transformers, bert
Text-Summarization
Abstractive and Extractive Text summarization using Transformers.
Stars: ✭ 38 (-34.48%)
Mutual labels:  transformers, bert
GoEmotions-pytorch
Pytorch Implementation of GoEmotions 😍😢😱
Stars: ✭ 95 (+63.79%)
Mutual labels:  transformers, bert
bert-squeeze
🛠️ Tools for Transformers compression using PyTorch Lightning ⚡
Stars: ✭ 56 (-3.45%)
Mutual labels:  transformers, bert
Fast Bert
Super easy library for BERT based NLP models
Stars: ✭ 1,678 (+2793.1%)
Mutual labels:  transformers, bert
anonymisation
Anonymization of legal cases (Fr) based on Flair embeddings
Stars: ✭ 85 (+46.55%)
Mutual labels:  transformers, bert
task-transferability
Data and code for our paper "Exploring and Predicting Transferability across NLP Tasks", to appear at EMNLP 2020.
Stars: ✭ 35 (-39.66%)
Mutual labels:  transfer-learning, bert

ParsBigBird: Persian Bert For Long-Range Sequences

BERT and ParsBERT can handle sequences of up to 512 tokens; however, many tasks, such as summarization and question answering, require longer texts. In this work, we trained the BigBird model for Farsi (Persian), using sparse attention to process texts of up to 4096 tokens.

[Figure: BigBird's attention block, from the BigBird paper]

Evaluation: 🌡️

We evaluated the model on three tasks with different sequence lengths:

Name                   | Params | SnappFood (F1) | Digikala Magazine (F1) | PersianQA (F1)
distil-bigbird-fa-zwnj | 78M    | 85.43%         | 94.05%                 | 73.34%
bert-base-fa           | 118M   | 87.98%         | 93.65%                 | 70.06%
  • Despite being only as large as DistilBERT, the model performs on par with ParsBERT and does much better on PersianQA, which requires much longer context.
  • This evaluation used max_length=2048 (it can be increased up to 4096); see the tokenization sketch below.
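
As an illustration only (this is not the authors' published evaluation script; the model name is taken from the usage section below), capping inputs at the evaluation length with the 🤗 tokenizer could look like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("SajjadAyoubi/distil-bigbird-fa-zwnj")
long_text = "..."  # any long Persian document
# truncate to the 2048-token evaluation length; max_length can be raised to 4096
tokens = tokenizer(long_text, truncation=True, max_length=2048, return_tensors="pt")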

How to use

As Contextualized Word Embedding

from transformers import BigBirdModel, AutoTokenizer

MODEL_NAME = "SajjadAyoubi/distil-bigbird-fa-zwnj"
# by default the attention_type is `block_sparse`, with block_size=32
model = BigBirdModel.from_pretrained(MODEL_NAME, block_size=32)
# for inputs no longer than 512 tokens, you can use full attention instead:
model = BigBirdModel.from_pretrained(MODEL_NAME, attention_type="original_full")

# translation: "😃 I hope the model turns out useful, because it took so long to train"
text = "😃 امیدوارم مدل بدردبخوری باشه چون خیلی طول کشید تا ترین بشه"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokens = tokenizer(text, return_tensors='pt')
output = model(**tokens) # contextualized embedding
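
If you need one vector per text rather than per-token embeddings, a common approach (a sketch continuing the snippet above, not something this repository prescribes) is attention-mask-aware mean pooling:

import torch

# average the contextualized token embeddings, ignoring padding positions
mask = tokens["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
summed = (output.last_hidden_state * mask).sum(dim=1)
sentence_embedding = summed / mask.sum(dim=1)  # (batch, hidden_size)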

As Fill-in-the-Blank (Fill-Mask)

from transformers import pipeline

MODEL_NAME = 'SajjadAyoubi/distil-bigbird-fa-zwnj'
fill = pipeline('fill-mask', model=MODEL_NAME, tokenizer=MODEL_NAME)
results = fill('تهران پایتخت [MASK] است.')  # "Tehran is the capital of [MASK]."
print(results[0]['token_str'])
>>> 'ایران'  # "Iran"

Pretraining details: 🔭

This model was pretrained with a masked language modeling (MLM) objective on the Persian portion of the OSCAR dataset. Following the original BERT training setup, 15% of tokens were masked. The BigBird architecture was first described in the original BigBird paper and released in its official repository. Documents longer than 4096 tokens were split into multiple documents, while documents much shorter than 4096 tokens were merged using the [SEP] token (a sketch of this packing step is shown below). The model was warm-started from the distilbert-fa checkpoint.

  • For more details, take a look at config.json on the model card in the 🤗 Model Hub.
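
The preprocessing code itself is not included in this README; a rough sketch of the split-and-merge idea described above (pack_documents is a hypothetical helper, not the authors' actual pipeline) might look like this:

def pack_documents(docs, tokenizer, max_len=4096):
    # illustrative only: split long documents, merge short ones with [SEP]
    step = max_len - 1  # leave room for the trailing [SEP] token
    packed, buf = [], []
    for doc in docs:
        ids = tokenizer(doc, add_special_tokens=False)["input_ids"]
        # documents longer than max_len are split into multiple pieces
        for i in range(0, len(ids), step):
            chunk = ids[i:i + step]
            # flush the buffer when the next piece would not fit
            if buf and len(buf) + len(chunk) + 1 > max_len:
                packed.append(buf)
                buf = []
            # shorter pieces accumulate into one sequence, joined by [SEP]
            buf += chunk + [tokenizer.sep_token_id]
    if buf:
        packed.append(buf)
    return packed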

Fine Tuning Recommendations: 🐤

Due to the model's memory requirements, gradient_checkpointing and gradient accumulation should be used to maintain a reasonable batch size. Since the model isn't very large, it is a good idea to first fine-tune it on your dataset with the masked LM objective (also called intermediate fine-tuning) before training on the main task. In block_sparse mode, the number of input tokens doesn't matter: each token attends to only about 256 tokens. For sequences of up to 512 tokens, use original_full attention instead of block_sparse.
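
A minimal sketch of these settings with the 🤗 Trainer API (the hyperparameter values are illustrative, not an official recipe):

from transformers import TrainingArguments

# a small per-device batch plus gradient accumulation keeps memory usage
# manageable while preserving a reasonable effective batch size (2 * 16 = 32)
training_args = TrainingArguments(
    output_dir="bigbird-fa-finetuned",  # hypothetical output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,  # recompute activations to save memory
    learning_rate=3e-5,
    num_train_epochs=3,
)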

Fine Tuning Examples 👷‍♂️👷‍♀️

Dataset           | Fine-Tuning Example
Digikala Magazine | Text Classification

Contact us: 🤝

If you have a technical question regarding the model, pretraining, code or publication, please create an issue in the repository. This is the fastest way to reach us.

Citation: ↩️

We haven't published a paper on this work; however, if you use it, please cite us with an entry like the one below.

@misc{ParsBigBird,
  author          = {Ayoubi, Sajjad},
  title           = {ParsBigBird: Persian Bert For Long-Range Sequences},
  year            = 2021,
  publisher       = {GitHub},
  journal         = {GitHub repository},
  howpublished    = {\url{https://github.com/SajjjadAyobi/ParsBigBird}},
}