
google-research / Bigbird

License: Apache-2.0
Transformers for Longer Sequences

Programming Languages

python

Projects that are alternatives of or similar to Bigbird

Protoc Gen Struct Transformer
Transformation functions generator for Protocol Buffers.
Stars: ✭ 105 (-28.08%)
Mutual labels:  transformer
Bertqa Attention On Steroids
BertQA - Attention on Steroids
Stars: ✭ 112 (-23.29%)
Mutual labels:  transformer
Onnxt5
Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.
Stars: ✭ 143 (-2.05%)
Mutual labels:  transformer
Multiturndialogzoo
Multi-turn dialogue baselines written in PyTorch
Stars: ✭ 106 (-27.4%)
Mutual labels:  transformer
Cjstoesm
A tool that can transform CommonJS to ESM
Stars: ✭ 109 (-25.34%)
Mutual labels:  transformer
Symfony Jsonapi
JSON API Transformer Bundle for Symfony 2 and Symfony 3
Stars: ✭ 114 (-21.92%)
Mutual labels:  transformer
Esbuild Jest
A Jest transformer using esbuild
Stars: ✭ 100 (-31.51%)
Mutual labels:  transformer
Tensorflowasr
End-to-end speech recognition models built on TensorFlow 2, with an RTF (real-time factor) of around 0.1 / Mandarin state-of-the-art automatic speech recognition in TensorFlow 2
Stars: ✭ 145 (-0.68%)
Mutual labels:  transformer
Overlappredator
[CVPR 2021, Oral] PREDATOR: Registration of 3D Point Clouds with Low Overlap.
Stars: ✭ 106 (-27.4%)
Mutual labels:  transformer
Nlp research
NLP research: TensorFlow-based NLP deep learning projects supporting four major tasks: text classification, sentence matching, sequence labeling, and text generation
Stars: ✭ 141 (-3.42%)
Mutual labels:  transformer
Transformers
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Stars: ✭ 55,742 (+38079.45%)
Mutual labels:  transformer
Kiss
Code for the paper "KISS: Keeping it Simple for Scene Text Recognition"
Stars: ✭ 108 (-26.03%)
Mutual labels:  transformer
Sightseq
Computer vision tools for fairseq, containing PyTorch implementation of text recognition and object detection
Stars: ✭ 116 (-20.55%)
Mutual labels:  transformer
Ghostnet
CV backbones including GhostNet, TinyNet and TNT, developed by Huawei Noah's Ark Lab.
Stars: ✭ 1,744 (+1094.52%)
Mutual labels:  transformer
Tupe
Transformer with Untied Positional Encoding (TUPE). Code of paper "Rethinking Positional Encoding in Language Pre-training". Improve existing models like BERT.
Stars: ✭ 143 (-2.05%)
Mutual labels:  transformer
Conformer
Implementation of the convolutional module from the Conformer paper, for use in Transformers
Stars: ✭ 103 (-29.45%)
Mutual labels:  transformer
Mmsegmentation
OpenMMLab Semantic Segmentation Toolbox and Benchmark.
Stars: ✭ 2,875 (+1869.18%)
Mutual labels:  transformer
Transformer Pytorch
Transformer implementation in PyTorch.
Stars: ✭ 149 (+2.05%)
Mutual labels:  transformer
The Story Of Heads
This is a repository with the code for the ACL 2019 paper "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned" and the paper "Analyzing Source and Target Contributions to NMT Predictions".
Stars: ✭ 146 (+0%)
Mutual labels:  transformer
Transformer In Generating Dialogue
An Implementation of 'Attention is all you need' with Chinese Corpus
Stars: ✭ 121 (-17.12%)
Mutual labels:  transformer

Big Bird: Transformers for Longer Sequences

Not an official Google product.

What is BigBird?

BigBird is a sparse-attention-based transformer which extends Transformer-based models, such as BERT, to much longer sequences. Moreover, BigBird comes with a theoretical understanding of the capabilities of a complete transformer that the sparse model can handle.

As a consequence of the capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization.

More details and comparisons can be found in our presentation.

Citation

If you find this useful, please cite our NeurIPS 2020 paper:

@article{zaheer2020bigbird,
  title={Big bird: Transformers for longer sequences},
  author={Zaheer, Manzil and Guruganesh, Guru and Dubey, Kumar Avinava and Ainslie, Joshua and Alberti, Chris and Ontanon, Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and others},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}

Code

The most important directory is core. There are three main files in core.

  • attention.py: Contains BigBird linear attention mechanism
  • encoder.py: Contains the main long sequence encoder stack
  • modeling.py: Contains packaged BERT and seq2seq transformer models with BigBird attention

Colab/IPython Notebook

A quick fine-tuning demonstration for text classification is provided in imdb.ipynb.
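
For reference, the notebook fine-tunes on the IMDB reviews data; here is a minimal sketch of loading that dataset with tensorflow_datasets (the dataset name and split follow the standard TFDS catalog and are assumptions, not taken from the notebook itself):

import tensorflow_datasets as tfds

# Load the IMDB sentiment dataset; "imdb_reviews" is the standard TFDS name.
train_ds = tfds.load("imdb_reviews", split="train", as_supervised=True)
for text, label in train_ds.take(1):
    print(text.numpy()[:80], int(label.numpy()))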

Create GCP Instance

Please create a project first, then create an instance in a zone which has the required quota, as follows:

gcloud compute instances create \
  bigbird \
  --zone=europe-west4-a \
  --machine-type=n1-standard-16 \
  --boot-disk-size=50GB \
  --image-project=ml-images \
  --image-family=tf-2-3-1 \
  --maintenance-policy TERMINATE \
  --restart-on-failure \
  --scopes=cloud-platform

gcloud compute tpus create \
  bigbird \
  --zone=europe-west4-a \
  --accelerator-type=v3-32 \
  --version=2.3.1

gcloud compute ssh --zone "europe-west4-a" "bigbird"

For illustration we used the instance name bigbird and zone europe-west4-a, but feel free to change them. More details about creating a Google Cloud TPU can be found in the online documentation.
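
Once inside the VM, a quick way to confirm that TensorFlow can reach the TPU is a connectivity check like the following (the training scripts below handle TPU setup themselves; the TPU name here is simply the one used in the gcloud commands above):

import tensorflow as tf

# Resolve and initialize the TPU created above (named "bigbird").
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="bigbird")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
print("TPU devices:", tf.config.list_logical_devices("TPU"))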

Installation and checkpoints

git clone https://github.com/google-research/bigbird.git
cd bigbird
pip3 install -e .
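
As an optional sanity check that the editable install is importable (module names taken from the core directory listed above):

python3 -c "from bigbird.core import attention, encoder, modeling; print('ok')"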

You can find pretrained and fine-tuned checkpoints in our Google Cloud Storage Bucket.

Optionally, you can download them using gsutil as

mkdir -p bigbird/ckpt
gsutil cp -r gs://bigbird-transformer/ bigbird/ckpt/
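
To see what is available before copying everything, the bucket can also be listed first (standard gsutil usage, not specific to this repository):

gsutil ls gs://bigbird-transformer/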

The storage bucket contains:

  • Pretrained BERT models in base (bigbr_base) and large (bigbr_large) sizes. They correspond to BERT/RoBERTa-like encoder-only models. Following the original BERT and RoBERTa implementations, they are transformers with post-normalization, i.e. layer norm happens after the attention layer. However, following Rothe et al., we can use them partially in encoder-decoder fashion by coupling the encoder and decoder parameters, as illustrated in the bigbird/summarization/roberta_base.sh launch script.
  • A pretrained Pegasus encoder-decoder transformer in large size (bigbp_large). Again, following the original Pegasus implementation, these are transformers with pre-normalization and a full set of separate encoder-decoder weights. For the long-document summarization datasets, we have also converted Pegasus checkpoints (model.ckpt-0) for each dataset and provided fine-tuned checkpoints (model.ckpt-300000) which work on longer documents.
  • A fine-tuned tf.SavedModel for long-document summarization, which can be directly used for prediction and evaluation as illustrated in the colab notebook; see the sketch after this list for loading it.
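
As a minimal sketch of using the exported SavedModel (the directory name under bigbird/ckpt/ is a placeholder; the actual path and serving signature come from the storage bucket and the colab notebook):

import tensorflow as tf

# Path is a placeholder; substitute the downloaded SavedModel directory.
imported = tf.saved_model.load("bigbird/ckpt/<summarization_saved_model_dir>")
# Inspect the available serving signatures before calling the model.
print(list(imported.signatures.keys()))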

Running Classification

To get started quickly with BigBird, one can run the classification experiment code in the classifier directory. To run the code, simply execute

export GCP_PROJECT_NAME=bigbird-project  # Replace by your project name
export GCP_EXP_BUCKET=gs://bigbird-transformer-training/  # Replace
sh -x bigbird/classifier/base_size.sh

Using the BigBird Encoder instead of BERT/RoBERTa

To directly use the BigBird encoder instead of, say, a BERT model, we can use the following code.

from bigbird.core import modeling

bigb_encoder = modeling.BertModel(...)

It can easily replace BERT's encoder.

Alternatively, one can also experiment with individual layers of the BigBird encoder:

from bigbird.core import encoder

only_layers = encoder.EncoderStack(...)

Understanding Flags & Config

All the flags and config are explained in core/flags.py. Here we explain some of the important config parameters.

attention_type is used to select the type of attention we would use. Setting it to block_sparse runs the BigBird attention module.

flags.DEFINE_enum(
    "attention_type", "block_sparse",
    ["original_full", "simulated_sparse", "block_sparse"],
    "Selecting attention implementation. "
    "'original_full': full attention from original bert. "
    "'simulated_sparse': simulated sparse attention. "
    "'block_sparse': blocked implementation of sparse attention.")

block_size is used to define the size of blocks, whereas num_rand_blocks is used to set the number of random blocks. The code currently uses a window size of 3 blocks and 2 global blocks. The current code only supports statically shaped tensors.
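
To make the block-sparse pattern concrete, here is a small NumPy sketch of the block-level attention mask implied by these settings (a conceptual illustration only, not the repository's implementation; the window/global/random split follows the description above, and the default of 3 random blocks is an assumption):

import numpy as np

def block_sparse_mask(num_blocks, num_rand_blocks=3, window=3, num_global=2, seed=0):
    # mask[i, j] == 1 means query block i attends to key block j.
    rng = np.random.default_rng(seed)
    mask = np.zeros((num_blocks, num_blocks), dtype=np.int32)
    for i in range(num_blocks):
        # Sliding window: each block attends to `window` neighboring blocks.
        lo = max(0, i - window // 2)
        hi = min(num_blocks, i + window // 2 + 1)
        mask[i, lo:hi] = 1
        # Random blocks: a few extra key blocks chosen at random per query block.
        mask[i, rng.choice(num_blocks, size=num_rand_blocks, replace=False)] = 1
    # Global blocks: the first `num_global` blocks attend everywhere and are attended by all.
    mask[:num_global, :] = 1
    mask[:, :num_global] = 1
    return mask

print(block_sparse_mask(num_blocks=8))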

Important points to note:

  • Hidden dimension should be divisible by the number of heads.
  • Currently the code only handles tensors of static shape, as it is primarily designed for TPUs, which only work with statically shaped tensors.
  • For sequence lengths less than 1024, using original_full is advised as there is no benefit in using sparse BigBird attention.