
graykode / ALBERT-Pytorch

License: Apache-2.0
PyTorch implementation of ALBERT (A Lite BERT for Self-supervised Learning of Language Representations)

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to ALBERT-Pytorch

bert in a flask
A dockerized flask API, serving ALBERT and BERT predictions using TensorFlow 2.0.
Stars: ✭ 32 (-85.05%)
Mutual labels:  albert, bert
Spark Nlp
State of the Art Natural Language Processing
Stars: ✭ 2,518 (+1076.64%)
Mutual labels:  albert, bert
Albert zh
A Lite BERT for Self-supervised Learning of Language Representations; large-scale pre-trained Chinese ALBERT models
Stars: ✭ 3,500 (+1535.51%)
Mutual labels:  albert, bert
MobileQA
Offline on-device reading comprehension; QA for mobile, Android & iPhone
Stars: ✭ 49 (-77.1%)
Mutual labels:  albert, bert
classifier multi label seq2seq attention
Multi-label text classification with BERT/ALBERT using seq2seq, attention, and beam search
Stars: ✭ 26 (-87.85%)
Mutual labels:  albert, bert
CLUE pytorch
PyTorch version of the CLUE baselines
Stars: ✭ 72 (-66.36%)
Mutual labels:  albert, bert
Clue
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+1033.18%)
Mutual labels:  albert, bert
keras-bert-ner
Keras solution of Chinese NER task using BiLSTM-CRF/BiGRU-CRF/IDCNN-CRF model with Pretrained Language Model: supporting BERT/RoBERTa/ALBERT
Stars: ✭ 7 (-96.73%)
Mutual labels:  albert, bert
bert nli
A Natural Language Inference (NLI) model based on Transformers (BERT and ALBERT)
Stars: ✭ 97 (-54.67%)
Mutual labels:  albert, bert
NLP-paper
🎨 NLP (Natural Language Processing) tutorial: https://dataxujing.github.io/NLP-paper/
Stars: ✭ 23 (-89.25%)
Mutual labels:  albert, bert
Chineseglue
Language Understanding Evaluation benchmark for Chinese: datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 1,548 (+623.36%)
Mutual labels:  albert, bert
tfbert
Pre-trained model interface based on TensorFlow 1.x, supporting single-machine multi-GPU training, gradient accumulation, XLA acceleration, and mixed precision; flexible training, validation, and prediction.
Stars: ✭ 54 (-74.77%)
Mutual labels:  albert, bert
Transformer-QG-on-SQuAD
Implement Question Generator with SOTA pre-trained Language Models (RoBERTa, BERT, GPT, BART, T5, etc.)
Stars: ✭ 28 (-86.92%)
Mutual labels:  albert, bert
classifier multi label
Multi-label text classification with BERT/ALBERT
Stars: ✭ 127 (-40.65%)
Mutual labels:  albert, bert
Medi-CoQA
Conversational Question Answering on Clinical Text
Stars: ✭ 22 (-89.72%)
Mutual labels:  albert, bert
LMMS
Language Modelling Makes Sense - WSD (and more) with Contextual Embeddings
Stars: ✭ 79 (-63.08%)
Mutual labels:  bert
WSDM-Cup-2019
[ACM-WSDM] 3rd place solution at WSDM Cup 2019, Fake News Classification on Kaggle.
Stars: ✭ 62 (-71.03%)
Mutual labels:  bert
bert-AAD
Adversarial Adaptation with Distillation for BERT Unsupervised Domain Adaptation
Stars: ✭ 27 (-87.38%)
Mutual labels:  bert
GoEmotions-pytorch
Pytorch Implementation of GoEmotions 😍😢😱
Stars: ✭ 95 (-55.61%)
Mutual labels:  bert
robo-vln
Pytorch code for ICRA'21 paper: "Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation"
Stars: ✭ 34 (-84.11%)
Mutual labels:  bert

ALBERT-Pytorch

A simple implementation of ALBERT (A Lite BERT for Self-supervised Learning of Language Representations) in PyTorch. This implementation is based on the clean dhlee347/pytorchic-bert code.

Please note that I have not checked downstream performance yet (i.e., fine-tuning); I have only verified that the SOP (sentence-order prediction) and MLM (masked language model with n-gram masking) losses decrease.

CAUTION: Fine-tuning tasks are not supported yet!

File Overview

This repository contains the following Python files.

  • tokenization.py : Tokenizers adapted from the original Google BERT code
  • models.py : Model classes for a general transformer
  • optim.py : A custom optimizer (BertAdam class) adapted from Hugging Face's code
  • train.py : A helper class for training and evaluation
  • utils.py : Several utility functions
  • pretrain.py : Example code for pre-training the transformer

Pre-Training

Pre-training was unit-tested with the WikiText-2 dataset on a single GPU (t2.xlarge). You can also run it on multiple GPUs in parallel or on CPU.

$ CUDA_LAUNCH_BLOCKING=1 python pretrain.py \
            --data_file './data/wiki.train.tokens' \
            --vocab './data/vocab.txt' \
            --train_cfg './config/pretrain.json' \
            --model_cfg './config/albert_unittest.json' \
            --max_pred 75 --mask_prob 0.15 \
            --mask_alpha 4 --mask_beta 1 --max_gram 3 \
            --save_dir './saved' \
            --log_dir './logs'
			
cuda (1 GPUs)
Iter (loss=19.162): : 526it [02:25,  3.58it/s]
Epoch 1/25 : Average Loss 18.643
Iter (loss=12.589): : 524it [02:24,  3.63it/s]
Epoch 2/25 : Average Loss 13.650
Iter (loss=9.610): : 523it [02:24,  3.62it/s]
Epoch 3/25 : Average Loss 9.944
Iter (loss=10.612): : 525it [02:24,  3.60it/s]
Epoch 4/25 : Average Loss 9.018
Iter (loss=9.547): : 527it [02:25,  3.66it/s]
...

TensorBoardX logs the two pre-training losses: loss_lm + loss_sop.

# to use TensorboardX
$ pip install -U protobuf tensorflow
$ pip install tensorboardX
$ tensorboard --logdir logs # expose http://server-ip:6006/
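
For reference, a minimal sketch (not the repository's exact code) of how the two losses could be written to TensorBoardX; the per-step loss values below are placeholders:

    from tensorboardX import SummaryWriter

    writer = SummaryWriter(log_dir='./logs')
    # hypothetical per-step losses; in the real trainer these come from pretrain.py
    for step, (loss_lm, loss_sop) in enumerate([(9.5, 0.70), (8.9, 0.68), (8.2, 0.64)]):
        writer.add_scalar('loss_lm', loss_lm, step)
        writer.add_scalar('loss_sop', loss_sop, step)
        writer.add_scalar('loss_total', loss_lm + loss_sop, step)
    writer.close()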

Key ideas in ALBERT, illustrated with code.

  1. SOP (sentence-order prediction) loss: In the original BERT, negative (is-not-next) pairs are created by randomly picking the second sentence; ALBERT instead uses the same two consecutive segments with their order swapped as negative examples.

    is_next = rand() < 0.5 # whether token_b is next to token_a or not

    tokens_a = self.read_tokens(self.f_pos, len_tokens, True)
    seek_random_offset(self.f_neg)
    # f_next = self.f_pos if is_next else self.f_neg
    f_next = self.f_pos # `f_next` always points to the consecutive (positive) segment
    tokens_b = self.read_tokens(f_next, len_tokens, False)

    if tokens_a is None or tokens_b is None: # end of file
        self.f_pos.seek(0, 0) # reset file pointer
        return

    # SOP, sentence-order prediction: negatives swap the order of the two segments
    instance = (is_next, tokens_a, tokens_b) if is_next \
        else (is_next, tokens_b, tokens_a)
  2. Cross-layer parameter sharing: ALBERT shares the attention and FFN (feed-forward network) parameters across all layers to reduce the number of parameters.

    class Transformer(nn.Module):
        """ Transformer with Self-Attentive Blocks"""
        def __init__(self, cfg):
            super().__init__()
            self.embed = Embeddings(cfg)
            # The original BERT keeps one Block per layer (no parameter sharing)
            # self.blocks = nn.ModuleList([Block(cfg) for _ in range(cfg.n_layers)])
    
            # ALBERT reuses a single set of attention/FFN parameters across all layers
            self.n_layers = cfg.n_layers
            self.attn = MultiHeadedSelfAttention(cfg)
            self.proj = nn.Linear(cfg.hidden, cfg.hidden)
            self.norm1 = LayerNorm(cfg)
            self.pwff = PositionWiseFeedForward(cfg)
            self.norm2 = LayerNorm(cfg)
            # self.drop = nn.Dropout(cfg.p_drop_hidden)
    
        def forward(self, x, seg, mask):
            h = self.embed(x, seg)
    
            for _ in range(self.n_layers):
                # h = block(h, mask)
                h = self.attn(h, mask)
                h = self.norm1(h + self.proj(h))
                h = self.norm2(h + self.pwff(h))
    
            return h
  3. Factorized embedding parameterization: ALBERT factorizes the embedding matrix (V×D) into two smaller matrices, V×E and E×D.

    class Embeddings(nn.Module):
        "The embedding module from word, position and token_type embeddings."
        def __init__(self, cfg):
            super().__init__()
            # Original BERT Embedding
            # self.tok_embed = nn.Embedding(cfg.vocab_size, cfg.hidden) # token embedding
    
            # factorized embedding
            self.tok_embed1 = nn.Embedding(cfg.vocab_size, cfg.embedding)
            self.tok_embed2 = nn.Linear(cfg.embedding, cfg.hidden)
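            # In forward, token embeddings are computed as self.tok_embed2(self.tok_embed1(x)):
            # a V x E lookup followed by an E x D projection.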
    
            self.pos_embed = nn.Embedding(cfg.max_len, cfg.hidden) # position embedding
            self.seg_embed = nn.Embedding(cfg.n_segments, cfg.hidden) # segment(token type) embedding
  4. n-gram MLM: MLM targets are selected with n-gram masking (Joshi et al., 2019). As in the paper, I use up to 3-grams. The code is adapted from the XLNet implementation; a minimal sketch of the span-length sampling follows this list.
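
For reference, a minimal sketch (a hypothetical helper, not the repository's exact code) of how n-gram span lengths can be sampled with probability proportional to 1/n, as described in the ALBERT paper:

    import random

    def sample_span_lengths(num_to_mask, max_gram=3):
        # Span lengths are drawn with p(n) proportional to 1/n (n = 1..max_gram),
        # so unigrams are most likely and 3-grams least likely.
        weights = [1.0 / n for n in range(1, max_gram + 1)]
        lengths = []
        while sum(lengths) < num_to_mask:
            n = random.choices(range(1, max_gram + 1), weights=weights, k=1)[0]
            lengths.append(min(n, num_to_mask - sum(lengths)))
        return lengths

    # e.g. sample_span_lengths(10) could return [1, 3, 1, 2, 1, 2]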

Not Implemented Yet

  • In the paper, the authors use a batch size of 4096 with the LAMB optimizer and a learning rate of 0.00176 (You et al., 2019), training all models for 125,000 steps. A hedged sketch of a possible LAMB setup follows below.
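
A sketch of how the paper's optimizer setup could be approximated; the third-party torch_optimizer package is an assumption here, not a dependency of this repository:

    import torch.nn as nn
    import torch_optimizer  # assumption: pip install torch_optimizer provides a LAMB implementation

    model = nn.Linear(128, 768)  # placeholder for the ALBERT Transformer from models.py
    optimizer = torch_optimizer.Lamb(model.parameters(), lr=0.00176, weight_decay=0.01)
    # The paper additionally uses a batch size of 4096 and trains for 125,000 steps,
    # which is beyond what a single-GPU unit test covers.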

Author

  • Tae Hwan Jung (Jeff Jung) @graykode, Kyung Hee University, Computer Engineering (undergraduate).
  • Author Email : [email protected]