
xiongma / roberta-wwm-base-distill

License: Apache-2.0
This is a distilled RoBERTa-wwm-base model, distilled from RoBERTa-wwm-base using RoBERTa-wwm-large as the teacher.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to roberta-wwm-base-distill

Clue
中文语言理解测评基准 Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+3875.41%)
Mutual labels:  pretrained-models, bert, roberta
vietnamese-roberta
A Robustly Optimized BERT Pretraining Approach for Vietnamese
Stars: ✭ 22 (-63.93%)
Mutual labels:  pretrained-models, bert, roberta
Bertviz
Tool for visualizing attention in the Transformer model (BERT, GPT-2, Albert, XLNet, RoBERTa, CTRL, etc.)
Stars: ✭ 3,443 (+5544.26%)
Mutual labels:  bert, roberta
Pytorch-NLU
Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of Chinese long and short texts, as well as sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.
Stars: ✭ 151 (+147.54%)
Mutual labels:  pretrained-models, bert
HugsVision
HugsVision is an easy-to-use Hugging Face wrapper for state-of-the-art computer vision
Stars: ✭ 154 (+152.46%)
Mutual labels:  pretrained-models, bert
Albert zh
A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS; large-scale Chinese pretrained ALBERT models
Stars: ✭ 3,500 (+5637.7%)
Mutual labels:  bert, roberta
Chinese Bert Wwm
Pre-Training with Whole Word Masking for Chinese BERT (the Chinese BERT-wwm model series)
Stars: ✭ 6,357 (+10321.31%)
Mutual labels:  bert, roberta
syntaxdot
Neural syntax annotator, supporting sequence labeling, lemmatization, and dependency parsing.
Stars: ✭ 32 (-47.54%)
Mutual labels:  pretrained-models, bert
Tianchi2020ChineseMedicineQuestionGeneration
2020 Alibaba Cloud Tianchi Big Data Competition: Traditional Chinese Medicine Literature Question Generation Challenge
Stars: ✭ 20 (-67.21%)
Mutual labels:  bert, roberta
Awesome Sentence Embedding
A curated list of pretrained sentence and word embedding models
Stars: ✭ 1,973 (+3134.43%)
Mutual labels:  pretrained-models, bert
AiSpace
AiSpace: Better practices for deep learning model development and deployment For Tensorflow 2.0
Stars: ✭ 28 (-54.1%)
Mutual labels:  pretrained-models, bert
CLUE pytorch
PyTorch baselines for the CLUE benchmark
Stars: ✭ 72 (+18.03%)
Mutual labels:  bert, roberta
erc
Emotion recognition in conversation
Stars: ✭ 34 (-44.26%)
Mutual labels:  bert, roberta
Roberta zh
RoBERTa中文预训练模型: RoBERTa for Chinese
Stars: ✭ 1,953 (+3101.64%)
Mutual labels:  bert, roberta
KLUE
📖 Korean NLU Benchmark
Stars: ✭ 420 (+588.52%)
Mutual labels:  bert, roberta
transformer-models
Deep Learning Transformer models in MATLAB
Stars: ✭ 90 (+47.54%)
Mutual labels:  pretrained-models, bert
bert-squeeze
🛠️ Tools for Transformers compression using PyTorch Lightning ⚡
Stars: ✭ 56 (-8.2%)
Mutual labels:  bert, distillation
COVID-19-Tweet-Classification-using-Roberta-and-Bert-Simple-Transformers
Rank 1 / 216
Stars: ✭ 24 (-60.66%)
Mutual labels:  bert, roberta
Transformers
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Stars: ✭ 55,742 (+91280.33%)
Mutual labels:  pretrained-models, bert
les-military-mrc-rank7
Laisi Cup: 2nd National "Military Intelligent Machine Reading" Challenge - Rank 7 solution
Stars: ✭ 37 (-39.34%)
Mutual labels:  bert, roberta

A RoBERTa-wwm-ext Distillation Model

This is a Chinese RoBERTa-wwm distilled model, distilled from RoBERTa-wwm-ext-large. The large model comes from this GitHub repository; thanks to its author for the contribution.

Based On

This model was trained following this paper, which was published by Hugging Face.

Corpus

To train this model, I used baike_qa2019, news2016_zh, webtext_2019, and wiki_zh. The data can be found in this GitHub repository.

Model Download

The model is currently only available via BaiduYun; the download links are below.

Model                                            BaiduYun
Roberta-wwm-ext-base-distill, Chinese            TensorFlow
Roberta-wwm-ext-large-3layers-distill, Chinese   TensorFlow (extraction code: 26hu)
Roberta-wwm-ext-large-6layers-distill, Chinese   TensorFlow (extraction code: seou)

Training Details

Training consisted of two steps:

  • I used the roberta_ext_wwm_large model to compute the output for every example's masked tokens.

  • I used that output to train the student model, which was initialized with the roberta_ext_wwm_base pretrained weights.

Dataset

  • I masked each sentence in 5 different fixed ways (static masking), rather than using dynamic masking; see the sketch after this list.

  • Each example uses at most 20 masked tokens.
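
Below is a minimal Python sketch of this masking scheme, not the repository's create_pretraining_data.py: it simply produces 5 independently masked copies of one tokenized sentence, with at most 20 masked positions each. The function names are illustrative, and for brevity it masks individual tokens rather than whole words.

import random

MASK_TOKEN = "[MASK]"
NUM_DUPLICATES = 5          # 5 static maskings per sentence, no dynamic masking
MAX_MASKS_PER_EXAMPLE = 20  # at most 20 masked tokens per example
MASK_PROB = 0.15            # matches --masked_lm_prob=0.15 used below

def mask_sentence(tokens, rng):
    """Return one statically masked copy of `tokens` and the masked positions."""
    positions = list(range(len(tokens)))
    rng.shuffle(positions)
    num_to_mask = min(MAX_MASKS_PER_EXAMPLE,
                      max(1, int(round(len(tokens) * MASK_PROB))))
    masked = list(tokens)
    masked_positions = sorted(positions[:num_to_mask])
    for pos in masked_positions:
        masked[pos] = MASK_TOKEN
    return masked, masked_positions

def build_static_maskings(tokens, seed=12345):
    """Create NUM_DUPLICATES independently masked copies of one sentence."""
    rng = random.Random(seed)
    return [mask_sentence(tokens, rng) for _ in range(NUM_DUPLICATES)]

if __name__ == "__main__":
    tokens = list("使用全词掩码训练中文预训练模型")
    for masked, positions in build_static_maskings(tokens):
        print("".join(masked), positions)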

Teacher Model

  • I used the RoBERTa large model to get each masked token's output over the vocabulary, and kept only the top 128 values. You might ask why I did not keep more dimensions: first, the storage cost would be too high; second, I think keeping more is unnecessary. A sketch of this truncation is shown below.
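
The following is a small numpy sketch of the truncation described above, assuming the teacher produces a full vocabulary-sized logit vector for each masked position. The function name and the example vocabulary size are illustrative, not taken from the repository:

import numpy as np

TRUNCATION_FACTOR = 128  # matches --truncation_factor=128 in the commands below

def truncate_teacher_output(logits):
    """Keep only the top-k vocabulary logits for one masked position.

    `logits` is a 1-D array of length vocab_size produced by the teacher
    for a single [MASK] position. Storing only the top-k values and their
    vocabulary ids is far cheaper than storing the full distribution.
    """
    top_ids = np.argpartition(logits, -TRUNCATION_FACTOR)[-TRUNCATION_FACTOR:]
    top_ids = top_ids[np.argsort(logits[top_ids])[::-1]]  # sort descending by logit
    return logits[top_ids], top_ids

if __name__ == "__main__":
    vocab_size = 21128  # vocabulary size of the Chinese BERT-wwm models
    rng = np.random.default_rng(0)
    fake_logits = rng.normal(size=vocab_size)
    values, ids = truncate_teacher_output(fake_logits)
    print(values.shape, ids.shape)  # (128,) (128,)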

Student Model

  • Loss: For this training I used two loss functions, cross-entropy and cosine loss, summed together (a sketch of the combined loss follows the parameter table below). I think a different loss function could bring a larger improvement, but I did not have the resources to keep training this model because my free Google TPU access expired.

  • Other Parameters

Model                                            Batch size   Learning rate   Training steps   Warmup steps
Roberta-wwm-ext-base-distill, Chinese            384          5e-5            1M               20K
Roberta-wwm-ext-large-3layers-distill, Chinese   128          3e-5            3M               2.5K
Roberta-wwm-ext-large-6layers-distill, Chinese   512          8e-5            1M               5K
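
Below is a minimal numpy sketch of the combined loss described above: cross-entropy between the student's predictions and the teacher's truncated output distribution, plus a cosine loss, simply added together. The function names and the use of the teacher's top-128 logits as soft targets are illustrative assumptions, not the repository's run_distill.py:

import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits):
    """Combined distillation loss for one masked position.

    Both arguments are length-128 vectors: the student's logits gathered at
    the teacher's top-128 vocabulary ids, and the teacher's stored top-128
    logits. Cross-entropy and cosine loss are simply added.
    """
    p_teacher = softmax(teacher_logits)                  # soft targets
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    cross_entropy = -np.sum(p_teacher * log_p_student)

    cosine = np.dot(student_logits, teacher_logits) / (
        np.linalg.norm(student_logits) * np.linalg.norm(teacher_logits) + 1e-12)
    cosine_loss = 1.0 - cosine                           # 0 when directions match

    return cross_entropy + cosine_loss

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=128)
    student = teacher + 0.1 * rng.normal(size=128)       # a student close to the teacher
    print(distill_loss(student, teacher))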

Comparison

For this comparison, each task was run only once; the results are below.

Classification

Model                                            AFQMC    CMNLI    TNEWS
Roberta-wwm-ext-base, Chinese                    74.04%   80.51%   56.94%
Roberta-wwm-ext-base-distill, Chinese            74.44%   81.1%    57.6%
Roberta-wwm-ext-large-3layers-distill, Chinese   68.8%    75.5%    55.7%
Roberta-wwm-ext-large-6layers-distill, Chinese   72%      79.3%    56.7%

Model                                            LCQMC dev   LCQMC test
Roberta-wwm-ext-base, Chinese                    89%         86.5%
Roberta-wwm-ext-base-distill, Chinese            89%         87.2%
Roberta-wwm-ext-large-3layers-distill, Chinese   85.1%       86%
Roberta-wwm-ext-large-6layers-distill, Chinese   87.7%       86.7%

Reading Comprehension (CMRC 2018)

Model                                            CMRC 2018 dev (F1 / EM)
Roberta-wwm-ext-base, Chinese                    84.72% / 65.24%
Roberta-wwm-ext-base-distill, Chinese            85.2%  / 65.20%
Roberta-wwm-ext-large-3layers-distill, Chinese   78.5%  / 57.4%
Roberta-wwm-ext-large-6layers-distill, Chinese   82.6%  / 61.7%

You may notice that these numbers differ from the ones reported in that GitHub repository; I am not sure why. I ran these tasks myself with the original base model and obtained the scores above, and then ran them with the same hyperparameters and the distilled model and again obtained the scores above. The difference may simply come from using different hyperparameters.

But as you can see, under the same conditions the distilled model improves on the original model.

How To Train

  • create pretraining data
export DATA_DIR=YOUR_DATA_DIR
export OUTPUT_DIR=YOUR_OUTPUT_DIR
export VOCAB_FILE=YOUR_VOCAB_FILE

python create_pretraining_data.py \
        --input_dir=$DATA_DIR \
        --output_dir=$OUTPUT_DIR \
        --vocab_file=$VOCAB_FILE \
        --do_whole_word_mask=True \
        --ramdom_next=True \
        --max_seq_length=512 \
        --max_predictions_per_seq=20 \
        --random_seed=12345 \
        --dupe_factor=5 \
        --masked_lm_prob=0.15 \
        --doc_stride=256 \
        --max_workers=2 \
        --short_seq_prob=0.1
  • create teacher output data
export TF_RECORDS=YOUR_PRETRAINING_TF_RECORDS
export TEACHER_MODEL=YOUR_TEACHER_MODEL_DIR
export OUTPUT_DIR=YOUR_OUTPUT_DIR

python create_teacher_output_data.py \
       --bert_config_file=$TEACHER_MODEL/bert_config.json \
       --input_file=$TF_RECORDS \
       --output_dir=$OUTPUT_DIR \
       --truncation_factor=128 \
       --init_checkpoint=$TEACHER_MODEL/bert_model.ckpt \
       --max_seq_length=512 \
       --max_predictions_per_seq=20 \
       --predict_batch_size=64 
  • run distill
export TF_RECORDS=YOUR_TEACHER_OUTPUT_TF_RECORDS
export STUDENT_MODEL_DIR=YOUR_STUDENT_MODEL_DIR
export OUTPUT_DIR=YOUR_OUTPUT_DIR

python run_distill.py \
       --bert_config_file=$STUDENT_MODEL_DIR/bert_config.json \
       --input_file=$TF_RECORDS \
       --output_dir=$OUTPUT_DIR \
       --init_checkpoint=$STUDENT_MODEL_DIR/bert_model.ckpt \
       --truncation_factor=128 \
       --max_seq_length=512 \
       --max_predictions_per_seq=20 \
       --do_train=True \
       --do_eval=True \
       --train_batch_size=384 \
       --eval_batch_size=1024 \
       --num_train_steps=1000000 \
       --num_warmup_steps=20000 

Answers

  • We need a smaller model; yours is still base-sized.
  1. The purpose of publishing this model is to verify the feasibility of the distillation method.

  2. As you can see, this distillation method can improve accuracy.

  • Why did you publish the 3-layer model?
  1. Some GitHub users told me that they need a smaller model: the BERT-base version is too large and they cannot afford the server cost, so I published the smaller ones!

Thanks

Thanks to TFRC for supporting this work with TPUs!
