A Roberta-ext-wwm Distillation Model

This is a chinese Roberta wwm distillation model which was distilled from roberta-ext-wwm-large. The large model is from this github, thanks for his contribution.

Based On

This model was trained based on this paper, which was punished by huggingface.

Corpus

For train this model, I used baike_qa2019, news2016_zh, webtext_2019, wiki_zh. this data can be found in this github

Model Download

I just support BaiduYun to down this model, this link is below.

Model	BaiduYun
Roberta-wwm-ext-base-distill, Chinese	Tensorflow
Roberta-wwm-ext-large-3layers-distill, Chinese	Tensorflow 26hu
Roberta-wwm-ext-large-6layers-distill, Chinese	Tensorflow seou

Train Detail

To train this model, I used 2 steps.

I used roberta_ext_wwm_large model to get all examples tokens' output.
I used the output to train the model, which inited roberta_ext_wwm_base pretrain model weights.

Dataset

I just used 5 different ways to mask one sentence, not dynamic mask.
Every example just use maximum 20 token masks

Teacher Model

I used Roberta large model to get every masked token's output, which was mapped to vocab, I just kept max 128 dimensions, you could ask why didn't you keep more dimensions, first, the storge is too much, second, I think keep too much is unneccessary.

Student Model

Loss: In this training, I use 2 loss functions, first is cross entropy, second is cosin loss, add them together, I think it has a big improvement if I use another loss function, but I didn't have too much resource to train this model, because my free Google TPU expired.
Other Parameters

Parameter	batch size	learning rate	training step	warming step
Roberta-wwm-ext-base-distill, Chinese	384	5e-5	1M	2W
Roberta-wwm-ext-large-3layers-distill, Chinese	128	3e-5	3M	2.5K
Roberta-wwm-ext-large-6layers-distill, Chinese	512	8e-5	1M	5K

Comparison

In this part, every task I just ran one time, the result is below.

Classification

Model	AFQMC	CMNLI	TNEWS
Roberta-wwm-ext-base, Chinese	74.04%	80.51%	56.94%
Roberta-wwm-ext-base-distill, Chinese	74.44%	81.1%	57.6%
Roberta-wwm-ext-large-3layers-distill, Chinese	68.8%	75.5%	55.7%
Roberta-wwm-ext-large-6layers-distill, Chinese	72%	79.3%	56.7%

Model	LCQMC dev	LCQMC test
Roberta-wwm-ext-base, Chinese	89%	86.5%
Roberta-wwm-ext-base-distill, Chinese	89%	87.2%
Roberta-wwm-ext-large-3layers-distill, Chinese	85.1%	86%
Roberta-wwm-ext-large-6layers-distill, Chinese	87.7%	86.7%

SQUAD

Model	CMRC2018 dev (F1/EM)
Roberta-wwm-ext-base, Chinese	84.72%/65.24%
Roberta-wwm-ext-base-distill, Chinese	85.2%/65.20%
Roberta-wwm-ext-large-3layers-distill, Chinese	78.5%/57.4%
Roberta-wwm-ext-large-6layers-distill, Chinese	82.6%/61.7%

In this part you could ask, your comparison is different with this github, I don't know why, I just used the original base model to run this task, got the score is up, and I used same parameters and distilled model to run this task, got the score is up. Maybe I used the different parameters.

But as you can see, in the same situation, the distilled model has improvement than the original model.

How To Train

create pretraining data

export DATA_DIR=YOUR_DATA_DIR
export OUTPUT_DIR=YOUR_OUTPUT_DIR
export VOCAB_FILE=YOUR_VOCAB_FILE

python create_pretraining_data.py \
        --input_dir=$DATA_DIR\
        --output_dir=$OUTPUT_DIR \
        --vocab_file=$YOUR_VOCAB_FILE \
        --do_whole_word_mask=True \
        --ramdom_next=True \
        --max_seq_length=512 \
        --max_predictions_per_seq=20 \
        --random_seed=12345 \
        --dupe_factor=5 \
        --masked_lm_prob=0.15 \
        --doc_stride=256 \
        --max_workers=2 \
        --short_seq_prob=0.1

create teacher output data

export TF_RECORDS=YOUR_PRETRAINING_TF_RECORDS
export TEACHER_MODEL=YOUR_TEACHER_MODEL_DIR
export OUTPUT_DIR=YOUR_OUTPUT_DIR

python create_teacher_output_data.py \
       --bert_config_file=$TEACHER_MODEL/bert_config.json \
       --input_file=$TF_RECORDS \
       --output_dir=$YOUR_OUTPUT_DIR \
       --truncation_factor=128 \
       --init_checkpoint=$TEACHER_MODEL\bert_model.ckpt \
       --max_seq_length=512 \
       --max_predictions_per_seq=20 \
       --predict_batch_size=64

run distill

export TF_RECORDS=YOUR_TEACHER_OUTPUT_TF_RECORDS
export STUDENT_MODEL_DIR=YOUR_STUDENT_MODEL_DIR
export OUTPUT_DIR=YOUR_OUTPUT_DIR

python run_distill.py \
       --bert_config_file=$STUDENT_MODEL_DIR\bert_config.json \
       --input_file=$TF_RECORDS \
       --output_dir=$OUTPUT_DIR \
       --init_checkpoint=$STUDENT_MODEL_DIR\bert_model.ckpt
       --truncation_factor=128 \
       --max_seq_length=512 \
       --max_predictions_per_seq=20 \
       --do_train=True \
       --do_eval=True \
       --train_batch_size=384 \
       --eval_batch_size=1024 \
       --num_train_steps=1000000 \
       --num_warmup_steps=20000

Answers

We need a small size one, your model are still base size.

The purpose of punish this model is to identify feasibility of distilled of method.
As you can see, this distilled method can improve the accuracy.

Why did you punish the 3 layers model?

Some githuber told me, we need small size one, the bert base version is so large, I can't afford the cost of the server, so I punished the small size one!

Thanks

Thanks TFRC supports the TPU!

Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

xiongma / roberta-wwm-base-distill

Programming Languages

Labels

Projects that are alternatives of or similar to roberta-wwm-base-distill