PyTorch implementation of Video Transformer Benchmarks
This repository is mainly built upon PyTorch and PyTorch-Lightning. We aim to maintain a collection of scalable video transformer benchmarks and to discuss the training recipes needed to train large video transformer models.
So far we have implemented TimeSformer, ViViT and MaskFeat, and we have pre-trained TimeSformer-B, ViViT-B and MaskFeat on Kinetics400/600. We still cannot guarantee the performance reported in the papers, but we have found some relevant hyper-parameters that may help reach the target performance.
Update
- We have fixed several known issues and can now build scripts to pretrain `MViT-B` with `MaskFeat` or finetune `MViT-B` / `TimeSformer-B` / `ViViT-B` on K400.
- We have reimplemented the HOG extraction and HOG prediction methods in `MaskFeat`, which now make pretraining more efficient.
- Note that anyone who wants to train `TimeSformer-B` or `ViViT-B` with the current repo needs to carefully adjust the learning rate and weight decay for better performance. For example, you can choose 0.005 for the peak learning rate and 0.0001 for the weight decay by default (see the optimizer sketch after this list).
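As a rough illustration of those defaults, here is a minimal optimizer/scheduler sketch. In this repo the values are actually passed via the `-lr`, `-optim_type` and `-lr_schedule` flags of `model_pretrain.py`, and the placeholder model below is hypothetical.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 400)  # hypothetical placeholder for the video transformer

# Suggested defaults from the note above: peak lr 0.005, weight decay 1e-4, cosine decay.
optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)
```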
Difference
In order to share the basic divided space-time attention module across different video transformers, we make changes in the following respects.
1. Position embedding
We split the position embedding from the `R(n_t*h*w×d)` shape mentioned in the ViViT paper into a spatial `R(n_h*w×d)` embedding and a temporal `R(n_t×d)` embedding, to stay consistent with TimeSformer.
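A minimal sketch of the split is shown below; the parameter names (`pos_embed_spatial` / `pos_embed_temporal`) are illustrative, not the repo's actual attribute names.

```python
import torch
import torch.nn as nn

class SplitPositionEmbedding(nn.Module):
    """Minimal sketch: separate spatial and temporal position embeddings
    instead of one joint (n_t * n_h * n_w, d) embedding."""
    def __init__(self, num_frames, num_patches, dim):
        super().__init__()
        self.pos_embed_spatial = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.pos_embed_temporal = nn.Parameter(torch.zeros(1, num_frames, dim))

    def forward(self, x):
        # x: (B, T, N, D) patch tokens, T frames, N spatial patches per frame
        x = x + self.pos_embed_spatial.unsqueeze(1)   # broadcast over frames
        x = x + self.pos_embed_temporal.unsqueeze(2)  # broadcast over patches
        return x
```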
2. Class token
In order to make clear whether the `class_token` enters a module's forward computation, we only compute the interaction between the `class_token` and the `query` when the current layer is the last layer (excluding the `FFN`) of each transformer block.
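The following is a minimal sketch of this idea; the class and argument names are illustrative, not the repo's actual implementation.

```python
import torch
import torch.nn as nn

class AttentionWithOptionalCls(nn.Module):
    """Minimal sketch: the class token joins the attention computation only
    when this is the last attention layer (FFN excluded) of a block."""
    def __init__(self, dim, num_heads, is_last=False):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.is_last = is_last

    def forward(self, cls_token, tokens):
        # cls_token: (B, 1, D), tokens: (B, N, D)
        if self.is_last:
            x = torch.cat([cls_token, tokens], dim=1)
            out, _ = self.attn(x, x, x)
            return out[:, :1], out[:, 1:]
        out, _ = self.attn(tokens, tokens, tokens)
        return cls_token, out  # class token passes through untouched
```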
3. Initialize from the pre-trained model
- Tokenization: the token embedding filter can be chosen as either `Conv2D` or `Conv3D`, and the `Conv3D` filter weights can be initialized from `Conv2D` weights either by replicating them along the temporal dimension and averaging, or by zero-initializing all temporal positions except the center `t/2` (a sketch of the two options follows this list).
- Temporal `MSA` module weights: one can either copy the weights from the spatial `MSA` module or initialize all weights with zeros.
- Initialize from the `MAE` pre-trained model provided by ZhiLiang, where the `class_token`, which does not appear in the `MAE` pre-trained model, is initialized from a truncated normal distribution.
- Initialization from the `ViT` pre-trained model can be found here.
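Below is a minimal sketch of the two `Conv2D`-to-`Conv3D` inflation options mentioned in the first bullet; the function name and signature are ours, assuming a `(out_channels, in_channels, h, w)` Conv2D weight.

```python
import torch

def inflate_conv2d_to_conv3d(w2d, t, mode="center"):
    """Minimal sketch: inflate a Conv2D patch-embedding weight (out, in, h, w)
    into a Conv3D weight (out, in, t, h, w).
    mode="average": replicate along time and divide by t so activations match.
    mode="center":  zeros everywhere except the central temporal position t//2."""
    out_c, in_c, h, w = w2d.shape
    if mode == "average":
        return w2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t
    w3d = torch.zeros(out_c, in_c, t, h, w, dtype=w2d.dtype)
    w3d[:, :, t // 2] = w2d
    return w3d
```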
TODO
- [√] Add more `TimeSformer` and `ViViT` variants of pre-trained weights.
  - A larger version and other operation types.
- [√] Add `linear probe` and `finetune` recipes.
  - Make it possible to transfer the pre-trained model to downstream tasks.
- Add more scalable video transformer benchmarks.
  - We will mainly focus on data-efficient models.
- Add more robust objective functions.
  - Pre-train the model with the dominant self-supervised methods, e.g. Masked Image Modeling.
Setup
pip install -r requirements.txt
Usage
Training
# path to Kinetics400 train set and val set
TRAIN_DATA_PATH='/path/to/Kinetics400/train_list.txt'
VAL_DATA_PATH='/path/to/Kinetics400/val_list.txt'
# path to root directory
ROOT_DIR='/path/to/work_space'
# path to pretrain weights
PRETRAIN_WEIGHTS='/path/to/weights'
# pretrain mvit using maskfeat
python model_pretrain.py \
-lr 8e-4 -epoch 300 -batch_size 16 -num_workers 8 -frame_interval 4 -num_frames 16 -num_class 400 \
-root_dir $ROOT_DIR -train_data_path $TRAIN_DATA_PATH
# finetune mvit with maskfeat pretrain weights
python model_pretrain.py \
-lr 0.005 -epoch 200 -batch_size 8 -num_workers 4 -num_frames 16 -frame_interval 4 -num_class 400 \
-arch 'mvit' -optim_type 'adamw' -lr_schedule 'cosine' -objective 'supervised' -mixup True \
-auto_augment 'rand_aug' -root_dir $ROOT_DIR -train_data_path $TRAIN_DATA_PATH \
-val_data_path $VAL_DATA_PATH -pretrain_pth $PRETRAIN_WEIGHTS
# finetune timesformer with imagenet pretrain weights
python model_pretrain.py \
-lr 0.005 -epoch 30 -batch_size 8 -num_workers 4 -num_frames 8 -frame_interval 32 -num_class 400 \
-arch 'timesformer' -attention_type 'divided_space_time' -optim_type 'sgd' -lr_schedule 'cosine' \
-objective 'supervised' -root_dir $ROOT_DIR -train_data_path $TRAIN_DATA_PATH \
-val_data_path $VAL_DATA_PATH -pretrain_pth $PRETRAIN_WEIGHTS -weights_from 'imagenet'
# finetune vivit with imagenet pretrain weights
python model_pretrain.py \
-lr 0.005 -epoch 30 -batch_size 8 -num_workers 4 -num_frames 16 -frame_interval 16 -num_class 400 \
-arch 'vivit' -attention_type 'fact_encoder' -optim_type 'sgd' -lr_schedule 'cosine' \
-objective 'supervised' -root_dir $ROOT_DIR -train_data_path $TRAIN_DATA_PATH \
-val_data_path $VAL_DATA_PATH -pretrain_pth $PRETRAIN_WEIGHTS -weights_from 'imagenet'
The minimal folder structure will look like the following.
root_dir
├── results
│ ├── experiment_tag
│ │ ├── ckpt
│ │ ├── log
Result
Kinetics-400/600
1. Model Zoo
name | weights from | dataset | epochs | num frames | spatial crop | top1_acc | top5_acc | weight | log |
---|---|---|---|---|---|---|---|---|---|
TimeSformer-B | ImageNet-21K | K600 | 15e | 8 | 224 | 78.4 | 93.6 | Google drive or BaiduYun(code: yr4j) | log |
ViViT-B | ImageNet-21K | K400 | 30e | 16 | 224 | 75.2 | 91.5 | Google drive | |
MaskFeat | from scratch | K400 | 100e | 16 | 224 | - | - | Google drive | |
1.1 Visualize
For each column, we show the masked input (left), HOG predictions (middle) and the original video frame (right).
Here, we show the extracted attention map of a random frame sampled from the demo video.
2. Train Recipe (ablation study)
2.1 Acc
operation | top1_acc | top5_acc | top1_acc (three crop) |
---|---|---|---|
base | 68.2 | 87.6 | - |
+ `frame_interval` 4 -> 16 (span more time) | 72.9 (+4.7) | 91.0 (+3.4) | - |
+ RandomCrop, flip (overcome overfit) | 75.7 (+2.8) | 92.5 (+1.5) | - |
+ batch size 16 -> 8 (more iterations) | 75.8 (+0.1) | 92.4 (-0.1) | - |
+ `frame_interval` 16 -> 24 (span more time) | 77.7 (+1.9) | 93.3 (+0.9) | 78.4 |
+ `frame_interval` 24 -> 32 (span more time) | 78.4 (+0.7) | 94.0 (+0.7) | 79.1 |
Tips: `frame_interval` and data augmentation matter a lot for validation accuracy.
2.2 Time
operation | epoch_time |
---|---|
base (start with DDP) | 9h+ |
+ speed up training recipes | 1h+ |
+ switch from `get_batch first` to `sample_Indice first` | 0.5h |
+ batch size 16 -> 8 | 33.32m |
+ num_workers 8 -> 4 | 35.52m |
+ frame_interval 16 -> 24 | 44.35m |
Tips: increasing `frame_interval` noticeably hurts time performance (longer epochs).
1. `speed up training recipes`:
   - Use more GPU devices.
   - Set `pin_memory=True` in the dataloader.
   - Avoid CPU->GPU device transfers (such as `.item()`, `.numpy()`, `.cpu()` calls on tensors, or logging to disk); a combined sketch follows the next note.
2. `get_batch first` means that we first read all frames through the video reader and then take the target slice of frames, which largely slows down data loading; `sample_Indice first` instead samples the target frame indices first and only decodes those frames.
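To make these notes concrete, here is a rough, illustrative sketch (the dataset and loader code below is ours, not the repo's actual implementation) of sampling frame indices first with decord, plus a dataloader and transfer setup that follows the speed-up tips.

```python
import torch
from torch.utils.data import Dataset, DataLoader
from decord import VideoReader, cpu

class ClipDataset(Dataset):
    """Illustrative dataset: sample the target frame indices first, then decode
    only those frames with decord, instead of reading every frame and slicing."""
    def __init__(self, video_paths, num_frames=16, frame_interval=4):
        self.video_paths = video_paths
        self.num_frames = num_frames
        self.frame_interval = frame_interval

    def __len__(self):
        return len(self.video_paths)

    def __getitem__(self, idx):
        vr = VideoReader(self.video_paths[idx], ctx=cpu(0))
        total = len(vr)
        span = self.num_frames * self.frame_interval
        start = torch.randint(0, max(1, total - span + 1), (1,)).item()
        indices = [min(start + i * self.frame_interval, total - 1)
                   for i in range(self.num_frames)]
        frames = vr.get_batch(indices).asnumpy()                      # (T, H, W, C)
        return torch.from_numpy(frames).permute(3, 0, 1, 2).float()  # (C, T, H, W)

# pin_memory=True speeds up host-to-device copies.
loader = DataLoader(ClipDataset(["/path/to/video.mp4"]), batch_size=8,
                    num_workers=4, pin_memory=True)

for clip in loader:
    # Keep the transfer asynchronous and avoid .item()/.numpy()/.cpu() inside the loop.
    clip = clip.cuda(non_blocking=True)
```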
Acknowledgement
This repo is built on top of PyTorch-Lightning, pytorchvideo, skimage, decord and kornia. I also learned many code designs from MMaction2. I thank the authors for releasing their code.
Contribution
I look forward to seeing people share ideas about this repo. Please feel free to report them in an issue, or even better, submit a pull request.
And your star is my motivation, thank u~