
mx-mark / VideoTransformer-pytorch

Licence: other
PyTorch implementation of a collection of scalable Video Transformer Benchmarks.

Programming Languages

Python: 139,335 projects (#7 most used programming language)
Jupyter Notebook: 11,667 projects

Projects that are alternatives of or similar to VideoTransformer-pytorch

ConSSL
PyTorch Implementation of SOTA SSL methods
Stars: ✭ 61 (-61.64%)
Mutual labels:  pytorch-implmention, pytorch-lightning
TadTR
End-to-end Temporal Action Detection with Transformer. [Under review for a journal publication]
Stars: ✭ 55 (-65.41%)
Mutual labels:  transformer, action-recognition
Pytorch Seq2seq
Tutorials on implementing a few sequence-to-sequence (seq2seq) models with PyTorch and TorchText.
Stars: ✭ 3,418 (+2049.69%)
Mutual labels:  transformer, pytorch-implmention
Neural-Machine-Translation
Several basic neural machine translation models implemented by PyTorch & TensorFlow
Stars: ✭ 29 (-81.76%)
Mutual labels:  transformer, pytorch-implmention
TianChi AIEarth
TianChi AIEarth Contest Solution
Stars: ✭ 57 (-64.15%)
Mutual labels:  transformer, timesformer
Fast-AgingGAN
A deep learning model to age faces in the wild, currently runs at 60+ fps on GPUs
Stars: ✭ 133 (-16.35%)
Mutual labels:  pytorch-lightning
CharLM
Character-aware Neural Language Model implemented by PyTorch
Stars: ✭ 32 (-79.87%)
Mutual labels:  pytorch-implmention
Cross-lingual-Summarization
Zero-Shot Cross-Lingual Abstractive Sentence Summarization through Teaching Generation and Attention
Stars: ✭ 28 (-82.39%)
Mutual labels:  transformer
weakly-action-localization
No description or website provided.
Stars: ✭ 30 (-81.13%)
Mutual labels:  action-recognition
C3D-tensorflow
Action recognition with C3D network implemented in tensorflow
Stars: ✭ 34 (-78.62%)
Mutual labels:  action-recognition
TokenLabeling
Pytorch implementation of "All Tokens Matter: Token Labeling for Training Better Vision Transformers"
Stars: ✭ 385 (+142.14%)
Mutual labels:  transformer
cape
Continuous Augmented Positional Embeddings (CAPE) implementation for PyTorch
Stars: ✭ 29 (-81.76%)
Mutual labels:  transformer
Variational-Transformer
Variational Transformers for Diverse Response Generation
Stars: ✭ 79 (-50.31%)
Mutual labels:  transformer
libai
LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
Stars: ✭ 284 (+78.62%)
Mutual labels:  transformer
bLVNet-TAM
The official code for the NeurIPS 2019 paper: Quanfu Fan, Richard Chen, Hilde Kuehne, Marco Pistoia, David Cox, "More Is Less: Learning Efficient Video Representations by Temporal Aggregation Modules"
Stars: ✭ 54 (-66.04%)
Mutual labels:  action-recognition
project-code-py
Leetcode using AI
Stars: ✭ 100 (-37.11%)
Mutual labels:  transformer
sparql-transformer
A more handy way to use SPARQL data in your web app
Stars: ✭ 38 (-76.1%)
Mutual labels:  transformer
MiCT-Net-PyTorch
Video Recognition using Mixed Convolutional Tube (MiCT) on PyTorch with a ResNet backbone
Stars: ✭ 48 (-69.81%)
Mutual labels:  action-recognition
sister
SImple SenTence EmbeddeR
Stars: ✭ 66 (-58.49%)
Mutual labels:  transformer
TransBTS
This repo provides the official code for : 1) TransBTS: Multimodal Brain Tumor Segmentation Using Transformer (https://arxiv.org/abs/2103.04430) , accepted by MICCAI2021. 2) TransBTSV2: Towards Better and More Efficient Volumetric Segmentation of Medical Images(https://arxiv.org/abs/2201.12785).
Stars: ✭ 254 (+59.75%)
Mutual labels:  transformer

PyTorch implementation of Video Transformer Benchmarks

This repository is built mainly upon PyTorch and PyTorch Lightning. We aim to maintain a collection of scalable video transformer benchmarks and to discuss training recipes for large video transformer models.

So far, we have implemented TimeSformer, ViViT and MaskFeat, and we have pre-trained TimeSformer-B, ViViT-B and MaskFeat on Kinetics-400/600. We still cannot guarantee the performance reported in the original papers, but we have identified some relevant hyper-parameters that may help reach the target performance.

Update

  1. We have fixed several known issues, and scripts are now available to pretrain MViT-B with MaskFeat or finetune MViT-B/TimeSformer-B/ViViT-B on K400.
  2. We have re-implemented the HOG extraction and HOG prediction used in MaskFeat, which makes pretraining more efficient (a minimal sketch of per-patch HOG targets follows this list).
  3. Note that if you want to train TimeSformer-B or ViViT-B with the current repo, you need to carefully adjust the learning rate and weight decay for better performance. For example, you can choose 0.005 as the peak learning rate and 0.0001 as the weight decay by default.
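
As a rough illustration of what the MaskFeat targets look like, below is a minimal sketch of computing per-patch HOG features with skimage (one of this repo's dependencies). The helper name, the 16x16 patch size and the per-channel concatenation are illustrative assumptions; the repo's actual implementation may differ.

# Minimal sketch of per-patch HOG targets for MaskFeat-style pretraining.
# The 16x16 patch size, per-channel concatenation and helper name are
# assumptions for illustration only.
import numpy as np
from skimage.feature import hog

def patch_hog_targets(frame, patch_size=16):
    """frame: (H, W, 3) uint8 RGB frame -> (num_patches, hog_dim) targets."""
    h, w, c = frame.shape
    targets = []
    for top in range(0, h - h % patch_size, patch_size):
        for left in range(0, w - w % patch_size, patch_size):
            patch = frame[top:top + patch_size, left:left + patch_size]
            # HOG is computed per colour channel and the features are concatenated.
            feat = np.concatenate([
                hog(patch[..., ch], orientations=9,
                    pixels_per_cell=(8, 8), cells_per_block=(1, 1),
                    feature_vector=True)
                for ch in range(c)
            ])
            targets.append(feat)
    return np.stack(targets)  # rows of masked patches serve as regression targets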

Table of Contents

  1. Difference
  2. TODO
  3. Setup
  4. Usage
  5. Result
  6. Acknowledge
  7. Contribution

Difference

In order to share the basic divided spatial-temporal attention module across different video transformers, we make changes in the following aspects.

1. Position embedding

We split the position embedding R^(n_t·n_h·n_w × d) used in the ViViT paper into a spatial embedding R^(n_h·n_w × d) and a temporal embedding R^(n_t × d), to stay consistent with TimeSformer.
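
For clarity, here is a minimal sketch of the split position embedding in PyTorch, assuming patch tokens shaped (batch, frames, patches, dim); the module and parameter names are illustrative rather than the ones used in this repo.

import torch
import torch.nn as nn

class SplitPositionEmbedding(nn.Module):
    """Illustrative module: separate spatial and temporal position embeddings."""

    def __init__(self, num_patches_per_frame, num_frames, embed_dim):
        super().__init__()
        # spatial embedding in R^(n_h*n_w x d), shared by every frame
        self.pos_embed_spatial = nn.Parameter(
            torch.zeros(1, num_patches_per_frame, embed_dim))
        # temporal embedding in R^(n_t x d), shared by every spatial location
        self.pos_embed_temporal = nn.Parameter(
            torch.zeros(1, num_frames, embed_dim))

    def forward(self, x):
        # x: (B, T, N, D) patch tokens
        x = x + self.pos_embed_spatial.unsqueeze(1)   # broadcast over frames
        x = x + self.pos_embed_temporal.unsqueeze(2)  # broadcast over patches
        return x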

2. Class token

To make it explicit whether the class_token joins the module's forward computation, we only compute the interaction between the class_token and the other tokens in the last attention layer (i.e. the layer right before the FFN) of each transformer block.
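
The simplified sketch below illustrates this design for a divided space-time block (temporal attention, then spatial attention, then FFN); layer norms and the per-frame/per-location reshaping are omitted, and the module is hypothetical rather than the repo's actual implementation.

import torch
import torch.nn as nn

class DividedSTBlock(nn.Module):
    """Hypothetical sketch: the class_token only joins the last attention
    (spatial) before the FFN of the block."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, cls_token, patch_tokens):
        # cls_token: (B, 1, D); patch_tokens: (B, T*N, D)
        # 1) temporal attention on patch tokens only; the class_token is untouched
        patch_tokens = patch_tokens + self.temporal_attn(
            patch_tokens, patch_tokens, patch_tokens)[0]
        # 2) spatial attention, the last attention of the block: include the class_token
        tokens = torch.cat([cls_token, patch_tokens], dim=1)
        tokens = tokens + self.spatial_attn(tokens, tokens, tokens)[0]
        # 3) FFN on all tokens
        tokens = tokens + self.ffn(tokens)
        return tokens[:, :1], tokens[:, 1:]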

3. Initialize from the pre-trained model

  • Tokenization: the token-embedding filter can be either Conv2D or Conv3D. When initializing Conv3D filters from pre-trained Conv2D weights, the 2D weights can either be replicated along the temporal dimension and averaged, or placed at the temporal centre t/2 with zeros at every other temporal position (see the sketch after this list).
  • Temporal MSA module weights: one can either copy the weights from the spatial MSA module or initialize all weights with zeros.
  • Initialization from the MAE pre-trained model provided by ZhiLiang: the class_token, which does not appear in the MAE pre-trained model, is initialized from a truncated normal distribution.
  • Initialization from the ViT pre-trained model can be found here.
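
The sketch below illustrates the two tokenization initializations and the two temporal-MSA options described above; the helper names are hypothetical and the repo's actual loading code may differ.

import torch

def inflate_patch_embed_2d_to_3d(weight_2d, kt, mode="average"):
    """Hypothetical helper: build Conv3D patch-embedding weights from
    pre-trained Conv2D weights.
    weight_2d: (out_c, in_c, kh, kw) -> returns (out_c, in_c, kt, kh, kw).
    'average' replicates the 2D filter along time and divides by kt;
    'center'  places the 2D filter at t = kt // 2 and zeros elsewhere."""
    if mode == "average":
        weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, kt, 1, 1) / kt
    else:  # "center"
        weight_3d = torch.zeros(*weight_2d.shape[:2], kt, *weight_2d.shape[2:])
        weight_3d[:, :, kt // 2] = weight_2d
    return weight_3d

def init_temporal_msa(temporal_msa, spatial_msa, zero_init=False):
    """Hypothetical helper: copy the spatial MSA weights into the temporal MSA,
    or zero-initialize the temporal MSA so it starts as a no-op branch."""
    if zero_init:
        for p in temporal_msa.parameters():
            torch.nn.init.zeros_(p)
    else:
        temporal_msa.load_state_dict(spatial_msa.state_dict())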

TODO

  • [√] add pre-trained weights for more TimeSformer and ViViT variants.
    • A larger version and other operation types.
  • [√] add linear probe and finetune recipes.
    • Make it possible to transfer the pre-trained models to downstream tasks.
  • add more scalable Video Transformer benchmarks.
    • We will mainly focus on the data-efficient models.
  • add more robust objective functions.

Setup

pip install -r requirements.txt

Usage

Training

# path to Kinetics400 train set and val set
TRAIN_DATA_PATH='/path/to/Kinetics400/train_list.txt'
VAL_DATA_PATH='/path/to/Kinetics400/val_list.txt'
# path to root directory
ROOT_DIR='/path/to/work_space'
# path to pretrain weights
PRETRAIN_WEIGHTS='/path/to/weights'

# pretrain mvit using maskfeat
python model_pretrain.py \
	-lr 8e-4 -epoch 300 -batch_size 16 -num_workers 8 -frame_interval 4 -num_frames 16 -num_class 400 \
	-root_dir $ROOT_DIR -train_data_path $TRAIN_DATA_PATH

# finetune mvit with maskfeat pretrain weights
python model_pretrain.py \
	-lr 0.005 -epoch 200 -batch_size 8 -num_workers 4 -num_frames 16 -frame_interval 4 -num_class 400 \
	-arch 'mvit' -optim_type 'adamw' -lr_schedule 'cosine' -objective 'supervised' -mixup True \
	-auto_augment 'rand_aug' -root_dir $ROOT_DIR -train_data_path $TRAIN_DATA_PATH \
	-val_data_path $VAL_DATA_PATH -pretrain_pth $PRETRAIN_WEIGHTS

# finetune timesformer with imagenet pretrain weights
python model_pretrain.py \
	-lr 0.005 -epoch 30 -batch_size 8 -num_workers 4 -num_frames 8 -frame_interval 32 -num_class 400 \
	-arch 'timesformer' -attention_type 'divided_space_time' -optim_type 'sgd' -lr_schedule 'cosine' \
	-objective 'supervised' -root_dir $ROOT_DIR -train_data_path $TRAIN_DATA_PATH \
	-val_data_path $VAL_DATA_PATH -pretrain_pth $PRETRAIN_WEIGHTS -weights_from 'imagenet'

# finetune vivit with imagenet pretrain weights
python model_pretrain.py \
	-lr 0.005 -epoch 30 -batch_size 8 -num_workers 4 -num_frames 16 -frame_interval 16 -num_class 400 \
	-arch 'vivit' -attention_type 'fact_encoder' -optim_type 'sgd' -lr_schedule 'cosine' \
	-objective 'supervised' -root_dir $ROOT_DIR -train_data_path $TRAIN_DATA_PATH \
	-val_data_path $VAL_DATA_PATH -pretrain_pth $PRETRAIN_WEIGHTS -weights_from 'imagenet'

The minimal folder structure will look as below.

root_dir
├── results
│   ├── experiment_tag
│   │   ├── ckpt
│   │   ├── log

Result

Kinetics-400/600

1. Model Zoo

| name | weights from | dataset | epochs | num frames | spatial crop | top1_acc | top5_acc | weight | log |
|------|--------------|---------|--------|------------|--------------|----------|----------|--------|-----|
| TimeSformer-B | ImageNet-21K | K600 | 15e | 8 | 224 | 78.4 | 93.6 | Google drive or BaiduYun (code: yr4j) | log |
| ViViT-B | ImageNet-21K | K400 | 30e | 16 | 224 | 75.2 | 91.5 | Google drive | - |
| MaskFeat | from scratch | K400 | 100e | 16 | 224 | - | - | Google drive | - |

1.1 Visualize

For each column, we show the masked input (left), the HOG predictions (middle) and the original video frame (right).

Here, we show the extracted attention map of a random frame sampled from the demo video.


2. Train Recipe(ablation study)

2.1 Acc

| operation | top1_acc | top5_acc | top1_acc (three crop) |
|-----------|----------|----------|-----------------------|
| base | 68.2 | 87.6 | - |
| + frame_interval 4 -> 16 (span more time) | 72.9 (+4.7) | 91.0 (+3.4) | - |
| + RandomCrop, flip (overcome overfit) | 75.7 (+2.8) | 92.5 (+1.5) | - |
| + batch size 16 -> 8 (more iterations) | 75.8 (+0.1) | 92.4 (-0.1) | - |
| + frame_interval 16 -> 24 (span more time) | 77.7 (+1.9) | 93.3 (+0.9) | 78.4 |
| + frame_interval 24 -> 32 (span more time) | 78.4 (+0.7) | 94.0 (+0.7) | 79.1 |

Tips: frame_interval and data augmentation matter most for validation accuracy.


2.2 Time

| operation | epoch_time |
|-----------|------------|
| base (start with DDP) | 9h+ |
| + speed up training recipes | 1h+ |
| + switch from get_batch first to sample_Indice first | 0.5h |
| + batch size 16 -> 8 | 33.32m |
| + num_workers 8 -> 4 | 35.52m |
| + frame_interval 16 -> 24 | 44.35m |

Tips: increasing frame_interval noticeably increases the epoch time.

1. Speed-up training recipes:

  • More GPU devices.
  • pin_memory=True.
  • Avoid unnecessary GPU->CPU transfers and synchronisations (such as .item(), .numpy(), .cpu() operations on tensors, or logging to disk every step); a minimal sketch follows this list.
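
As a minimal sketch of the pin_memory recipe (the dataset and model objects stand in for the ones used in this repo, and the non_blocking copies are an illustrative addition):

import torch
from torch.utils.data import DataLoader

# `dataset` and `model` stand in for the Kinetics dataset and video model
# objects used in this repo; batch_size/num_workers mirror the finetune scripts.
loader = DataLoader(dataset, batch_size=8, num_workers=4, pin_memory=True)

running_loss = torch.zeros(1, device="cuda")
for videos, labels in loader:
    # pinned host memory + non_blocking lets the copy overlap with compute
    videos = videos.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    loss = model(videos, labels)
    loss.backward()
    # keep the loss on the GPU; calling .item() every step forces a sync
    running_loss += loss.detach()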

2. get_batch first means that we first read all frames through the video reader and only then take the target slice of frames, which largely slows down data loading; sample_Indice first instead samples the target frame indices before decoding, so only the needed frames are read (a minimal sketch follows).
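
A minimal sketch of the sample-indices-first approach using decord (one of this repo's dependencies); the helper name and the centre-clip sampling are illustrative choices:

import decord

def load_clip(video_path, num_frames=16, frame_interval=4):
    """Hypothetical helper: sample the target frame indices first, then decode
    only those frames, instead of decoding the whole video and slicing it."""
    vr = decord.VideoReader(video_path)
    span = num_frames * frame_interval
    start = max(0, (len(vr) - span) // 2)  # centre clip; illustrative choice
    indices = [min(start + i * frame_interval, len(vr) - 1)
               for i in range(num_frames)]
    return vr.get_batch(indices).asnumpy()  # (T, H, W, C) uint8 frames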


Acknowledge

This repo is built on top of Pytorch-Lightning, pytorchvideo, skimage, decord and kornia. I also learned many code designs from MMaction2. I thank the authors for releasing their code.

Contribution

I look forward to seeing others contribute ideas to this repo. Please feel free to report them in an issue, or even better, submit a pull request.

And your star is my motivation, thank u~
