raoyongming / DynamicViT

License: MIT
[NeurIPS 2021] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Programming Languages: Jupyter Notebook, Python


Efficient Vision Transformers and CNNs with Dynamic Spatial Sparsification

This repository contains PyTorch implementation for DynamicViT (NeurIPS 2021).

DynamicViT is a dynamic token sparsification framework that prunes redundant tokens in vision transformers progressively and dynamically based on the input. Our method reduces FLOPs by over 30% and improves throughput by over 40%, while keeping the accuracy drop within 0.5% for various vision transformers.
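To make the idea concrete, below is a minimal sketch of input-dependent token pruning: a lightweight head scores each token and only the top-scoring fraction is passed on. This is an illustration only, not the repository's implementation (which trains prediction modules end-to-end with Gumbel-softmax and attention masking); the `TokenPruner` name and the `keep_ratio` default are hypothetical.

```python
# Illustration of input-dependent token pruning (hypothetical, simplified).
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    """Scores tokens with a small head and keeps the top keep_ratio fraction."""
    def __init__(self, dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim); token 0 is the class token and is always kept
        cls_tok, patches = x[:, :1], x[:, 1:]
        scores = self.score(patches).squeeze(-1)           # (batch, num_patches)
        num_keep = max(1, int(patches.shape[1] * self.keep_ratio))
        idx = scores.topk(num_keep, dim=1).indices         # indices of kept tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1])
        kept = patches.gather(1, idx)                      # (batch, num_keep, dim)
        return torch.cat([cls_tok, kept], dim=1)
```

At inference time the pruned tokens are simply discarded, which is where the FLOPs and throughput savings come from.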

[Project Page] [arXiv (NeurIPS 2021)]

🔥Updates

We extend our method to more network architectures (i.e., ConvNeXt and Swin Transformers) and more tasks (i.e., object detection and semantic segmentation) with an improved dynamic spatial sparsification framework. Please refer to the extended version of our paper for details.

[arXiv (Extended Version)]

Image Examples



Video Examples


Model Zoo

We provide our DynamicViT models pretrained on ImageNet:

| name | model | rho | acc@1 | acc@5 | FLOPs | url |
|------|-------|-----|-------|-------|-------|-----|
| DynamicViT-256/0.7 | deit-256 | 0.7 | 76.53 | 93.12 | 1.3G | Google Drive / Tsinghua Cloud |
| DynamicViT-384/0.7 | deit-s | 0.7 | 79.32 | 94.68 | 2.9G | Google Drive / Tsinghua Cloud |
| DynamicViT-LV-S/0.5 | lvvit-s | 0.5 | 81.97 | 95.76 | 3.7G | Google Drive / Tsinghua Cloud |
| DynamicViT-LV-S/0.7 | lvvit-s | 0.7 | 83.08 | 96.25 | 4.6G | Google Drive / Tsinghua Cloud |
| DynamicViT-LV-M/0.7 | lvvit-m | 0.7 | 83.82 | 96.58 | 8.5G | Google Drive / Tsinghua Cloud |

🔥Updates: We provide our DynamicCNN and DynamicSwin models pretrained on ImageNet:

| name | model | rho | acc@1 | acc@5 | FLOPs | url |
|------|-------|-----|-------|-------|-------|-----|
| DynamicCNN-T/0.7 | convnext-t | 0.7 | 81.59 | 95.72 | 3.6G | Google Drive / Tsinghua Cloud |
| DynamicCNN-T/0.9 | convnext-t | 0.9 | 82.06 | 95.89 | 3.9G | Google Drive / Tsinghua Cloud |
| DynamicCNN-S/0.7 | convnext-s | 0.7 | 82.57 | 96.29 | 5.8G | Google Drive / Tsinghua Cloud |
| DynamicCNN-S/0.9 | convnext-s | 0.9 | 83.12 | 96.42 | 6.8G | Google Drive / Tsinghua Cloud |
| DynamicCNN-B/0.7 | convnext-b | 0.7 | 83.45 | 96.56 | 10.2G | Google Drive / Tsinghua Cloud |
| DynamicCNN-B/0.9 | convnext-b | 0.9 | 83.96 | 96.76 | 11.9G | Google Drive / Tsinghua Cloud |
| DynamicSwin-T/0.7 | swin-t | 0.7 | 80.91 | 95.42 | 4.0G | Google Drive / Tsinghua Cloud |
| DynamicSwin-S/0.7 | swin-s | 0.7 | 83.21 | 96.33 | 6.9G | Google Drive / Tsinghua Cloud |
| DynamicSwin-B/0.7 | swin-b | 0.7 | 83.43 | 96.45 | 12.1G | Google Drive / Tsinghua Cloud |
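The rho column is the base keeping ratio. As a rough illustration of the token budget it implies: in the paper, tokens are sparsified hierarchically at three stages, keeping rho, rho², and rho³ of the patch tokens. For a 224×224 DeiT-style model with 14 × 14 = 196 patch tokens and rho = 0.7:

```python
# Back-of-the-envelope token counts under the paper's three-stage schedule.
tokens, rho = 196, 0.7
for stage in range(1, 4):
    print(f"after stage {stage}: ~{int(tokens * rho ** stage)} tokens kept")
# after stage 1: ~137 tokens kept
# after stage 2: ~96 tokens kept
# after stage 3: ~67 tokens kept
```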

Usage

Requirements

- torch>=1.8.0
- torchvision>=0.9.0
- timm==0.3.2
- tensorboardX
- six
- fvcore

Data preparation: download and extract the ImageNet images from http://image-net.org/. The expected directory structure is:

│ILSVRC2012/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
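With this layout in place, a quick sanity check is to load the validation split with torchvision's standard ImageFolder (the path is a placeholder; the repo's own pipeline builds on timm/DeiT-style loaders rather than this exact snippet):

```python
# Minimal check that the directory layout is readable by torchvision's
# ImageFolder (the repo itself uses timm/DeiT-style data loaders).
from torchvision import datasets, transforms

tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
val = datasets.ImageFolder("/path/to/ILSVRC2012/val", transform=tf)
print(len(val), "validation images across", len(val.classes), "classes")
```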

Model preparation: download pre-trained models if necessary:

| model | url | model | url |
|-------|-----|-------|-----|
| DeiT-Small | link | LVViT-S | link |
| DeiT-Base | link | LVViT-M | link |
| ConvNeXt-T | link | Swin-T | link |
| ConvNeXt-S | link | Swin-S | link |
| ConvNeXt-B | link | Swin-B | link |
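Once downloaded, you can sanity-check a checkpoint before training. A small sketch (the `model` key is common for DeiT-style checkpoint files but is an assumption here, as is the file name):

```python
# Sketch: inspect a downloaded checkpoint. DeiT-style releases usually nest
# the weights under a 'model' key; the file name here is a placeholder.
import torch

ckpt = torch.load("/path/to/deit_small.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state_dict)} tensors; first few:")
for name in list(state_dict)[:5]:
    print(f"  {name}: {tuple(state_dict[name].shape)}")
```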

Demo

You can try DynamicViT on Colab. Thanks to @dirtycomputer for the contribution.

We also provide a Jupyter notebook for running the DynamicViT visualization.

To run the demo, you need to install matplotlib.
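For a sense of what the visualization shows, here is a hypothetical sketch in the same spirit: darken the 16×16 image patches whose tokens were pruned, given the indices of kept tokens on the 14×14 token grid (the image and kept-token set below are random stand-ins).

```python
# Hypothetical illustration: grey out the 16x16 patches that were pruned,
# given the indices of kept tokens on a 14x14 token grid.
import numpy as np
import matplotlib.pyplot as plt

img = np.random.rand(224, 224, 3)                      # stand-in for a real image
kept = set(np.random.choice(196, 96, replace=False))   # e.g. ~rho^2 of 196 tokens

vis = img.copy()
for tok in range(196):
    if tok not in kept:
        r, c = divmod(tok, 14)
        vis[r*16:(r+1)*16, c*16:(c+1)*16] *= 0.2       # darken pruned patches
plt.imshow(vis)
plt.axis("off")
plt.show()
```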


Evaluation

To evaluate a pre-trained DynamicViT model on the ImageNet validation set with a single GPU, run:

python infer.py --data_path /path/to/ILSVRC2012/ --model model_name \
--model_path /path/to/model --base_rate 0.7 
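If you want to verify the throughput gains yourself, a common measurement pattern looks like the sketch below (not a script from this repo; it assumes a CUDA device and an already constructed model). Compare the result for a backbone and its DynamicViT counterpart at the same batch size.

```python
# Hedged sketch of a GPU throughput measurement (images per second).
import time
import torch

@torch.no_grad()
def throughput(model, batch_size=64, steps=30, device="cuda"):
    model.eval().to(device)
    x = torch.randn(batch_size, 3, 224, 224, device=device)
    for _ in range(5):                 # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(steps):
        model(x)
    torch.cuda.synchronize()
    return batch_size * steps / (time.time() - t0)
```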

Training

To train Dynamic Spatial Sparsification models on ImageNet, run:

(You can train models with different keeping ratios by adjusting base_rate.)

DeiT-S

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --output_dir logs/dynamicvit_deit-s --model deit-s --input_size 224 --batch_size 128 --data_path /path/to/ILSVRC2012/ --epochs 30 --base_rate 0.7 --lr 1e-3

DeiT-B

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --output_dir logs/dynamicvit_deit-b --model deit-b --input_size 224 --batch_size 128 --data_path /path/to/ILSVRC2012/ --epochs 30 --base_rate 0.7 --lr 1e-3

LV-ViT-S

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --output_dir logs/dynamicvit_lvvit-s --model lvvit-s --input_size 224 --batch_size 128 --data_path /path/to/ILSVRC2012/ --epochs 30 --base_rate 0.7 --lr 1e-3

LV-ViT-M

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --output_dir logs/dynamicvit_lvvit-m --model lvvit-m --input_size 224 --batch_size 128 --data_path /path/to/ILSVRC2012/ --epochs 30 --base_rate 0.7 --lr 1e-3

DynamicViT can also achieve comparable performance with only 15 epochs of training (around 0.1% lower accuracy than with 30 epochs).

ConvNeXt-T

Train on 8 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --output_dir logs/dynamic_conv-t --model convnext-t --input_size 224 --batch_size 128 --data_path /path/to/ILSVRC2012/ --epochs 120 --base_rate 0.7 --lr 4e-3 --drop_path 0.2 --update_freq 4 --lr_scale 0.2

Train on 4 8-GPU nodes:

python run_with_submitit.py --nodes 4 --ngpus 8 --output_dir logs/dynamic_conv-t --model convnext-t --input_size 224 --batch_size 128 --data_path /path/to/ILSVRC2012/ --epochs 120 --base_rate 0.7 --lr 4e-3 --drop_path 0.2 --update_freq 1 --lr_scale 0.2
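Note that the two commands match in effective batch size: update_freq appears to act as gradient accumulation (as in the ConvNeXt codebase this recipe builds on), so the single-node run accumulates 8 GPUs × 128 × 4 = 4,096 images per update, the same as 32 GPUs × 128 × 1 = 4,096 for the four-node run. The same pairing holds for the other architectures below.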

ConvNeXt-S

Train on 8 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --output_dir logs/dynamic_conv-s --model convnext-s --input_size 224 --batch_size 128 --data_path /path/to/ILSVRC2012/ --epochs 120 --base_rate 0.7 --lr 4e-3 --drop_path 0.2 --update_freq 4 --lr_scale 0.2

Train on 4 8-GPU nodes:

python run_with_submitit.py --nodes 4 --ngpus 8 --output_dir logs/dynamic_conv-s --model convnext-s --input_size 224 --batch_size 128 --data_path /path/to/ILSVRC2012/ --epochs 120 --base_rate 0.7 --lr 4e-3 --drop_path 0.2 --update_freq 1 --lr_scale 0.2

ConvNeXt-B

Train on 8 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --output_dir logs/dynamic_conv-b --model convnext-b --input_size 224 --batch_size 128 --data_path /path/to/ILSVRC2012/ --epochs 120 --base_rate 0.7 --lr 4e-3 --drop_path 0.5 --update_freq 4 --lr_scale 0.2

Train on 4 8-GPU nodes:

python run_with_submitit.py --nodes 4 --ngpus 8 --output_dir logs/dynamic_conv-b --model convnext-b --input_size 224 --batch_size 128 --data_path /path/to/ILSVRC2012/ --epochs 120 --base_rate 0.7 --lr 4e-3 --drop_path 0.5 --update_freq 1 --lr_scale 0.2

Swin-T

Train on 8 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --output_dir logs/dynamic_swin-t --model swin-t --input_size 224 --batch_size 128 --data_path /path/to/ILSVRC2012/ --epochs 120 --base_rate 0.7 --lr 4e-3 --drop_path 0.2 --update_freq 4 --lr_scale 0.2

Train on 4 8-GPU nodes:

python run_with_submitit.py --nodes 4 --ngpus 8 --output_dir logs/dynamic_swin-t --model swin-t --input_size 224 --batch_size 128 --data_path /path/to/ILSVRC2012/ --epochs 120 --base_rate 0.7 --lr 4e-3 --drop_path 0.2 --update_freq 1 --lr_scale 0.2

Swin-S

Train on 8 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --output_dir logs/dynamic_swin-s --model swin-s --input_size 224 --batch_size 128 --data_path /path/to/ILSVRC2012/ --epochs 120 --base_rate 0.7 --lr 4e-3 --drop_path 0.2 --update_freq 4 --lr_scale 0.2

Train on 4 8-GPU nodes:

python run_with_submitit.py --nodes 4 --ngpus 8 --output_dir logs/dynamic_swin-s --model swin-s --input_size 224 --batch_size 128 --data_path /path/to/ILSVRC2012/ --epochs 120 --base_rate 0.7 --lr 4e-3 --drop_path 0.2 --update_freq 1 --lr_scale 0.2

Swin-B

Train on 8 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --output_dir logs/dynamic_swin-b --model swin-b --input_size 224 --batch_size 128 --data_path /path/to/ILSVRC2012/ --epochs 120 --base_rate 0.7 --lr 4e-3 --drop_path 0.5 --update_freq 4 --lr_scale 0.2

Train on 4 8-GPU nodes:

python run_with_submitit.py --nodes 4 --ngpus 8 --output_dir logs/dynamic_swin-b --model swin-b --input_size 224 --batch_size 128 --data_path /path/to/ILSVRC2012/ --epochs 120 --base_rate 0.7 --lr 4e-3 --drop_path 0.5 --update_freq 1 --lr_scale 0.2

License

MIT License

Acknowledgements

Our code is based on pytorch-image-models, DeiT, LV-ViT, ConvNeXt and Swin-Transformer.

Citation

If you find our work useful in your research, please consider citing:

@inproceedings{rao2021dynamicvit,
  title={DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification},
  author={Rao, Yongming and Zhao, Wenliang and Liu, Benlin and Lu, Jiwen and Zhou, Jie and Hsieh, Cho-Jui},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year = {2021}
}
@article{rao2022dynamicvit,
  title={Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks},
  author={Rao, Yongming and Liu, Zuyan and Zhao, Wenliang and Zhou, Jie and Lu, Jiwen},
  journal={arXiv preprint arXiv:2207.01580},
  year={2022}
}