ViRB

ViRB is a framework for evaluating the quality of representations learned by visual encoders on a variety of downstream tasks. It is the codebase used by the paper Contrasting Contrastive Self-Supervised Representation Learning Pipelines. Because it is a tool for evaluating learned representations, it freezes the encoder weights and trains only a small end task network on the latent representations of each task's train set, then evaluates it on that task's test set. To speed this up, the train and test sets are pre-encoded for most end tasks and stored in GPU memory for efficient reuse. Fine-tuning the encoder is also supported but takes significantly more time. ViRB is implemented entirely in PyTorch and automatically scales to as many GPUs as are available on your machine. It supports evaluating any PyTorch model architecture on a select subset of tasks.
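As a rough illustration of this frozen-encoder protocol, here is a minimal sketch (not ViRB's actual API; the encoder, data loader, and linear head are generic stand-ins): pre-encode the data once, then train only a small head on the cached features.

```python
# Minimal sketch of frozen-encoder evaluation, assuming a generic encoder that
# maps images to feature vectors; this is not ViRB's internal implementation.
import torch
import torch.nn as nn

@torch.no_grad()
def encode_dataset(encoder, loader, device="cuda"):
    encoder.eval().to(device)
    feats, labels = [], []
    for x, y in loader:
        feats.append(encoder(x.to(device)))
        labels.append(y.to(device))
    # Keep the cached representations (e.g. in GPU memory) so the encoder
    # never has to run again during end task training.
    return torch.cat(feats), torch.cat(labels)

def train_linear_head(feats, labels, num_classes, epochs=100, lr=1e-4):
    head = nn.Linear(feats.shape[1], num_classes).to(feats.device)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(head(feats), labels)
        loss.backward()
        opt.step()
    return head
```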

Installation

To install the codebase, clone this repository from GitHub and install the requirements:

git clone https://github.com/klemenkotar/ViRB
cd ViRB
pip install -r requirements.txt

Quick Start

As a quick-start example, we will train an end task network on the simple CalTech classification task using the SWAV 800 encoder.

First we need to download the encoder:

mkdir pretrained_weights
wget https://prior-model-weights.s3.us-east-2.amazonaws.com/contrastive_encoders/SWAV_800.pt 
mv SWAV_800.pt pretrained_weights/

Then we need to download the CalTech dataset (see the Dataset Download table below for the link). After extracting it, you should have a directory named 101_ObjectCategories. Rename it to data/caltech/.

Now we are ready to start the training run with the following command:

python main.py --experiment_list=configs/experiment_lists/swav.yaml --virb_configs=configs/virb_configs/caltech.yaml

The codebase will automatically use a GPU if one is available on the machine. The progress will be printed on the screen along with an ETA for completion.

Live TensorBoard logs can be accessed by running the following command:

tensorboard --logdir=out

Once training is complete, the task head model and a results JSON file will be stored in the out/ directory.
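If you want to inspect the results programmatically, a minimal sketch is shown below; it only assumes that the run writes JSON files somewhere under out/, since the exact filenames are not specified here.

```python
# Minimal sketch: walk out/ and print any results JSON files a run produced.
# The directory layout and filenames are assumptions, not documented ViRB paths.
import glob
import json

for path in glob.glob("out/**/*.json", recursive=True):
    with open(path) as f:
        print(path, json.load(f))
```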

Dataset Download

To run the full suite of end tasks, we need to download all the associated datasets. All datasets should be stored in a folder called data/ inside the root project directory. Below is a table of links where the data can be downloaded and the names of the directories they should be placed in.

Due to the complexity and diversity of dataset licensing, we provide four types of links: Data, a direct link to a compressed file that can be downloaded from the internet; Website, a link to a website with instructions for downloading the data; JSON, a link to a supplementary JSON file that adds metadata on top of another existing dataset; and txt, a link to a list of resources that need to be downloaded.

| Dataset Name | Dataset Size | Directory | Download Link | Size | Note |
| --- | --- | --- | --- | --- | --- |
| ImageNet Cls. | 1,281,167 | data/imagenet/ | Website | 126.2 GB | |
| Pets Cls. | 3,680 | data/pets/ | Data | 0.82 GB | |
| CalTech Cls. | 3,060 | data/caltech-101/ | Data | 0.14 GB | |
| CIFAR-100 Cls. | 50,000 | data/cifar-100/ | Data | 0.19 GB | |
| SUN Scene Cls. | 87,003 | data/SUN397/ | Data | 38.0 GB | |
| Eurosat Cls. | 21,600 | data/eurosat/ | Data | 0.1 GB | |
| dtd Cls. | 3,760 | data/dtd/ | Data | 0.63 GB | |
| Kinetics Action Pred. | 50,000 | data/kinetics400/ | Website | 0.63 GB | |
| CLEVR Count | 70,000 | data/CLEVR/ | Data | 20.0 GB | |
| THOR Num. Steps | 60,000 | data/thor_num_steps/ | Data | 0.66 GB | |
| THOR Egomotion | 60,000 | data/thor_action_prediction/ | Data | 1.3 GB | |
| nuScenes Egomotion | 28,000 | data/nuScenes/ | Website, JSON, JSON | 53.43 GB | Download samples and sweeps |
| Cityscapes Seg. | 3,475 | data/cityscapes/ | Website | 61.89 GB | |
| Pets Instance Seg. | 3,680 | data/pets/ | Data, Masks | 0.82 GB | |
| EgoHands Seg. | 4,800 | data/egohands/ | Data | 1.35 GB | |
| THOR Depth | 60,000 | data/thor_depth_prediction/ | Data | 0.25 GB | |
| Taskonomy Depth | 39,995 | data/taskonomy/ | Link, txt | 48.09 GB | Download the rgb and depth_zbuffer data for the scenes listed in txt |
| NYU Depth | 1,159 | data/nyu/ | Data | 5.62 GB | Same data as NYU Walkable |
| NYU Walkable | 1,159 | data/nyu/ | Data | 5.62 GB | Same data as NYU Depth |
| KITTI Opt. Flow | 200 | data/KITTI/ | Data | 1.68 GB | |

Pre-trained Models

As part of our paper, we trained several new encoders using a combination of training algorithms and datasets. Below is a table containing the download links for the weights. The weights are stored in the standard PyTorch format. To work with this codebase, the models should be downloaded into a directory called pretrained_weights/ inside the root project directory.
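A quick, hedged way to sanity-check a download is simply to load it with torch.load and peek at the contents; what exactly is inside (a plain state dict versus a wrapped dictionary) may differ between encoders. The SWAV_800.pt path below matches the Quick Start example.

```python
# Hedged sketch: load a downloaded checkpoint and inspect what it contains.
import torch

ckpt = torch.load("pretrained_weights/SWAV_800.pt", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt)[:5])  # first few keys
```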

| Encoder Name | Method | Dataset | Dataset Size | Number of Epochs | Link |
| --- | --- | --- | --- | --- | --- |
| SwAV ImageNet 100 | SwAV | ImageNet | 1.3M | 100 | Link |
| SwAV ImageNet 50 | SwAV | ImageNet | 1.3M | 50 | Link |
| SwAV Half ImageNet 200 | SwAV | ImageNet-1/2 | 0.5M | 200 | Link |
| SwAV Half ImageNet 100 | SwAV | ImageNet-1/2 | 0.5M | 100 | Link |
| SwAV Quarter ImageNet 200 | SwAV | ImageNet-1/4 | 0.25M | 200 | Link |
| SwAV Linear Unbalanced ImageNet 200 | SwAV | ImageNet-1/2-Lin | 0.5M | 200 | Link |
| SwAV Linear Unbalanced ImageNet 100 | SwAV | ImageNet-1/2-Lin | 0.5M | 100 | Link |
| SwAV Log Unbalanced ImageNet 200 | SwAV | ImageNet-1/4-Log | 0.25M | 200 | Link |
| SwAV Places 200 | SwAV | Places | 1.3M | 200 | Link |
| SwAV Kinetics 200 | SwAV | Kinetics | 1.3M | 200 | Link |
| SwAV Taskonomy 200 | SwAV | Taskonomy | 1.3M | 200 | Link |
| SwAV Combination 200 | SwAV | Combination | 1.3M | 200 | Link |
| MoCov2 ImageNet 100 | MoCov2 | ImageNet | 1.3M | 100 | Link |
| MoCov2 ImageNet 50 | MoCov2 | ImageNet | 1.3M | 50 | Link |
| MoCov2 Half ImageNet 200 | MoCov2 | ImageNet-1/2 | 0.5M | 200 | Link |
| MoCov2 Half ImageNet 100 | MoCov2 | ImageNet-1/2 | 0.5M | 100 | Link |
| MoCov2 Quarter ImageNet 200 | MoCov2 | ImageNet-1/4 | 0.25M | 200 | Link |
| MoCov2 Linear Unbalanced ImageNet 200 | MoCov2 | ImageNet-1/2-Lin | 0.5M | 200 | Link |
| MoCov2 Linear Unbalanced ImageNet 100 | MoCov2 | ImageNet-1/2-Lin | 0.5M | 100 | Link |
| MoCov2 Log Unbalanced ImageNet 200 | MoCov2 | ImageNet-1/4-Log | 0.25M | 200 | Link |
| MoCov2 Places 200 | MoCov2 | Places | 1.3M | 200 | Link |
| MoCov2 Kinetics 200 | MoCov2 | Kinetics | 1.3M | 200 | Link |
| MoCov2 Taskonomy 200 | MoCov2 | Taskonomy | 1.3M | 200 | Link |
| MoCov2 Combination 200 | MoCov2 | Combination | 1.3M | 200 | Link |

We also used some models trained by third-party authors. Below is a table of download links for their models and the scripts used to convert the weights from their format to the ViRB format. All of the conversion scripts have the same usage: <SCRIPT_NAME> <DOWNLOADED_WEIGHT_FILE> <DESIRED_VIRB_FORMAT_OUTPUT_PATH>.
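The conversion scripts differ per source format, but as a hedged sketch, the general shape of such a conversion is to load the third-party checkpoint, unwrap and rename its keys, and save a plain state dict. The "state_dict" wrapper key and the "module.encoder." prefix below are illustrative only, not taken from the actual scripts.

```python
# Hedged sketch of a generic weight conversion: <SCRIPT> <SRC_WEIGHTS> <DST_PATH>.
# The wrapper key and key prefix used here are illustrative assumptions.
import sys
import torch

src, dst = sys.argv[1], sys.argv[2]
ckpt = torch.load(src, map_location="cpu")
state_dict = ckpt["state_dict"] if "state_dict" in ckpt else ckpt
converted = {k.replace("module.encoder.", ""): v for k, v in state_dict.items()}
torch.save(converted, dst)
```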

| Encoder Name | Method | Dataset | Dataset Size | Number of Epochs | Link | Conversion Script |
| --- | --- | --- | --- | --- | --- | --- |
| SwAV ImageNet 800 | SwAV | ImageNet | 1.3M | 800 | Link | scripts/swav_to_virb.py |
| SwAV ImageNet 200 | SwAV | ImageNet | 1.3M | 200 | Link | scripts/swav_to_virb.py |
| MoCov1 ImageNet 200 | MoCov1 | ImageNet | 1.3M | 200 | Link | scripts/moco_to_virb.py |
| MoCov2 ImageNet 800 | MoCov2 | ImageNet | 1.3M | 800 | Link | scripts/moco_to_virb.py |
| MoCov2 ImageNet 200 | MoCov2 | ImageNet | 1.3M | 200 | Link | scripts/moco_to_virb.py |
| PIRL ImageNet 800 | PIRL | ImageNet | 1.3M | 800 | Link | scripts/pirl_to_virb.py |

End Task Training

ViRB supports 20 end tasks, which are classified as Image-level or Pixelwise depending on the output modality of the task. Furthermore, each task is classified as either semantic or structural. Below is an illustration of the space of our tasks. For further details, please see Contrasting Contrastive Self-Supervised Representation Learning Pipelines.

[Figure: Tasks, an illustration of the space of end tasks, grouped as Image-level vs. Pixelwise and semantic vs. structural.]

After installing the codebase and downloading the datasets and pre-trained models, we are ready to run our experiments. To reproduce every experiment in the paper, run:

python main.py --experiment_list=configs/experiment_lists/all.yaml --virb_configs=configs/virb_configs/all.yaml

WARNING: this will take well over 1000 GPU hours to train, so we suggest training a subset instead. The results of all these training runs are summarized in the figure below.

[Figure: Results] Correlation of end task performances with ImageNet classification accuracy. The plots show the end task performance against the ImageNet top-1 accuracy for all end tasks and encoders. Each point represents a different encoder, trained with a different algorithm and dataset. This reveals the lack of a strong correlation between performance on ImageNet classification and performance on tasks from other categories.

To specify which task we want to train, we create a virb_config YAML file that defines the task name and training configuration. The file configs/virb_configs/all.yaml contains configurations for every task supported by this package, so it is a good starting point. To train only a few tasks, comment out the other configurations.

To specify which weights we want to use, we provide an experiment list file. The file configs/experiment_lists/all.yaml lists all the model weights provided by this repository. To evaluate only a few models, comment out the other entries. Alternatively, we can add new weights to the list; all we have to do is make sure the weights are for a ResNet50 model stored as a standard PyTorch weight file (see the sketch below).
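For example, here is a minimal sketch of exporting your own ResNet50 weights as such a file; the file name and list entry name are made up for illustration and are not part of ViRB.

```python
# Hedged sketch: save a ResNet50 state dict as a standard PyTorch weight file.
# The output path below is an illustrative choice, not a required ViRB path.
import torch
import torchvision

model = torchvision.models.resnet50()  # load or train your own weights here
torch.save(model.state_dict(), "pretrained_weights/my_resnet50.pt")
```

The corresponding experiment list entry would then be MyResNet50: 'pretrained_weights/my_resnet50.pt'.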

Training a SWAV Encoder on the ImageNet End Task

To train a model using the SWAV encoder on the ImageNet classification end task, download the ImageNet dataset from the link in the Dataset Download table above and the SwAV ImageNet 800 model from the Pre-trained Models table above.

Then create a new file inside configs/virb_configs/ that contains just the ImageNet configuration:

Imagenet:
  task: "Imagenet"
  training_configs:
    adam-0.0001:
      optimizer: "adam"
      lr: 0.0001
  num_epochs: 100
  batch_size: 32

Then create a new file inside configs/experiment_lists/ that contains just the SWAV model:

SWAV_800: 'pretrained_weights/SWAV_800.pt'

Now run this configuration with the following command:

python main.py --experiment_list=configs/experiment_lists/EXPERIMENT_LIST_FILE_NAME.yaml --virb_configs=configs/virb_configs/VIRB_CONFIG_FILE_NAME.yaml

Hyperparameter Search

One feature offered by this codebase is the ability to train the end task networks with several sets of optimizers, schedulers and hyperparameters. For the Image-level tasks (which are encodable), the dataset is encoded only once and a separate model is then trained for each set of hyperparameters, which makes the search efficient.

An example of a grid search configuration can be found in configs/virb_configs/imagenet_grid_search.yaml, and it looks like this:

Imagenet:
  task: "Imagenet"
  training_configs:
    adam-0.0001:
      optimizer: "adam"
      lr: 0.0001
    adam-0.001:
      optimizer: "adam"
      lr: 0.001
    sgd-0.01-StepLR:
      optimizer: "sgd"
      lr: 0.01
      scheduler:
        type: "StepLR"
        step_size: 50
        gamma: 0.1
    sgd-0.01-OneCycle:
      optimizer: "sgd"
      lr: 0.01
      scheduler:
        type: "OneCycle"
    sgd-0.01-Poly:
      optimizer: "sgd"
      lr: 0.001
      scheduler:
        type: "Poly"
        exponent: 0.9
  num_epochs: 100
  batch_size: 32

We specify each training config as a YAML object. The "sgd" and "adam" optimizers are supported, as well as the "StepLR", "OneCycle" and "Poly" schedulers from PyTorch's optim package. All schedulers are compatible with all optimizers.
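As a rough guide to how these config names correspond to PyTorch objects, here is a hedged sketch (not ViRB's actual builder code; in particular, the Poly variant is implemented here with LambdaLR, since a built-in polynomial scheduler only exists in newer PyTorch versions):

```python
# Hedged sketch: map a training_configs entry onto torch.optim objects.
# build_optimizer/build_scheduler are illustrative helpers, not ViRB API.
import torch

def build_optimizer(params, cfg):
    if cfg["optimizer"] == "adam":
        return torch.optim.Adam(params, lr=cfg["lr"])
    return torch.optim.SGD(params, lr=cfg["lr"], momentum=0.9)

def build_scheduler(optimizer, cfg, num_epochs, steps_per_epoch):
    sched = cfg.get("scheduler")
    if sched is None:
        return None
    if sched["type"] == "StepLR":
        return torch.optim.lr_scheduler.StepLR(
            optimizer, step_size=sched["step_size"], gamma=sched["gamma"])
    if sched["type"] == "OneCycle":
        return torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=cfg["lr"],
            epochs=num_epochs, steps_per_epoch=steps_per_epoch)
    if sched["type"] == "Poly":
        # Polynomial decay expressed via LambdaLR for portability.
        exponent = sched.get("exponent", 0.9)
        return torch.optim.lr_scheduler.LambdaLR(
            optimizer, lambda epoch: (1 - epoch / num_epochs) ** exponent)
    raise ValueError(f"Unknown scheduler type: {sched['type']}")
```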

To execute this ImageNet grid search run:

python main.py --experiment_list=configs/experiment_lists/swav.yaml --virb_configs=configs/virb_configs/imagenet_grid_search.yaml

Testing Only Datasets

One additional feature this codebase supports is datasets that are "eval only" and use a task head trained on a different task. The only currently supported example is ImageNet v2. To test the SWAV 800 model on ImageNet v2, first train at least one ImageNet end task head on SWAV 800, then run the following command:

python main.py --experiment_list=configs/experiment_lists/swav.yaml --virb_configs=configs/virb_configs/imagenetv2.yaml

Custom Models

All the encoders in the tutorials thus far have used the ResNet50 architecture, but we also support using custom encoders.

All of the Image-level tasks require that the encoder output a dictionary with the key "embedding" mapping to a PyTorch tensor of size NxD, where N is the batch size and D is an arbitrary embedding size.

All of the Pixelwise tasks require that the encoder output a dictionary with a tensor for the representation after every block. In practice this means the model needs to output 5 tensors with sizes corresponding to the outputs of a ResNet50's conv, block1, block2, block3 and block4 layers.

To use a custom model, simply modify main.py by replacing ResNet50Encoder with any encoder that produces the outputs described above (a sketch is shown below).
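As a hedged sketch of the Image-level output contract described above (the architecture is a toy stand-in, and the exact dictionary keys expected by the Pixelwise tasks should be checked against ResNet50Encoder in the repository):

```python
# Toy custom encoder satisfying the Image-level contract: a dictionary with an
# "embedding" tensor of shape NxD. Pixelwise tasks would additionally need
# per-block feature maps; see ResNet50Encoder for the exact keys they expect.
import torch
import torch.nn as nn

class MyEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, x):
        features = self.backbone(x).flatten(1)   # N x 64
        return {"embedding": self.fc(features)}  # N x D
```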

Citation

@inproceedings{kotar2021contrasting,
  title={Contrasting Contrastive Self-Supervised Representation Learning Pipelines},
  author={Klemen Kotar and Gabriel Ilharco and Ludwig Schmidt and Kiana Ehsani and Roozbeh Mottaghi},
  booktitle={ICCV},  
  year={2021},
}