ViRB

ViRB is a framework for evaluating the quality of representations learned by visual encoders on a variety of downstream tasks. It is the codebase used by the paper Contrasting Contrastive Self-Supervised Representation Learning Pipelines. Because it is a tool for evaluating learned representations, it freezes the encoder weights and trains only a small end task network on the latent representations of each task's train set, then evaluates it on that task's test set. To speed this up, the train and test sets are pre-encoded for most end tasks and stored in GPU memory for efficient reuse. Fine-tuning the encoder is also supported but takes significantly more time. ViRB is implemented entirely in PyTorch and automatically scales to as many GPUs as are available on your machine. It supports evaluating any PyTorch model architecture on a select subset of tasks.
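As a rough illustration of this frozen-encoder protocol, here is a minimal sketch (not ViRB's actual API; the encoder, data loader, and linear head are generic stand-ins): pre-encode the data once, then train only a small head on the cached features.

```python
# Minimal sketch of frozen-encoder evaluation, assuming a generic encoder that
# maps images to feature vectors; this is not ViRB's internal implementation.
import torch
import torch.nn as nn

@torch.no_grad()
def encode_dataset(encoder, loader, device="cuda"):
    encoder.eval().to(device)
    feats, labels = [], []
    for x, y in loader:
        feats.append(encoder(x.to(device)))
        labels.append(y.to(device))
    # Keep the cached representations (e.g. in GPU memory) so the encoder
    # never has to run again during end task training.
    return torch.cat(feats), torch.cat(labels)

def train_linear_head(feats, labels, num_classes, epochs=100, lr=1e-4):
    head = nn.Linear(feats.shape[1], num_classes).to(feats.device)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(head(feats), labels)
        loss.backward()
        opt.step()
    return head
```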

Installation

To install the codebase, clone this repository from GitHub and install the requirements:

git clone https://github.com/klemenkotar/ViRB
cd ViRB
pip install -r requirements.txt

Quick Start

As a quick-start example, we will train an end task network on the simple CalTech classification task using the SWAV 800 encoder.

First we need to download the encoder:

mkdir pretrained_weights
wget https://prior-model-weights.s3.us-east-2.amazonaws.com/contrastive_encoders/SWAV_800.pt 
mv SWAV_800.pt pretrained_weights/

Then we need to download the CalTech dataset (see the Dataset Download table below for the link). After extracting it, you should have a directory named 101_ObjectCategories. Rename it to data/caltech/.

Now we are ready to start the training run with the following command:

python main.py --experiment_list=configs/experiment_lists/swav.yaml --virb_configs=configs/virb_configs/caltech.yaml

The codebase will automatically use a GPU if one is available on the machine. The progress will be printed on the screen along with an ETA for completion.

Live TensorBoard logs can be accessed by running the following command:

tensorboard --logdir=out

Once training is complete, the task head model and a results JSON file will be stored in the out/ directory.
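If you want to inspect the results programmatically, a minimal sketch is shown below; it only assumes that the run writes JSON files somewhere under out/, since the exact filenames are not specified here.

```python
# Minimal sketch: walk out/ and print any results JSON files a run produced.
# The directory layout and filenames are assumptions, not documented ViRB paths.
import glob
import json

for path in glob.glob("out/**/*.json", recursive=True):
    with open(path) as f:
        print(path, json.load(f))
```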

Dataset Download

To run the full suite of end tasks, we need to download all the associated datasets. All datasets should be stored in a folder called data/ inside the root project directory. Below is a table of links where the data can be downloaded and the names of the directories they should be placed in.

Due to the complexity and diversity of dataset licensing, we provide four types of links: Data, a direct link to a compressed file that can be downloaded from the internet; Website, a link to a website with instructions for downloading the data; JSON, a link to a supplementary JSON file that adds metadata on top of another existing dataset; and txt, a link to a list of resources that need to be downloaded.

| Dataset Name | Dataset Size | Directory | Download Link | Size | Note |
| --- | --- | --- | --- | --- | --- |
| ImageNet Cls. | 1,281,167 | data/imagenet/ | Website | 126.2 GB | |
| Pets Cls. | 3,680 | data/pets/ | Data | 0.82 GB | |
| CalTech Cls. | 3,060 | data/caltech-101/ | Data | 0.14 GB | |
| CIFAR-100 Cls. | 50,000 | data/cifar-100/ | Data | 0.19 GB | |
| SUN Scene Cls. | 87,003 | data/SUN397/ | Data | 38.0 GB | |
| Eurosat Cls. | 21,600 | data/eurosat/ | Data | 0.1 GB | |
| dtd Cls. | 3,760 | data/dtd/ | Data | 0.63 GB | |
| Kinetics Action Pred. | 50,000 | data/kinetics400/ | Website | 0.63 GB | |
| CLEVR Count | 70,000 | data/CLEVR/ | Data | 20.0 GB | |
| THOR Num. Steps | 60,000 | data/thor_num_steps/ | Data | 0.66 GB | |
| THOR Egomotion | 60,000 | data/thor_action_prediction/ | Data | 1.3 GB | |
| nuScenes Egomotion | 28,000 | data/nuScenes/ | Website, JSON, JSON | 53.43 GB | Download samples and sweeps |
| Cityscapes Seg. | 3,475 | data/cityscapes/ | Website | 61.89 GB | |
| Pets Instance Seg. | 3,680 | data/pets/ | Data, Masks | 0.82 GB | |
| EgoHands Seg. | 4,800 | data/egohands/ | Data | 1.35 GB | |
| THOR Depth | 60,000 | data/thor_depth_prediction/ | Data | 0.25 GB | |
| Taskonomy Depth | 39,995 | data/taskonomy/ | Link, txt | 48.09 GB | Download the rgb and depth_zbuffer data for the scenes listed in txt |
| NYU Depth | 1,159 | data/nyu/ | Data | 5.62 GB | Same data as NYU Walkable |
| NYU Walkable | 1,159 | data/nyu/ | Data | 5.62 GB | Same data as NYU Depth |
| KITTI Opt. Flow | 200 | data/KITTI/ | Data | 1.68 GB | |

Pre-trained Models

As part of our paper, we trained several new encoders using a combination of training algorithms and datasets. Below is a table containing the download links for the weights. The weights are stored in the standard PyTorch format. To work with this codebase, the models should be downloaded into a directory called pretrained_weights/ inside the root project directory.
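A quick, hedged way to sanity-check a download is simply to load it with torch.load and peek at the contents; what exactly is inside (a plain state dict versus a wrapped dictionary) may differ between encoders. The SWAV_800.pt path below matches the Quick Start example.

```python
# Hedged sketch: load a downloaded checkpoint and inspect what it contains.
import torch

ckpt = torch.load("pretrained_weights/SWAV_800.pt", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt)[:5])  # first few keys
```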

| Encoder Name | Method | Dataset | Dataset Size | Number of Epochs | Link |
| --- | --- | --- | --- | --- | --- |
| SwAV ImageNet 100 | SwAV | ImageNet | 1.3M | 100 | Link |
| SwAV ImageNet 50 | SwAV | ImageNet | 1.3M | 50 | Link |
| SwAV Half ImageNet 200 | SwAV | ImageNet-1/2 | 0.5M | 200 | Link |
| SwAV Half ImageNet 100 | SwAV | ImageNet-1/2 | 0.5M | 100 | Link |
| SwAV Quarter ImageNet 200 | SwAV | ImageNet-1/4 | 0.25M | 200 | Link |
| SwAV Linear Unbalanced ImageNet 200 | SwAV | ImageNet-1/2-Lin | 0.5M | 200 | Link |
| SwAV Linear Unbalanced ImageNet 100 | SwAV | ImageNet-1/2-Lin | 0.5M | 100 | Link |
| SwAV Log Unbalanced ImageNet 200 | SwAV | ImageNet-1/4-Log | 0.25M | 200 | Link |
| SwAV Places 200 | SwAV | Places | 1.3M | 200 | Link |
| SwAV Kinetics 200 | SwAV | Kinetics | 1.3M | 200 | Link |
| SwAV Taskonomy 200 | SwAV | Taskonomy | 1.3M | 200 | Link |
| SwAV Combination 200 | SwAV | Combination | 1.3M | 200 | Link |
| MoCov2 ImageNet 100 | MoCov2 | ImageNet | 1.3M | 100 | Link |
| MoCov2 ImageNet 50 | MoCov2 | ImageNet | 1.3M | 50 | Link |
| MoCov2 Half ImageNet 200 | MoCov2 | ImageNet-1/2 | 0.5M | 200 | Link |
| MoCov2 Half ImageNet 100 | MoCov2 | ImageNet-1/2 | 0.5M | 100 | Link |
| MoCov2 Quarter ImageNet 200 | MoCov2 | ImageNet-1/4 | 0.25M | 200 | Link |
| MoCov2 Linear Unbalanced ImageNet 200 | MoCov2 | ImageNet-1/2-Lin | 0.5M | 200 | Link |
| MoCov2 Linear Unbalanced ImageNet 100 | MoCov2 | ImageNet-1/2-Lin | 0.5M | 100 | Link |
| MoCov2 Log Unbalanced ImageNet 200 | MoCov2 | ImageNet-1/4-Log | 0.25M | 200 | Link |
| MoCov2 Places 200 | MoCov2 | Places | 1.3M | 200 | Link |
| MoCov2 Kinetics 200 | MoCov2 | Kinetics | 1.3M | 200 | Link |
| MoCov2 Taskonomy 200 | MoCov2 | Taskonomy | 1.3M | 200 | Link |
| MoCov2 Combination 200 | MoCov2 | Combination | 1.3M | 200 | Link |

We also used some models trained by third-party authors. Below is a table of download links for their models and the scripts used to convert the weights from their format to the ViRB format. All of the conversion scripts have the same usage: <SCRIPT_NAME> <DOWNLOADED_WEIGHT_FILE> <DESIRED_VIRB_FORMAT_OUTPUT_PATH>.
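The conversion scripts differ per source format, but as a hedged sketch, the general shape of such a conversion is to load the third-party checkpoint, unwrap and rename its keys, and save a plain state dict. The "state_dict" wrapper key and the "module.encoder." prefix below are illustrative only, not taken from the actual scripts.

```python
# Hedged sketch of a generic weight conversion: <SCRIPT> <SRC_WEIGHTS> <DST_PATH>.
# The wrapper key and key prefix used here are illustrative assumptions.
import sys
import torch

src, dst = sys.argv[1], sys.argv[2]
ckpt = torch.load(src, map_location="cpu")
state_dict = ckpt["state_dict"] if "state_dict" in ckpt else ckpt
converted = {k.replace("module.encoder.", ""): v for k, v in state_dict.items()}
torch.save(converted, dst)
```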

| Encoder Name | Method | Dataset | Dataset Size | Number of Epochs | Link | Conversion Script |
| --- | --- | --- | --- | --- | --- | --- |
| SwAV ImageNet 800 | SwAV | ImageNet | 1.3M | 800 | Link | scripts/swav_to_virb.py |
| SwAV ImageNet 200 | SwAV | ImageNet | 1.3M | 200 | Link | scripts/swav_to_virb.py |
| MoCov1 ImageNet 200 | MoCov1 | ImageNet | 1.3M | 200 | Link | scripts/moco_to_virb.py |
| MoCov2 ImageNet 800 | MoCov2 | ImageNet | 1.3M | 800 | Link | scripts/moco_to_virb.py |
| MoCov2 ImageNet 200 | MoCov2 | ImageNet | 1.3M | 200 | Link | scripts/moco_to_virb.py |
| PIRL ImageNet 800 | PIRL | ImageNet | 1.3M | 800 | Link | scripts/pirl_to_virb.py |

End Task Training

ViRB supports 20 end tasks, which are classified as Image-level or Pixelwise depending on the output modality of the task. Furthermore, each task is classified as either semantic or structural. Below is an illustration of the space of our tasks. For further details, please see Contrasting Contrastive Self-Supervised Representation Learning Pipelines.

[Figure: Tasks, an illustration of the space of end tasks, grouped as Image-level vs. Pixelwise and semantic vs. structural.]

After installing the codebase and downloading the datasets and pre-trained models, we are ready to run our experiments. To reproduce every experiment in the paper, run:

python main.py --experiment_list=configs/experiment_lists/all.yaml --virb_configs=configs/virb_configs/all.yaml

WARNING: this will take well over 1000 GPU hours to train, so we suggest training a subset instead. The results of all these training runs are summarized in the figure below.

[Figure: Results] Correlation of end task performances with ImageNet classification accuracy. The plots show the end task performance against the ImageNet top-1 accuracy for all end tasks and encoders. Each point represents a different encoder, trained with a different algorithm and dataset. This reveals the lack of a strong correlation between performance on ImageNet classification and performance on tasks from other categories.

To specify which task we want to train, we create a virb_config YAML file that defines the task name and training configuration. The file configs/virb_configs/all.yaml contains configurations for every task supported by this package, so it is a good starting point. To train only a few tasks, comment out the other configurations.

To specify which weights we want to use, we provide an experiment list file. The file configs/experiment_lists/all.yaml lists all the model weights provided by this repository. To evaluate only a few models, comment out the other entries. Alternatively, we can add new weights to the list; all we have to do is make sure the weights are for a ResNet50 model stored as a standard PyTorch weight file (see the sketch below).
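For example, here is a minimal sketch of exporting your own ResNet50 weights as such a file; the file name and list entry name are made up for illustration and are not part of ViRB.

```python
# Hedged sketch: save a ResNet50 state dict as a standard PyTorch weight file.
# The output path below is an illustrative choice, not a required ViRB path.
import torch
import torchvision

model = torchvision.models.resnet50()  # load or train your own weights here
torch.save(model.state_dict(), "pretrained_weights/my_resnet50.pt")
```

The corresponding experiment list entry would then be MyResNet50: 'pretrained_weights/my_resnet50.pt'.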

Training a SWAV Encoder on the ImageNet End Task

To train a model using the SWAV encoder on the ImageNet classification end task, download the ImageNet dataset from the link in the Dataset Download table above and the SwAV ImageNet 800 model from the Pre-trained Models table above.

Then create a new file inside configs/virb_configs/ that contains just the ImageNet configuration:

Imagenet:
  task: "Imagenet"
  training_configs:
    adam-0.0001:
      optimizer: "adam"
      lr: 0.0001
  num_epochs: 100
  batch_size: 32

Then create a new file inside configs/experiment_lists/ that contains just the SWAV model:

SWAV_800: 'pretrained_weights/SWAV_800.pt'

Now run this configuration with the following command:

python main.py --experiment_list=configs/experiment_lists/EXPERIMENT_LIST_FILE_NAME.yaml --virb_configs=configs/virb_configs/VIRB_CONFIG_FILE_NAME.yaml

Hyperparameter Search

One feature offered by this codebase is the ability to train the end task networks with several sets of optimizers, schedulers and hyperparameters. For the Image-level tasks (which are encodable), the dataset is encoded only once and a separate model is then trained for each set of hyperparameters, which makes the search efficient.

An example of a grid search configuration can be found in configs/virb_configs/imagenet_grid_search.yaml, and it looks like this:

Imagenet:
  task: "Imagenet"
  training_configs:
    adam-0.0001:
      optimizer: "adam"
      lr: 0.0001
    adam-0.001:
      optimizer: "adam"
      lr: 0.001
    sgd-0.01-StepLR:
      optimizer: "sgd"
      lr: 0.01
      scheduler:
        type: "StepLR"
        step_size: 50
        gamma: 0.1
    sgd-0.01-OneCycle:
      optimizer: "sgd"
      lr: 0.01
      scheduler:
        type: "OneCycle"
    sgd-0.01-Poly:
      optimizer: "sgd"
      lr: 0.001
      scheduler:
        type: "Poly"
        exponent: 0.9
  num_epochs: 100
  batch_size: 32

We specify each training config as a YAML object. The "sgd" and "adam" optimizers are supported, as well as the "StepLR", "OneCycle" and "Poly" schedulers from PyTorch's optim package. All schedulers are compatible with all optimizers.
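As a rough guide to how these config names correspond to PyTorch objects, here is a hedged sketch (not ViRB's actual builder code; in particular, the Poly variant is implemented here with LambdaLR, since a built-in polynomial scheduler only exists in newer PyTorch versions):

```python
# Hedged sketch: map a training_configs entry onto torch.optim objects.
# build_optimizer/build_scheduler are illustrative helpers, not ViRB API.
import torch

def build_optimizer(params, cfg):
    if cfg["optimizer"] == "adam":
        return torch.optim.Adam(params, lr=cfg["lr"])
    return torch.optim.SGD(params, lr=cfg["lr"], momentum=0.9)

def build_scheduler(optimizer, cfg, num_epochs, steps_per_epoch):
    sched = cfg.get("scheduler")
    if sched is None:
        return None
    if sched["type"] == "StepLR":
        return torch.optim.lr_scheduler.StepLR(
            optimizer, step_size=sched["step_size"], gamma=sched["gamma"])
    if sched["type"] == "OneCycle":
        return torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=cfg["lr"],
            epochs=num_epochs, steps_per_epoch=steps_per_epoch)
    if sched["type"] == "Poly":
        # Polynomial decay expressed via LambdaLR for portability.
        exponent = sched.get("exponent", 0.9)
        return torch.optim.lr_scheduler.LambdaLR(
            optimizer, lambda epoch: (1 - epoch / num_epochs) ** exponent)
    raise ValueError(f"Unknown scheduler type: {sched['type']}")
```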

To execute this ImageNet grid search run:

python main.py --experiment_list=configs/experiment_lists/swav.yaml --virb_configs=configs/virb_configs/imagenet_grid_search.yaml

Testing Only Datasets

One additional feature this codebase supports is datasets that are "eval only" and use a task head trained on a different task. The only currently supported example is ImageNet v2. To test the SWAV 800 model on ImageNet v2, first train at least one ImageNet end task head on SWAV 800, then run the following command:

python main.py --experiment_list=configs/experiment_lists/swav.yaml --virb_configs=configs/virb_configs/imagenetv2.yaml

Custom Models

All the encoders in the tutorials thus far have used the ResNet50 architecture, but we also support using custom encoders.

All of the Image-level tasks require that the encoder output a dictionary with the key "embedding" mapping to a PyTorch tensor of size NxD, where N is the batch size and D is an arbitrary embedding size.

All of the Pixelwise tasks require that the encoder output a dictionary with a tensor for the representation after every block. In practice this means the model needs to output 5 tensors with sizes corresponding to the outputs of a ResNet50's conv, block1, block2, block3 and block4 layers.

To use a custom model, simply modify main.py by replacing ResNet50Encoder with any encoder that produces the outputs described above (a sketch is shown below).
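As a hedged sketch of the Image-level output contract described above (the architecture is a toy stand-in, and the exact dictionary keys expected by the Pixelwise tasks should be checked against ResNet50Encoder in the repository):

```python
# Toy custom encoder satisfying the Image-level contract: a dictionary with an
# "embedding" tensor of shape NxD. Pixelwise tasks would additionally need
# per-block feature maps; see ResNet50Encoder for the exact keys they expect.
import torch
import torch.nn as nn

class MyEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, x):
        features = self.backbone(x).flatten(1)   # N x 64
        return {"embedding": self.fc(features)}  # N x D
```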

Citation

@inproceedings{kotar2021contrasting,
  title={Contrasting Contrastive Self-Supervised Representation Learning Pipelines},
  author={Klemen Kotar and Gabriel Ilharco and Ludwig Schmidt and Kiana Ehsani and Roozbeh Mottaghi},
  booktitle={ICCV},  
  year={2021},
}