
yalesong / Pvse

License: MIT
Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval (CVPR 2019)

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Pvse

symmetrical-synthesis
Official Tensorflow implementation of "Symmetrical Synthesis for Deep Metric Learning" (AAAI 2020)
Stars: ✭ 67 (-27.96%)
Mutual labels:  metric-learning
Deep Metric Learning Baselines
PyTorch Implementation for Deep Metric Learning Pipelines
Stars: ✭ 442 (+375.27%)
Mutual labels:  metric-learning
Open Ucn
The first fully convolutional metric learning for geometric/semantic image correspondences.
Stars: ✭ 60 (-35.48%)
Mutual labels:  metric-learning
Powerful Benchmarker
A PyTorch library for benchmarking deep metric learning. It's powerful.
Stars: ✭ 272 (+192.47%)
Mutual labels:  metric-learning
Hardnet
Hardnet descriptor model - "Working hard to know your neighbor's margins: Local descriptor learning loss"
Stars: ✭ 350 (+276.34%)
Mutual labels:  metric-learning
Additive Margin Softmax
This is the implementation of paper <Additive Margin Softmax for Face Verification>
Stars: ✭ 464 (+398.92%)
Mutual labels:  metric-learning
advrank
Adversarial Ranking Attack and Defense, ECCV, 2020.
Stars: ✭ 19 (-79.57%)
Mutual labels:  metric-learning
Mvgcn
Multi-View Graph Convolutional Network and Its Applications on Neuroimage Analysis for Parkinson's Disease (AMIA 2018)
Stars: ✭ 81 (-12.9%)
Mutual labels:  metric-learning
Survey of deep metric learning
A comprehensive survey of deep metric learning and related works
Stars: ✭ 406 (+336.56%)
Mutual labels:  metric-learning
Hcn Prototypeloss Pytorch
Hierarchical Co-occurrence Network with Prototype Loss for Few-shot Learning (PyTorch)
Stars: ✭ 17 (-81.72%)
Mutual labels:  metric-learning
Pytorch Metric Learning
The easiest way to use deep metric learning in your application. Modular, flexible, and extensible. Written in PyTorch.
Stars: ✭ 3,936 (+4132.26%)
Mutual labels:  metric-learning
Voxceleb trainer
In defence of metric learning for speaker recognition
Stars: ✭ 316 (+239.78%)
Mutual labels:  metric-learning
Humpback Whale Identification 1st
https://www.kaggle.com/c/humpback-whale-identification
Stars: ✭ 591 (+535.48%)
Mutual labels:  metric-learning
Rkd
Official pytorch Implementation of Relational Knowledge Distillation, CVPR 2019
Stars: ✭ 257 (+176.34%)
Mutual labels:  metric-learning
Metric Learn
Metric learning algorithms in Python
Stars: ✭ 1,125 (+1109.68%)
Mutual labels:  metric-learning
disent
🧶 Modular VAE disentanglement framework for python built with PyTorch Lightning ▸ Including metrics and datasets ▸ With strongly supervised, weakly supervised and unsupervised methods ▸ Easily configured and run with Hydra config ▸ Inspired by disentanglement_lib
Stars: ✭ 41 (-55.91%)
Mutual labels:  metric-learning
Amsoftmax
A simple yet effective loss function for face verification.
Stars: ✭ 443 (+376.34%)
Mutual labels:  metric-learning
Pointglr
Global-Local Bidirectional Reasoning for Unsupervised Representation Learning of 3D Point Clouds (CVPR 2020)
Stars: ✭ 86 (-7.53%)
Mutual labels:  metric-learning
Open Reid
Open source person re-identification library in python
Stars: ✭ 1,144 (+1130.11%)
Mutual labels:  metric-learning
Prototypical Networks
Code for the NeurIPS 2017 Paper "Prototypical Networks for Few-shot Learning"
Stars: ✭ 705 (+658.06%)
Mutual labels:  metric-learning

Polysemous Visual-Semantic Embedding (PVSE)

This repository contains a PyTorch implementation of the PVSE network and the MRW dataset proposed in our paper Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval (CVPR 2019). The code and data are free to use for academic purposes only.

Please also visit our project page.

Table of contents

  1. MRW Dataset
  2. Setting up an environment
  3. Download and prepare data
  4. Evaluate pretrained models
  5. Train your own model

MRW Dataset

Our My Reaction When (MRW) dataset contains 50,107 video-sentence pairs crawled from social media, where each video displays a physical or emotional reaction to the situation described in its sentence. The subreddit /r/reactiongifs contains many such examples; some representative pairs are shown below:

(a) Physical Reaction: "MRW a witty comment I wanted to make was already said"
(b) Emotional Reaction: "MFW I see a cute girl on Facebook change her status to single"
(c) Animal Reaction: "MFW I cant remember if I've locked my front door"
(d) Lexical Reaction: "MRW a family member askes me why his computer isn't working"

The table below shows descriptive statistics of the dataset. The word vocabulary size is 34,835. The dataset can be used for evaluating cross-modal retrieval systems under ambiguous/weak association between vision and language.

                      Train     Validation   Test      Total
#pairs                44,107    1,000        5,000     50,107
Avg. #frames          104.91    209.04       209.55    117.43
Avg. #words           11.36     15.02        14.79     11.78
Avg. word frequency   15.48     4.80         8.57      16.94

We provide detailed analysis of the dataset in the supplementary material of the main paper.

Follow the instructions below to download the dataset.

Setting up an environment

We recommend creating a virtual environment and installing packages there. Note that you must install the Cython package first.

python3 -m venv <your virtual environment name>
source <your virtual environment name>/bin/activate
pip3 install Cython
pip3 install -r requirements.txt

Download and prepare data

MRW

cd data
bash prepare_mrw_dataset.sh

This will download the dataset (without videos) in a JSON format, a vocabulary file, and train/val/test splits. It will then prompt an option:

Do you wish to download video data and gulp them? [y/n]

We provide two ways to obtain the data. The recommended option is to download pre-compiled data in the GulpIO binary storage format, which contains video frames sampled at 8 FPS. For this, simply hit n (this will terminate the script) and download our pre-compiled GulpIO data from this link (54 GB). After the download finishes, extract the tarball under data/mrw/gulp to train and/or test our models.
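
As a rough sketch, extraction could look like the following (mrw-gulp.tar.gz is a hypothetical filename; use whatever the download link provides, and adjust the target path if the archive layout differs):

mkdir -p data/mrw/gulp
tar -xzf mrw-gulp.tar.gz -C data/mrw/gulp    # hypothetical filename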

If you wish to download raw video clips and gulp them on your own, hit y once prompted with the message above. This will start downloading videos and, once finished, start gulping the video files at 8 FPS (you can change this in download_gulp_mrw.py). If you encounter any problem downloading the video files, you may also download them directly from this link (19 GB), and then continue gulping them using the script download_gulp_mrw.py.

TGIF

cd data
bash prepare_tgif_dataset.sh

This will download the dataset (without videos) in a TSV format, a vocabulary file, and train/val/test splits. Please note that we use a slightly modified version of the TGIF dataset because some of the original video files are invalid; the script will automatically download the modified version.

It will then prompt an option:

Do you wish to gulp the data? [y/n]

Similar to the MRW data, we provide two options to obtain the data: (1) download our pre-compiled GulpIO data, or (2) download raw video clips and gulp them on your own. We recommend the first option for an easy start. For this, simply hit n and download our pre-compiled GulpIO data from this link (89 GB). After tgif-gulp.tar.gz finishes downloading, extract the tarball under data/tgif/gulp.

If you wish to gulp the data on your own, hit y and follow the prompts. Note that you must first download a tarball containing the videos before gulping: download tgif.tar.gz (124 GB) from this link and place it under ./data/tgif. Once the video data is in place, the script will start gulping the video files.
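
For reference, a minimal sketch of option (2), using the filenames and paths mentioned above (adjust if your layout differs):

mv tgif.tar.gz data/tgif/        # raw video tarball from the link above, placed where the script expects it
cd data
bash prepare_tgif_dataset.sh     # answer y at the gulp prompt to start gulping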

MS-COCO

cd data
bash prepare_coco_dataset.sh

Evaluate pretrained models

Download all six pretrained models in a single tarball from this link, or download each individual file using the links below.
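
The evaluation commands below expect the checkpoints under ./ckpt. Assuming the tarball is named pvse_pretrained.tar.gz (a hypothetical name; use whatever the link provides), setup might look like:

mkdir -p ckpt
tar -xzf pvse_pretrained.tar.gz -C ckpt    # hypothetical filename
ls ckpt    # should list coco_pvse_k1.pth, coco_pvse.pth, mrw_pvse_k1.pth, mrw_pvse.pth, tgif_pvse_k1.pth, tgif_pvse.pth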

Dataset Model Command
COCO PVSE (k=1) [download] python3 eval.py --data_name coco --num_embeds 1 --img_attention --txt_attention --legacy --ckpt ./ckpt/coco_pvse_k1.pth
COCO PVSE [download] python3 eval.py --data_name coco --num_embeds 2 --img_attention --txt_attention --legacy --ckpt ./ckpt/coco_pvse.pth
MRW PVSE (k=1) [download] python3 eval.py --data_name mrw --num_embeds 1 --img_attention --txt_attention --max_video_length 4 --legacy --ckpt ./ckpt/mrw_pvse_k1.pth
MRW PVSE [download] python3 eval.py --data_name mrw --num_embeds 5 --img_attention --txt_attention --max_video_length 4 --legacy --ckpt ./ckpt/mrw_pvse.pth
TGIF PVSE (k=1) [download] python3 eval.py --data_name tgif --num_embeds 1 --img_attention --txt_attention --max_video_length 8 --legacy --ckpt ./ckpt/tgif_pvse_k1.pth
TGIF PVSE [download] python3 eval.py --data_name tgif --num_embeds 3 --img_attention --txt_attention --max_video_length 8 --legacy --ckpt ./ckpt/tgif_pvse.pth

Using the pretrained models, you should be able to reproduce the results in the table below.

Dataset   Model        Image/Video-to-Text                 Text-to-Image/Video
                       R@1 / R@5 / R@10 / Med r (nMR)      R@1 / R@5 / R@10 / Med r (nMR)
COCO 1K   PVSE (K=1)   66.72 / 91.00 / 96.22 / 1 (0.00)    53.49 / 85.14 / 92.70 / 1 (0.00)
COCO 1K   PVSE         69.24 / 91.62 / 96.64 / 1 (0.00)    55.21 / 86.50 / 93.73 / 1 (0.00)
COCO 5K   PVSE (K=1)   41.72 / 72.96 / 82.90 / 2 (0.00)    30.64 / 61.37 / 73.62 / 3 (0.00)
COCO 5K   PVSE         45.18 / 74.28 / 84.46 / 2 (0.00)    32.42 / 62.97 / 74.96 / 3 (0.00)
MRW       PVSE (K=1)   0.16 / 0.68 / 0.90 / 1700 (0.34)    0.16 / 0.56 / 0.88 / 1650 (0.33)
MRW       PVSE         0.18 / 0.62 / 1.18 / 1624 (0.32)    0.20 / 0.70 / 1.16 / 1552 (0.31)
TGIF      PVSE (K=1)   2.82 / 9.07 / 14.02 / 128 (0.01)    2.63 / 9.37 / 14.58 / 115 (0.01)
TGIF      PVSE         3.28 / 9.87 / 15.56 / 115 (0.01)    3.01 / 9.70 / 14.85 / 109 (0.01)

Train your own model

You can train your own model using train.py; check option.py for all available options.
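
Assuming option.py exposes these options through a standard argparse parser (an assumption, not verified here), you can list them all with:

python3 train.py --help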

For example, you can train our PVSE model (k=2) on COCO using the command below. It uses ResNet-152 as the backbone CNN, GloVe word embeddings, an MMD loss weight of 0.01, a DIV loss weight of 0.1, and a batch size of 256:

python3 train.py --data_name coco --cnn_type resnet152 --wemb_type glove --margin 0.1 --max_violation --num_embeds 2 --img_attention --txt_attention --mmd_weight 0.01 --div_weight 0.1 --batch_size 256

For video models, you should set the parameter --max_video_length; otherwise it defaults to 1 (single frame). Here's an example command:

python3 train.py --data_name mrw --max_video_length 4 --cnn_type resnet18 --wemb_type glove --margin 0.1 --num_embeds 4 --img_attention --txt_attention --mmd_weight 0.01 --div_weight 0.1 --batch_size 128
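
Once training finishes, you can evaluate the resulting checkpoint with eval.py, passing the same model flags used for training. A sketch for the MRW example above (the checkpoint path is illustrative and depends on where train.py saves it):

python3 eval.py --data_name mrw --num_embeds 4 --img_attention --txt_attention --max_video_length 4 --ckpt <path to your checkpoint>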

If you use any of the material in this repository, we ask you to cite:

@inproceedings{song-pvse-cvpr19,
  author    = {Yale Song and Mohammad Soleymani},
  title     = {Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval},
  booktitle = {CVPR},
  year      = 2019
}

Our code is based on the implementation by Faghri et al.

