samyak0210 / ViNet

License: MIT
ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction

Programming Languages

Python
139335 projects - #7 most used programming language
C++
36643 projects - #6 most used programming language
MATLAB
3953 projects
Shell
77523 projects
Java
68154 projects - #9 most used programming language
Makefile
30231 projects

Projects that are alternatives to or similar to ViNet

CheXpert-Challenge
Code for CheXpert Challenge 2019 Top 1 && Top 2 solution
Stars: ✭ 30 (-16.67%)
Mutual labels:  pytorch-implementation
onn
Online Deep Learning: Learning Deep Neural Networks on the Fly / Non-linear Contextual Bandit Algorithm (ONN_THS)
Stars: ✭ 139 (+286.11%)
Mutual labels:  pytorch-implementation
cosine-ood-detector
Hyperparameter-Free Out-of-Distribution Detection Using Softmax of Scaled Cosine Similarity
Stars: ✭ 30 (-16.67%)
Mutual labels:  pytorch-implementation
3D-UNet-PyTorch-Implementation
The implementation of 3D-UNet using PyTorch
Stars: ✭ 78 (+116.67%)
Mutual labels:  pytorch-implementation
NGCF-PyTorch
PyTorch Implementation for Neural Graph Collaborative Filtering
Stars: ✭ 200 (+455.56%)
Mutual labels:  pytorch-implementation
smartImgProcess
A hand-implemented intelligent image processing system with basic image processing functions, various filters, the seam carving algorithm, and smart object removal driven by fine-grained semantic segmentation.
Stars: ✭ 22 (-38.89%)
Mutual labels:  saliency
ResUNetPlusPlus-with-CRF-and-TTA
ResUNet++, CRF, and TTA for segmentation of medical images (IEEE JBIHI)
Stars: ✭ 98 (+172.22%)
Mutual labels:  pytorch-implementation
Generative MLZSL
[TPAMI Under Submission] Generative Multi-Label Zero-Shot Learning
Stars: ✭ 37 (+2.78%)
Mutual labels:  pytorch-implementation
svae cf
[ WSDM '19 ] Sequential Variational Autoencoders for Collaborative Filtering
Stars: ✭ 38 (+5.56%)
Mutual labels:  pytorch-implementation
semi-supervised-paper-implementation
Reproduce some methods in semi-supervised papers.
Stars: ✭ 35 (-2.78%)
Mutual labels:  pytorch-implementation
MolDQN-pytorch
A PyTorch Implementation of "Optimization of Molecules via Deep Reinforcement Learning".
Stars: ✭ 58 (+61.11%)
Mutual labels:  pytorch-implementation
Representation-Learning-for-Information-Extraction
Pytorch implementation of Paper by Google Research - Representation Learning for Information Extraction from Form-like Documents.
Stars: ✭ 82 (+127.78%)
Mutual labels:  pytorch-implementation
MobileHumanPose
This repo is official PyTorch implementation of MobileHumanPose: Toward real-time 3D human pose estimation in mobile devices(CVPRW 2021).
Stars: ✭ 206 (+472.22%)
Mutual labels:  pytorch-implementation
openpose-pytorch
🔥 OpenPose api wrapper in PyTorch.
Stars: ✭ 52 (+44.44%)
Mutual labels:  pytorch-implementation
deep-blueberry
If you've always wanted to learn about deep-learning but don't know where to start, then you might have stumbled upon the right place!
Stars: ✭ 17 (-52.78%)
Mutual labels:  pytorch-implementation
neuro-symbolic-ai-soc
Neuro-Symbolic Visual Question Answering on Sort-of-CLEVR using PyTorch
Stars: ✭ 41 (+13.89%)
Mutual labels:  pytorch-implementation
Text-Classification-LSTMs-PyTorch
The aim of this repository is to show a baseline model for text classification by implementing a LSTM-based model coded in PyTorch. In order to provide a better understanding of the model, it will be used a Tweets dataset provided by Kaggle.
Stars: ✭ 45 (+25%)
Mutual labels:  pytorch-implementation
salt iccv2017
SALT (ICCV 2017) based video denoising code, Matlab implementation
Stars: ✭ 26 (-27.78%)
Mutual labels:  state-of-the-art
pytorch-gans
PyTorch implementation of GANs (Generative Adversarial Networks). DCGAN, Pix2Pix, CycleGAN, SRGAN
Stars: ✭ 21 (-41.67%)
Mutual labels:  pytorch-implementation
RandLA-Net-pytorch
🍀 Pytorch Implementation of RandLA-Net (https://arxiv.org/abs/1911.11236)
Stars: ✭ 69 (+91.67%)
Mutual labels:  pytorch-implementation

ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction

This repository contains the PyTorch implementation of ViNet and AViNet.

Cite

Please cite using the following BibTeX entry:

@misc{jain2021vinet,
      title={ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction}, 
      author={Samyak Jain and Pradeep Yarlagadda and Shreyank Jyoti and Shyamgopal Karthik and Ramanathan Subramanian and Vineet Gandhi},
      year={2021},
      eprint={2012.06170},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Abstract

We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real-time (60 fps). ViNet does not use audio as input and still outperforms the state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual datasets). ViNet also surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset, and to our knowledge, it is the first network to do so. We also explore a variation of ViNet architecture by augmenting audio features into the decoder. To our surprise, upon sufficient training, the network becomes agnostic to the input audio and provides the same output irrespective of the input. Interestingly, we also observe similar behaviour in the previous state-of-the-art models (STAViS; Tsiami et al., 2020) for audio-visual saliency prediction. Our findings contrast with previous works on deep learning-based audio-visual saliency prediction, suggesting a clear avenue for future explorations incorporating audio in a more effective manner.
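
For intuition, here is a minimal, hypothetical PyTorch sketch of one decoder stage in the spirit described above (trilinear upsampling followed by a 3D convolution); the class name, channel sizes, and scale factors are illustrative assumptions, not the repository's actual code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    # Hypothetical decoder stage: trilinear upsampling followed by a 3D convolution.
    # If a skip feature map is concatenated, in_channels must include its channels.
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x, skip=None):
        # x: (batch, channels, time, height, width); upsample 2x in space only.
        x = F.interpolate(x, scale_factor=(1, 2, 2), mode='trilinear', align_corners=False)
        if skip is not None:
            # Combine features from an earlier encoder hierarchy.
            x = torch.cat([x, skip], dim=1)
        return F.relu(self.conv(x))

stage = DecoderStage(in_channels=1024, out_channels=512)
out = stage(torch.randn(1, 1024, 8, 14, 24))  # -> (1, 512, 8, 28, 48)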

Examples

We compare our model ViNet with UNISAL (the previous state-of-the-art). Below are some examples from our model: the first section shows the original video, the second the ground truth, the third our model's prediction, and the last UNISAL's prediction.

Architecture

Dataset

  • DHF1K and the UCF Sports dataset can be downloaded from this link.
  • The Hollywood-2 dataset can be downloaded from this link.
  • The six audio-visual datasets - DIEM, AVAD, Coutrot-1&2, SumMe and ETMD - can be downloaded from this link. You can also run the following command to fetch all six datasets and their components:
$ bash fetch_data.sh

Testing

Clone this repository and download the pretrained weights of AViNet and ViNet on multiple datasets from this link.

  • ViNet

Run the code using

$ python3 generate_result.py --path_indata path/to/test/frames --save_path path/to/results --file_weight path/to/saved/models

This will generate saliency maps for all frames in the directory and dump them into the results directory. The directory structure should be:

└── Dataset  
    ├── Video-Number  
        ├── images  
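
For example, assuming the test frames live under ./DHF1K/val and the downloaded ViNet weights are saved as ./ViNet_DHF1K.pt (both paths are illustrative), the invocation would be:

$ python3 generate_result.py --path_indata ./DHF1K/val --save_path ./results --file_weight ./ViNet_DHF1K.pt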
  • AViNet

Run the code using

$ python3 generate_result_audio_visual.py --path_indata path/to/test/frames --save_path path/to/results --file_weight path/to/saved/models --use_sound True --split <split_number>
<split_number>: {1,2,3}

This will generate saliency maps for all frames in the directory and dump them into the results directory. The directory structure should be:

└── Dataset  
    ├── video_frames  
        ├── <dataset_name>
            ├── Video-Name
                ├── frames
    ├── video_audio  
        ├── <dataset_name>
            ├── Video-Name
                ├── audio  
    ├── fold_lists
        ├── <dataset_file>.txt

The fold_lists directory contains text files listing video names and their corresponding fps for the various splits. The directory structure is the same as the one generated by the fetch_data.sh script.
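
As a minimal sketch, assuming each fold-list line holds a video name and its fps separated by whitespace (the exact layout is defined by fetch_data.sh and may differ), the files can be read as:

def read_fold_list(path):
    # Assumed line format: '<video_name> <fps>'; adjust if the actual
    # fold-list layout produced by fetch_data.sh differs.
    entries = []
    with open(path) as f:
        for line in f:
            name, fps = line.split()
            entries.append((name, float(fps)))
    return entries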

Training

For training the model from scratch, download the pretrained S3D weights from here and place them in the same directory. Run the following command to train:

$ python3 train.py --train_path_data path/to/train/dataset --val_path_data  path/to/val/dataset --dataset <dataset_name> --use_sound <boolean_value>
<dataset_name> : {"DHF1KDataset", "SoundDataset", "Hollywood", "UCF"} 
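
For example, to train the audio-visual model on the six audio-visual datasets (paths are illustrative):

$ python3 train.py --train_path_data ./data/train --val_path_data ./data/val --dataset SoundDataset --use_sound True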

For ViNet, the dataset directory structure should be:

└── Dataset  
    ├── Video-Number  
        ├── images  
        ├── maps
        └── fixations  

For AViNet, the dataset directory structure should be:

└── Dataset  
    ├── video_frames  
        ├── <dataset_name>
            ├── Video-Name
                ├── frames
    ├── video_audio  
        ├── <dataset_name>
            ├── Video-Name
                ├── audio
    ├── annotations
        ├── <dataset_name>
            ├── Video-Name
                ├── <frame_id>.mat (fixations)
                ├── maps
                    ├── <frame_id>.jpg (ground truth saliency map)
    ├── fold_lists
        ├── <dataset_file>.txt
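
A minimal sketch for loading one annotation pair from this layout follows; the key under which fixations are stored inside the .mat file is dataset-specific, so treat it as an assumption.

import numpy as np
from PIL import Image
from scipy.io import loadmat

def load_annotation(fixation_path, map_path):
    # Fixation data from the <frame_id>.mat file; the internal key
    # names are dataset-specific assumptions.
    fixations = loadmat(fixation_path)
    # Ground-truth saliency map stored as a grayscale <frame_id>.jpg.
    saliency_map = np.asarray(Image.open(map_path).convert('L'), dtype=np.float32) / 255.0
    return fixations, saliency_map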

To train ViNet on the Hollywood-2 or UCF-Sports dataset, first train the model on DHF1K (or directly use our DHF1K-trained model) and then fine-tune the weights on the target dataset.

Similarly, to train AViNet on the DIEM, AVAD, Coutrot-1&2, ETMD, or SumMe datasets, first load the DHF1K-trained weights and then fine-tune them on the target dataset.
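
A minimal PyTorch sketch of this fine-tuning recipe, assuming a checkpoint file named ViNet_DHF1K.pt that holds a plain state dict (both are illustrative assumptions):

import torch

def load_pretrained_for_finetune(model, weight_path='ViNet_DHF1K.pt', lr=1e-5):
    # Load DHF1K-trained weights; strict=False tolerates layers that
    # differ between the pretraining and fine-tuning configurations.
    state = torch.load(weight_path, map_location='cpu')
    model.load_state_dict(state, strict=False)
    # Fine-tune with a smaller learning rate than from-scratch training.
    return torch.optim.Adam(model.parameters(), lr=lr)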

Experiments

  • Audio

For training, we provide an argument to select between ViNet (the visual-only network) and AViNet (the audio-visual network). Run:

$ python3 train.py --use_sound <boolean_value> 

If you want to save the generated saliency maps, run:

$ python3 generate_result_audio_visual.py --use_sound <boolean_value> --file_weight <path/to/file> --path_indata <path/to/data> 
  • Multiple Audio-Visual Fusion

You can select the model for the desired fusion technique in the model.py file. By default, it uses bilinear concatenation to fuse the audio and visual cues. If you want the transformer-based fusion technique instead, instantiate the model named VideoAudioSaliencyTransformerFusionModel.
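
To illustrate the default fusion, here is a hypothetical sketch of fusing pooled visual and audio feature vectors with a bilinear layer; the feature dimensions are illustrative, not the repository's actual values.

import torch
import torch.nn as nn

class BilinearFusion(nn.Module):
    # Fuses pooled visual and audio feature vectors via a bilinear map.
    def __init__(self, visual_dim=1024, audio_dim=128, fused_dim=512):
        super().__init__()
        self.bilinear = nn.Bilinear(visual_dim, audio_dim, fused_dim)

    def forward(self, visual_feat, audio_feat):
        # visual_feat: (N, visual_dim), audio_feat: (N, audio_dim)
        return torch.relu(self.bilinear(visual_feat, audio_feat))

fusion = BilinearFusion()
fused = fusion(torch.randn(2, 1024), torch.randn(2, 128))  # -> (2, 512)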

Quantitative Results

  • DHF1K

The results of our models on the DHF1K test set can be viewed here under the name ViNet, along with a comparison against other state-of-the-art saliency prediction models.

  • Hollywood2

  • UCF

  • DIEM

We provide results for both of our models, ViNet and AViNet, on the DIEM dataset.

Contact

If you have any questions, please contact [email protected] or [email protected], or use the public issues section of this repository.

License

This code is distributed under the MIT License.
