Listen to Look: Action Recognition by Previewing Audio (CVPR 2020)

[Project Page] [arXiv]


Listen to Look: Action Recognition by Previewing Audio
Ruohan Gao1,2, Tae-Hyun Oh2, Kristen Grauman1,2, Lorenzo Torresani2
1UT Austin, 2Facebook AI Research
In Conference on Computer Vision and Pattern Recognition (CVPR), 2020


If you find our code or project useful in your research, please cite:

@inproceedings{gao2020listentolook,
  title = {Listen to Look: Action Recognition by Previewing Audio},
  author = {Gao, Ruohan and Oh, Tae-Hyun and Grauman, Kristen and Torresani, Lorenzo},
  booktitle = {CVPR},
  year = {2020}
}

Preparation

The image features, audio features, and image-audio features for ActivityNet are shared at this link. After IMGAUD2VID distillation on Kinetics, we fine-tune the image-audio network for action classification on ActivityNet. The image features, audio features, and image-audio features after the fusion layer (see Fig. 2 in the paper) are extracted from this fine-tuned image-audio network. The image-audio model fine-tuned on ActivityNet and the pickle files containing the paths to the image-audio features are also shared. The features can also be downloaded using the commands below:

wget http://dl.fbaipublicfiles.com/rhgao/ListenToLook/image_features.tar.gz
wget http://dl.fbaipublicfiles.com/rhgao/ListenToLook/audio_features.tar.gz
wget http://dl.fbaipublicfiles.com/rhgao/ListenToLook/imageAudio_features.tar.gz
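The archives are standard gzip tarballs; a minimal sketch for unpacking them into a feature root of your choice (the destination path below is a placeholder, not a path the repo prescribes):

```python
import tarfile
import pathlib

# Placeholder destination -- point this at your own feature root.
DEST = pathlib.Path('/your_feature_root_path')
ARCHIVES = ['image_features.tar.gz', 'audio_features.tar.gz',
            'imageAudio_features.tar.gz']

def extract_all(archives, dest):
    """Unpack each downloaded .tar.gz into dest, skipping missing files."""
    dest = pathlib.Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for name in archives:
        if pathlib.Path(name).exists():
            with tarfile.open(name, 'r:gz') as tar:
                tar.extractall(dest)
            extracted.append(name)
    return extracted

# extract_all(ARCHIVES, DEST)  # run after downloading the archives
```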

Training and Testing

(The code has been tested under the following system environment: Ubuntu 18.04.3 LTS, CUDA 10.0, Python 3.7.3, PyTorch 1.0.1)

  1. Download the extracted features and the fine-tuned image-audio model for ActivityNet, and prepare the pickle files accordingly by changing the stored paths to use the correct root prefix on your machine.
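The path fix-up in step 1 can be scripted. A minimal sketch, assuming the pickles hold a flat list of path strings and that OLD_ROOT is the prefix baked into the shared files; inspect your pickle first and adapt if the entries are tuples or dicts:

```python
import pickle

OLD_ROOT = '/original_root_path'      # prefix stored in the shared pickles (assumption)
NEW_ROOT = '/your_feature_root_path'  # root where you extracted the features

def retarget(path):
    """Swap the stored root prefix for your local one."""
    return path.replace(OLD_ROOT, NEW_ROOT, 1)

def retarget_pickle(src, dst):
    """Rewrite every feature path in a pickle and save a local copy."""
    with open(src, 'rb') as f:
        entries = pickle.load(f)
    with open(dst, 'wb') as f:
        pickle.dump([retarget(p) for p in entries], f)

print(retarget('/original_root_path/image_features/v_abc.npy'))
# -> /your_feature_root_path/image_features/v_abc.npy
```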

  2. Use the following command to train the video preview model:

python main.py \
--train_dataset_file '/your_pickle_file_root_path/train.pkl' \
--test_dataset_file '/your_pickle_file_root_path/val.pkl' \
--batch_size 256 \
--warmup_epochs 0 \
--epochs 25 \
--lr 0.01 \
--milestones 15 20 \
--momentum 0.9 \
--decode_threads 10 \
--scheduler \
--num_classes 200 \
--weights_audioImageModel '/your_model_root_path/ImageAudioNet_ActivityNet.pth' \
--checkpoint_freq 10 \
--episode_length 10 \
--checkpoint_path './checkpoints/exp' \
--freeze_imageAudioNet \
--with_avgpool_ce_loss \
--compute_mAP \
--mean_feature_as_start \
--subsample_factor 1 \
--with_replacement |& tee -a logs/exp.log
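For reference, `--lr 0.01 --milestones 15 20 --scheduler` suggests a step-decay learning-rate schedule. A sketch of the implied rate per epoch, assuming a decay factor of 0.1 (the factor is not stated in this README; check the repo's scheduler code for the actual value):

```python
def lr_at(epoch, base_lr=0.01, milestones=(15, 20), gamma=0.1):
    """Step decay: multiply base_lr by gamma at each passed milestone.

    gamma=0.1 is an assumption, not a documented value.
    """
    drops = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** drops

for epoch in (0, 15, 20):
    print(epoch, lr_at(epoch))
```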

  3. Use the following command to test your trained model:

python validate.py \
--test_dataset_file '/your_pickle_file_root_path/val.pkl' \
--batch_size 256 \
--decode_threads 10 \
--scheduler \
--num_classes 200 \
--episode_length 10 \
--pretrained_model './checkpoints/exp/model_final.pth' \
--with_replacement \
--mean_feature_as_start \
--feature_interpolate \
--subsample_factor 1 \
--compute_mAP
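`--compute_mAP` reports mean average precision over the 200 ActivityNet classes. As a rough illustration of the metric only (not the repo's exact implementation), per-class average precision pooled into mAP can be computed as:

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: precision averaged at each positive, ranked by score."""
    order = np.argsort(-scores)
    labels = labels[order]
    cum_pos = np.cumsum(labels)
    precision = cum_pos / (np.arange(len(labels)) + 1)
    return float((precision * labels).sum() / labels.sum())

def mean_ap(score_matrix, label_matrix):
    """mAP: mean per-class AP over a (num_samples, num_classes) grid."""
    return float(np.mean([average_precision(score_matrix[:, c], label_matrix[:, c])
                          for c in range(score_matrix.shape[1])]))

scores = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
labels = np.array([[1, 0], [0, 1], [1, 0]])
print(mean_ap(scores, labels))  # -> 1.0 (both classes perfectly ranked)
```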

  4. The single-modality variant of our model is shared under listen_to_look_single_modality. The r2plus1d152 features for ActivityNet can be downloaded using the command below:

wget http://dl.fbaipublicfiles.com/rhgao/ListenToLook/r2plus1d152_features.tar.gz

Acknowledgements

Portions of the code are borrowed or adapted from Bruno Korbar and Zuxuan Wu. Thanks for their help!

License

The code for Listen to Look is CC BY 4.0 licensed, as found in the LICENSE file.
