
TheShadow29 / VidSitu

Licence: MIT license
[CVPR21] Visual Semantic Role Labeling for Video Understanding (https://arxiv.org/abs/2104.00990)

Programming Languages

python
139335 projects - #7 most used programming language
shell
77523 projects

Projects that are alternatives of or similar to VidSitu

calvin
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Stars: ✭ 105 (+156.1%)
Mutual labels:  vision, vision-and-language, grounding
iPerceive
Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering | Python3 | PyTorch | CNNs | Causality | Reasoning | LSTMs | Transformers | Multi-Head Self Attention | Published in IEEE Winter Conference on Applications of Computer Vision (WACV) 2021
Stars: ✭ 52 (+26.83%)
Mutual labels:  captioning, captioning-videos
CarLens-iOS
CarLens - Recognize and Collect Cars
Stars: ✭ 124 (+202.44%)
Mutual labels:  vision
wikiHow paper list
A paper list of research conducted based on wikiHow
Stars: ✭ 25 (-39.02%)
Mutual labels:  vision-and-language
mediapipe plus
The purpose of this project is to apply mediapipe to more AI chips.
Stars: ✭ 38 (-7.32%)
Mutual labels:  vision
SAPC-APCA
APCA (Accessible Perceptual Contrast Algorithm) is a new method for predicting contrast for use in emerging web standards (WCAG 3) for determining readability contrast. APCA is derived from the SAPC (S-LUV Advanced Predictive Color), which is an accessibility-oriented color appearance model designed for self-illuminated displays.
Stars: ✭ 266 (+548.78%)
Mutual labels:  vision
TRAR-VQA
[ICCV 2021] TRAR: Routing the Attention Spans in Transformers for Visual Question Answering -- Official Implementation
Stars: ✭ 49 (+19.51%)
Mutual labels:  vision-and-language
dd-ml-segmentation-benchmark
DroneDeploy Machine Learning Segmentation Benchmark
Stars: ✭ 179 (+336.59%)
Mutual labels:  vision
flutter-vision
iOS and Android app built with Flutter and Firebase. Includes Firebase ML Vision, Firestore, and Storage
Stars: ✭ 45 (+9.76%)
Mutual labels:  vision
mlp-mixer-pytorch
An All-MLP solution for Vision, from Google AI
Stars: ✭ 771 (+1780.49%)
Mutual labels:  vision
non-contact-sleep-apnea-detection
Gihan Jayatilaka, Harshana Weligampola, Suren Sritharan, Pankayaraj Pathmanathan, Roshan Ragel and Isuru Nawinne, "Non-contact Infant Sleep Apnea Detection," 2019 14th Conference on Industrial and Information Systems (ICIIS), Kandy, Sri Lanka, 2019, pp. 260-265, doi: 10.1109/ICIIS47346.2019.9063269.
Stars: ✭ 15 (-63.41%)
Mutual labels:  vision
ebu-tt-live-toolkit
Toolkit for supporting the EBU-TT Live specification
Stars: ✭ 23 (-43.9%)
Mutual labels:  captioning
S2VT-seq2seq-video-captioning-attention
S2VT (seq2seq) video captioning with bahdanau & luong attention implementation in Tensorflow
Stars: ✭ 18 (-56.1%)
Mutual labels:  captioning
DonkeyDrift
Open-source self-driving car based on DonkeyCar and programmable chassis
Stars: ✭ 15 (-63.41%)
Mutual labels:  vision
CBP
Official Tensorflow Implementation of the AAAI-2020 paper "Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction"
Stars: ✭ 52 (+26.83%)
Mutual labels:  vision-and-language
Final-year-project-deep-learning-models
Deep learning for freehand sketch object recognition
Stars: ✭ 22 (-46.34%)
Mutual labels:  vision
edge-computer-vision
Edge Computer Vision Course
Stars: ✭ 41 (+0%)
Mutual labels:  vision
FaceData
A macOS app to parse face landmarks from a video for GANs training
Stars: ✭ 71 (+73.17%)
Mutual labels:  vision
SemanticSegmentation-Libtorch
Libtorch Examples
Stars: ✭ 38 (-7.32%)
Mutual labels:  vision
Vision
Computer Vision And Neural Network with Xamarin
Stars: ✭ 54 (+31.71%)
Mutual labels:  vision

Visual Semantic Role Labeling for Video Understanding (CVPR21)


Visual Semantic Role Labeling for Video Understanding
Arka Sadhu, Tanmay Gupta, Mark Yatskar, Ram Nevatia, Aniruddha Kembhavi
CVPR 2021

VidSitu is a large-scale dataset containing diverse 10-second videos from movies depicting complex situations (a collection of related events). Events in the video are richly annotated at 2-second intervals with verbs, semantic-roles, entity co-references, and event relations.

This repository includes:

  1. Instructions to install, download, and process the VidSitu dataset.
  2. Code to run all experiments reported in the paper, along with log files.
  3. Instructions to submit results to the Leaderboard.

Download

Please see DATA_PREP.md for detailed instructions on downloading and setting up the dataset.

Installation

Please see INSTALL.md for detailed instructions.

Training

  • Basic usage is CUDA_VISIBLE_DEVICES=$GPUS python main_dist.py "experiment_name" --arg1=val1 --arg2=val2, where the available arguments (arg1, arg2, ...) can be found in configs/vsitu_cfg.yml.

  • Set $GPUS=0 for single-GPU training. For multi-GPU training via PyTorch Distributed Data Parallel, use $GPUS=0,1,2,3.

  • The YML config has a hierarchical structure, which is supported using dot (.) notation (see the sketch after this list). For instance, if you want to change beam_size under gen, which in the YML file looks like

    gen:
        beam_size: 1
    

    you can pass --gen.beam_size=5

  • Sometimes it might be easier to directly change the default setting in configs/vsitu_cfg.yml itself.

  • To keep the code modular, some configurations are set in code/extended_config.py as well.

  • All model choices are available under code/mdl_selector.py
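
The following is a minimal sketch of how such a dotted override maps onto the nested YML structure. The apply_override helper is hypothetical, for illustration only, and is not the repository's actual config loader:

    # Sketch: apply a dotted CLI override such as --gen.beam_size=5 to the
    # nested config loaded from configs/vsitu_cfg.yml.
    # apply_override is a hypothetical helper, not the repository's loader.
    import yaml

    def apply_override(cfg: dict, dotted_key: str, value):
        """Walk the nested dict along the dotted key and set the leaf value."""
        keys = dotted_key.split(".")
        node = cfg
        for k in keys[:-1]:
            node = node[k]
        node[keys[-1]] = value
        return cfg

    with open("configs/vsitu_cfg.yml") as f:
        cfg = yaml.safe_load(f)

    apply_override(cfg, "gen.beam_size", 5)  # same effect, in spirit, as --gen.beam_size=5
    print(cfg["gen"]["beam_size"])           # -> 5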

See EXPTS.md for detailed usage and for reproducing the numbers reported in the paper.

Logging

Logs are stored inside the tmp/ directory. When you run the code with $exp_name, the following are stored:

  • txt_logs/$exp_name.txt: the config used and the training and validation losses after every epoch.
  • models/$exp_name.pth: the model, optimizer, scheduler, accuracy, and the number of epochs and iterations completed (a loading sketch follows this list). Only the best model up to the current epoch is stored.
  • ext_logs/$exp_name.txt: this uses Python's logging module to store the printed logger.debug outputs. Mainly used for debugging.
  • predictions: the validation outputs of the current best model.
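
To inspect a saved checkpoint manually, something along the following lines should work; note that the key names in the commented lines are assumptions rather than the codebase's documented schema, so print the keys first and adapt:

    # Sketch: inspect a checkpoint written to tmp/models/$exp_name.pth.
    # The commented key names below are assumptions; check list(ckpt.keys())
    # and adapt to whatever this codebase actually stores.
    import torch

    exp_name = "my_experiment"  # hypothetical experiment name
    ckpt = torch.load(f"tmp/models/{exp_name}.pth", map_location="cpu")
    print(list(ckpt.keys()))    # model, optimizer, scheduler, accuracy, epochs, ...

    # model.load_state_dict(ckpt["model_state_dict"])          # assumed key
    # optimizer.load_state_dict(ckpt["optimizer_state_dict"])  # assumed key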

Logs are also stored using MLflow. These can be uploaded to other experiment trackers such as neptune.ai or wandb for better visualization of results.

Evaluation (Locally)

  1. Evaluation scripts for the three tasks are available under code/evl_fns.py. The same file is used for the leaderboard. If you are using this codebase, the predictions are stored under tmp/predictions/{expt_id}/valid_0.pkl. You can evaluate using the following command:

    python code/evl_fns.py --pred_file='./tmp/predictions/{expt_id}/valid_0.pkl' --split_type='valid' --task_type=$TASK
    

    Here $TASK can be vb, vb_arg, or evrel, corresponding to Verb Prediction, Semantic Role Prediction, and Event Relation Prediction, respectively.

  2. The output formats for the files are as follows (a small construction sketch appears after the list):

    1. Verb Prediction:

      List[Dict]
      Dict:
          # Both lists of length 5. Outer list denotes Events 1-5, inner list denotes Top-5 VerbID predictions
          pred_vbs_ev: List[List[str]]
          # Both lists of length 5. Outer list denotes Events 1-5, inner list denotes the scores for the Top-5 VerbID predictions
          pred_scores_ev: List[List[float]]
          # The index of the video segment used. Corresponds to the number in {valid|test}_split_file.json
          ann_idx: int
      
    2. Semantic Role Labeling Prediction

      List[Dict]
      Dict:
          # same as above
          ann_idx: int
          # The main output used for evaluation. Outer Dict is for Events 1-5.
          vb_output: Dict[Dict]
          # The inner dict has the following keys:
              # VerbID of the event
              vb_id: str
              ArgX: str
              ArgY: str
              ...
      

      Note that ArgX, ArgY depend on the specific VerbID.

    3. Event Relation Prediction

      List[Dict]
      Dict:
          # same as above
          ann_idx: int
          # Outer list of length 4, denoting Event Relations {1-3, 2-3, 3-4, 4-5}. Inner list denotes three Event Relations for the given Verb+Semantic Role inputs
          pred_evrels_ev: List[List[str]]
          # Scores for the above
          pred_scores_ev: List[List[float]]
      

    See examples under docs
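
To make the three formats concrete, here is a hedged sketch that builds one dummy entry per task and writes it in the pickle layout described above. All verb, role, and relation labels are placeholders rather than real VidSitu vocabulary entries, and the output directories are hypothetical:

    # Sketch: build dummy prediction entries in the three formats above and
    # save them as pickle files. Labels, scores, and paths are placeholders.
    import os
    import pickle

    verb_preds = [{
        "pred_vbs_ev": [["run", "walk", "jump", "stand", "sit"]] * 5,  # Events 1-5 x Top-5 VerbIDs
        "pred_scores_ev": [[0.5, 0.2, 0.15, 0.1, 0.05]] * 5,           # matching scores
        "ann_idx": 0,                                                  # index into {valid|test}_split_file.json
    }]

    srl_preds = [{
        "ann_idx": 0,
        # Outer dict covers Events 1-5; using integer event keys is an assumption.
        "vb_output": {ev: {"vb_id": "run", "Arg0": "a man", "Arg1": "a ball"}
                      for ev in range(1, 6)},
    }]

    evrel_preds = [{
        "ann_idx": 0,
        "pred_evrels_ev": [["RelA", "RelB", "RelC"]] * 4,  # 4 relation slots x 3 placeholder relations
        "pred_scores_ev": [[0.6, 0.3, 0.1]] * 4,
    }]

    for task, preds in [("vb", verb_preds), ("vb_arg", srl_preds), ("evrel", evrel_preds)]:
        out_dir = f"tmp/predictions/demo_{task}"  # hypothetical experiment ids
        os.makedirs(out_dir, exist_ok=True)
        with open(f"{out_dir}/valid_0.pkl", "wb") as f:
            pickle.dump(preds, f)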

Leaderboard (Evaluation on Test Sets)

We maintain a separate leaderboard for each of the three tasks. The leaderboards will accept submissions from April 7th, 2021. The output format is the same as for local evaluation.

Here are the leaderboard links:

Citation

@InProceedings{Sadhu_2021_CVPR,
          author = {Sadhu, Arka and Gupta, Tanmay and Yatskar, Mark and Nevatia, Ram and Kembhavi, Aniruddha},
          title = {Visual Semantic Role Labeling for Video Understanding},
          booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
          month = {June},
          year = {2021}
}