
doc-doc / NExT-QA

License: MIT
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR'21)

Programming Languages

python
shell

Projects that are alternatives of or similar to NExT-QA

just-ask
[TPAMI Special Issue on ICCV 2021 Best Papers, Oral] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
Stars: ✭ 57 (+14%)
Mutual labels:  video-understanding, videoqa, video-question-answering
SSTDA
[CVPR 2020] Action Segmentation with Joint Self-Supervised Temporal Domain Adaptation (PyTorch)
Stars: ✭ 150 (+200%)
Mutual labels:  video-understanding
Temporal Shift Module
[ICCV 2019] TSM: Temporal Shift Module for Efficient Video Understanding
Stars: ✭ 1,282 (+2464%)
Mutual labels:  video-understanding
Object level visual reasoning
PyTorch implementation of "Object level Visual Reasoning in Videos", F. Baradel, N. Neverova, C. Wolf, J. Mille, G. Mori, ECCV 2018
Stars: ✭ 163 (+226%)
Mutual labels:  video-understanding
Movienet Tools
Tools for movie and video research
Stars: ✭ 113 (+126%)
Mutual labels:  video-understanding
Step
STEP: Spatio-Temporal Progressive Learning for Video Action Detection. CVPR'19 (Oral)
Stars: ✭ 196 (+292%)
Mutual labels:  video-understanding
Tdn
[CVPR 2021] TDN: Temporal Difference Networks for Efficient Action Recognition
Stars: ✭ 72 (+44%)
Mutual labels:  video-understanding
glimpse clouds
PyTorch implementation of the paper "Glimpse Clouds: Human Activity Recognition from Unstructured Feature Points", F. Baradel, C. Wolf, J. Mille, G.W. Taylor, CVPR 2018
Stars: ✭ 30 (-40%)
Mutual labels:  video-understanding
Awesome Grounding
awesome grounding: A curated list of research papers in visual grounding
Stars: ✭ 247 (+394%)
Mutual labels:  video-understanding
Awesome Activity Prediction
Paper list of activity prediction and related area
Stars: ✭ 147 (+194%)
Mutual labels:  video-understanding
Video2tfrecord
Easily convert RGB video data (e.g. .avi) to the TensorFlow tfrecords file format for training e.g. a NN in TensorFlow. This implementation allows limiting the number of frames per video stored in the tfrecords.
Stars: ✭ 137 (+174%)
Mutual labels:  video-understanding
I3d finetune
TensorFlow code for finetuning I3D model on UCF101.
Stars: ✭ 128 (+156%)
Mutual labels:  video-understanding
Actionvlad
ActionVLAD for video action classification (CVPR 2017)
Stars: ✭ 217 (+334%)
Mutual labels:  video-understanding
Temporal Segment Networks
Code & Models for Temporal Segment Networks (TSN) in ECCV 2016
Stars: ✭ 1,287 (+2474%)
Mutual labels:  video-understanding
Kaleido-BERT
(CVPR2021) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain.
Stars: ✭ 252 (+404%)
Mutual labels:  vision-language
Temporally Language Grounding
A PyTorch implementation of some state-of-the-art models for "Temporally Language Grounding in Untrimmed Videos"
Stars: ✭ 73 (+46%)
Mutual labels:  video-understanding
Multiverse
Dataset, code and model for the CVPR'20 paper "The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction". And for the ECCV'20 SimAug paper.
Stars: ✭ 131 (+162%)
Mutual labels:  video-understanding
Youtube 8m
The 2nd place Solution to the Youtube-8M Video Understanding Challenge by Team Monkeytyping (based on tensorflow)
Stars: ✭ 171 (+242%)
Mutual labels:  video-understanding
pytorch violet
A PyTorch implementation of VIOLET
Stars: ✭ 119 (+138%)
Mutual labels:  video-question-answering
hcrn-videoqa
Implementation for the paper "Hierarchical Conditional Relation Networks for Video Question Answering" (Le et al., CVPR 2020, Oral)
Stars: ✭ 111 (+122%)
Mutual labels:  videoqa

NExT-QA

We reproduce some SOTA VideoQA methods to provide benchmark results for our NExT-QA dataset, accepted to CVPR 2021. NExT-QA is a VideoQA benchmark targeting the explanation of video contents: it challenges QA models to reason about causal and temporal actions and to understand the rich object interactions in daily activities. We set up both multi-choice and open-ended QA tasks on the dataset. This repo provides resources for multi-choice QA; for open-ended QA, please refer to NExT-OE. For more details, please refer to our dataset page.

Todo

  1. Relation annotations are available.
  2. Open evaluation server and release test data.
  3. Release spatial feature (valid for 2 weeks from 2020/9/23).
  4. Release RoI feature (tensor shape: (vid_num, clip_per_vid, frame_per_clip, region_per_frame, feat_dim)); see the indexing sketch below.
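
For illustration, here is a minimal sketch of how a 5-D RoI tensor with that layout could be indexed once loaded. Only the axis order comes from the item above; the sizes and variable names are made up:

import numpy as np

# Axis order from the Todo item; the sizes here are illustrative only:
# (vid_num, clip_per_vid, frame_per_clip, region_per_frame, feat_dim)
roi = np.zeros((10, 8, 4, 20, 2048), dtype=np.float32)

vid, clip, frame = 0, 3, 2
regions = roi[vid, clip, frame]   # all region features of one frame
print(regions.shape)              # (20, 2048)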

Environment

Anaconda 4.8.4, Python 3.6.8, PyTorch 1.6 and CUDA 10.2. For other libraries, please refer to the file requirements.txt.

Install

Please create an environment for this project with Anaconda (install Anaconda first if you have not):

>conda create -n videoqa python==3.6.8
>conda activate videoqa
>git clone https://github.com/doc-doc/NExT-QA.git
>pip install -r requirements.txt #may take some time to install

Data Preparation

Please download the pre-computed features and QA annotations from here. There are 4 zip files:

  • ['vid_feat.zip']: Appearance and motion features for video representation (extracted with code provided by HCRN).
  • ['qas_bert.zip']: Fine-tuned BERT features for QA-pair representation (based on pytorch-pretrained-BERT).
  • ['nextqa.zip']: Annotations of QAs and GloVe embeddings.
  • ['models.zip']: HGA model.

After downloading the data, please create a folder ['data/feats'] in the same directory as ['NExT-QA'], then unzip the video and QA features into it. You will have directories like ['data/feats/vid_feat/', 'data/feats/qas_bert/' and 'NExT-QA/'] in your workspace. Please unzip the files in ['nextqa.zip'] into ['NExT-QA/dataset/nextqa'] and ['models.zip'] into ['NExT-QA/models/'].
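
For a quick sanity check that the features unzipped correctly, the sketch below inspects one feature file, assuming the video features are stored as HDF5 (the file name is hypothetical; list ['data/feats/vid_feat/'] for the actual names):

import h5py

# Hypothetical file name -- check the unzipped folder for the real ones.
# Dataset keys are discovered at runtime rather than assumed.
with h5py.File('../data/feats/vid_feat/app_mot_val.h5', 'r') as f:
    for name, dset in f.items():
        print(name, dset.shape, dset.dtype)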

(You are also encouraged to design your own pre-computed video features. In that case, please download the raw videos from VidOR (+missing videos). As NExT-QA's videos are sourced from VidOR, you can easily link the QA annotations to the corresponding videos via the key 'video' in the ['nextqa/'] CSV files; for this you may need the map file ['nextqa/map_vid_vidorID.json'].)
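
As a hedged sketch of that linking step: the map file name comes from this README, but the VidOR directory layout below is an assumption for illustration only:

import json

# dataset/nextqa/map_vid_vidorID.json maps a NExT-QA 'video' key to its
# VidOR identifier; the raw-video path layout is assumed, not verified.
with open('dataset/nextqa/map_vid_vidorID.json') as f:
    vid2vidor = json.load(f)

video_key = next(iter(vid2vidor))                  # an example 'video' key
raw_path = f'/path/to/VidOR/video/{vid2vidor[video_key]}.mp4'
print(video_key, '->', raw_path)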

Usage

Once the data is ready, you can run the code. To test the environment and code first, we provide the prediction file and model of the SOTA approach (i.e., HGA) on NExT-QA. You can reproduce the results reported in the paper by running:

>python eval_mc.py

The command above will load the prediction file under ['results/'] and evaluate it. You can also obtain the prediction by running:

>./main.sh 0 val #Test the model with GPU id 0

The command above will load the model under ['models/'] and generate the prediction file. If you want to train the model, please run:

>./main.sh 0 train # Train the model with GPU id 0

It will train the model and save it to ['models/']. (Results may differ slightly depending on the environment.)
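
For reference, here is a minimal sketch of the per-type accuracy computation that eval_mc.py reports (Acc@C/T/D); the file and column names are illustrative assumptions, so consult eval_mc.py and the ['dataset/nextqa'] CSVs for the actual formats:

import pandas as pd

# Illustrative file/column names -- not necessarily the repo's exact format.
preds = pd.read_csv('results/pred_val.csv')   # columns: qid, prediction
gt = pd.read_csv('dataset/nextqa/val.csv')    # columns incl. qid, type, answer
df = gt.merge(preds, on='qid')
df['correct'] = df['prediction'] == df['answer']

# Question types begin with C (causal), T (temporal) or D (descriptive).
for prefix, name in [('C', 'Acc@C'), ('T', 'Acc@T'), ('D', 'Acc@D')]:
    sub = df[df['type'].str.startswith(prefix)]
    print(f"{name}: {100 * sub['correct'].mean():.2f}")
print(f"Acc: {100 * df['correct'].mean():.2f}")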

Results on Val. Set

Methods           Text Rep.  Acc@C  Acc@T  Acc@D  Acc    | Text Rep.  Acc@C  Acc@T  Acc@D  Acc
BlindQA           GloVe      26.89  30.83  32.60  30.60  | BERT-FT    42.62  45.53  43.89  43.76
EVQA              GloVe      28.69  31.27  41.44  31.51  | BERT-FT    42.64  46.34  45.82  44.24
STVQA (CVPR17)    GloVe      36.25  36.29  55.21  39.21  | BERT-FT    44.76  49.26  55.86  47.94
CoMem (CVPR18)    GloVe      35.10  37.28  50.45  38.19  | BERT-FT    45.22  49.07  55.34  48.04
HME (CVPR19)      GloVe      37.97  36.91  51.87  39.79  | BERT-FT    46.18  48.20  58.30  48.72
HCRN (CVPR20)     GloVe      39.09  40.01  49.16  40.95  | BERT-FT    45.91  49.26  53.67  48.20
HGA (AAAI20)      GloVe      35.71  38.40  55.60  39.67  | BERT-FT    46.26  50.74  59.33  49.74
Human             -          87.61  88.56  90.40  88.38  | -          87.61  88.56  90.40  88.38

(Acc@C, Acc@T and Acc@D denote accuracy on causal, temporal and descriptive questions, respectively.)

(For comparison, please refer to the results on val/test sets in our paper.)

Multi-choice QA vs. Open-ended QA

(Figure: visualization comparing multi-choice and open-ended QA examples.)

Some Latest Results

Table 2. VideoQA Accuracy (%).

Methods     Publication  Highlights                           Val            Test
IGV         CVPR'22      Invariant Grounding                  -              51.34
HQGA        AAAI'22      Multi-Granularity                    51.42          51.75
P3D-G       AAAI'22      Pseudo 3D Scene Graph                53.4           -
ATP         CVPR'22      CLIP & Frame Bias                    54.3           -
VGT         ECCV'22      Graph Transformer/(Pretrain)         55.02/(56.89)  53.68/(55.7)
HGA+EIGV    ACM MM'22    Equivariant and Invariant Grounding  -              53.7

Citation

@InProceedings{xiao2021next,
    author    = {Xiao, Junbin and Shang, Xindi and Yao, Angela and Chua, Tat-Seng},
    title     = {NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2021},
    pages     = {9777-9786}
}

Acknowledgement

Our reproduction of the methods is based on the respective official repositories; we thank the authors for releasing their code. If you use the related parts, please cite the corresponding papers noted in the code comments.
