thaolmk54 / hcrn-videoqa

License: Apache-2.0
Implementation for the paper "Hierarchical Conditional Relation Networks for Video Question Answering" (Le et al., CVPR 2020, Oral)

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to hcrn-videoqa

just-ask
[TPAMI Special Issue on ICCV 2021 Best Papers, Oral] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
Stars: ✭ 57 (-48.65%)
Mutual labels:  vqa, videoqa
VideoNavQA
An alternative EQA paradigm and informative benchmark + models (BMVC 2019, ViGIL 2019 spotlight)
Stars: ✭ 22 (-80.18%)
Mutual labels:  vqa, question-answering
iPerceive
Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering | Python3 | PyTorch | CNNs | Causality | Reasoning | LSTMs | Transformers | Multi-Head Self Attention | Published in IEEE Winter Conference on Applications of Computer Vision (WACV) 2021
Stars: ✭ 52 (-53.15%)
Mutual labels:  question-answering, videoqa
DVQA dataset
DVQA Dataset: A Bar chart question answering dataset presented at CVPR 2018
Stars: ✭ 20 (-81.98%)
Mutual labels:  vqa, question-answering
MICCAI21 MMQ
Multiple Meta-model Quantifying for Medical Visual Question Answering
Stars: ✭ 16 (-85.59%)
Mutual labels:  vqa, question-answering
Mullowbivqa
Hadamard Product for Low-rank Bilinear Pooling
Stars: ✭ 57 (-48.65%)
Mutual labels:  vqa, question-answering
Mac Network
Implementation for the paper "Compositional Attention Networks for Machine Reasoning" (Hudson and Manning, ICLR 2018)
Stars: ✭ 444 (+300%)
Mutual labels:  vqa, question-answering
Vqa Tensorflow
TensorFlow implementation of Deeper LSTM + normalized CNN for Visual Question Answering
Stars: ✭ 98 (-11.71%)
Mutual labels:  vqa, question-answering
Pytorch Vqa
Strong baseline for visual question answering
Stars: ✭ 158 (+42.34%)
Mutual labels:  vqa
FinBERT-QA
Financial Domain Question Answering with pre-trained BERT Language Model
Stars: ✭ 70 (-36.94%)
Mutual labels:  question-answering
Vqa regat
Research Code for ICCV 2019 paper "Relation-aware Graph Attention Network for Visual Question Answering"
Stars: ✭ 129 (+16.22%)
Mutual labels:  vqa
Clipbert
[CVPR 2021 Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning for image-text and video-text tasks.
Stars: ✭ 168 (+51.35%)
Mutual labels:  vqa
ZS-F-VQA
Code and Data for paper: Zero-shot Visual Question Answering using Knowledge Graph [ ISWC 2021 ]
Stars: ✭ 51 (-54.05%)
Mutual labels:  vqa
Vqa Mfb
Stars: ✭ 153 (+37.84%)
Mutual labels:  vqa
nlp qa project
Natural Language Processing Question Answering Final Project
Stars: ✭ 61 (-45.05%)
Mutual labels:  question-answering
Papers
Notes on some CV papers I have read: image captioning, weakly supervised segmentation, etc.
Stars: ✭ 99 (-10.81%)
Mutual labels:  vqa
examinee
Laravel quiz and exam system, a clone of Udemy
Stars: ✭ 151 (+36.04%)
Mutual labels:  question-answering
CPPNotes
[C++ interviews + C++ study guide] A collection covering most of the core knowledge a C++ programmer needs to master.
Stars: ✭ 557 (+401.8%)
Mutual labels:  question-answering
DrFAQ
DrFAQ is a plug-and-play question answering NLP chatbot that can be generally applied to any organisation's text corpora.
Stars: ✭ 29 (-73.87%)
Mutual labels:  question-answering
cmrc2017
The First Evaluation Workshop on Chinese Machine Reading Comprehension (CMRC 2017)
Stars: ✭ 90 (-18.92%)
Mutual labels:  question-answering

Hierarchical Conditional Relation Networks for Video Question Answering (HCRN-VideoQA)

We introduce a general-purpose reusable neural unit called the Conditional Relation Network (CRN), which encapsulates and transforms an array of tensorial objects into a new array of the same kind, conditioned on a contextual feature. The flexibility of CRN units is then examined by solving Video Question Answering, a challenging problem requiring joint comprehension of video content and natural language.
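
For intuition, here is a minimal, self-contained PyTorch sketch of the CRN idea. It is not the repo's implementation (the paper samples subsets rather than enumerating all of them, and uses richer aggregation and fusion functions); it only illustrates the array-in, array-out pattern conditioned on a contextual feature:

import itertools
import torch
import torch.nn as nn

class CRNSketch(nn.Module):
    # Toy CRN: aggregate subsets of the input array, fuse each aggregate
    # with the conditioning feature, and return a new array of tensors.
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ELU())

    def forward(self, objects, condition):
        # objects: list of (batch, dim) tensors; condition: (batch, dim)
        outputs = []
        for k in range(2, len(objects)):                   # relations over subset sizes
            for subset in itertools.combinations(objects, k):
                agg = torch.stack(subset).mean(dim=0)      # aggregate the subset
                outputs.append(self.fuse(torch.cat([agg, condition], dim=-1)))
        return outputs                                     # an array of the same kind

crn = CRNSketch(dim=64)
clip_feats = [torch.randn(2, 64) for _ in range(4)]        # e.g. clip-level features
out = crn(clip_feats, condition=torch.randn(2, 64))        # e.g. a question embedding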

Illustrations of the CRN unit and the resulting HCRN model for VideoQA:

[Figures: CRN unit; HCRN architecture]

Check out our paper for details.

Setup

  1. Clone the repository:
 git clone https://github.com/thaolmk54/hcrn-videoqa.git
  2. Download the TGIF-QA, MSRVTT-QA, and MSVD-QA datasets and edit the absolute paths in preprocess/preprocess_features.py and preprocess/preprocess_questions.py according to where you store your data; a hypothetical illustration of this edit follows the list. The default paths follow the pattern /ceph-g/lethao/datasets/{dataset_name}/.

  3. Install dependencies:

conda create -n hcrn_videoqa python=3.6
conda activate hcrn_videoqa
conda install -c conda-forge ffmpeg
conda install -c conda-forge scikit-video
pip install -r requirements.txt
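
As a hypothetical illustration of the path edit in step 2 (not the repo's actual code; the real variable names may differ), search the two preprocessing scripts for /ceph-g/lethao/datasets and point the hard-coded root at your own data directory:

# Hypothetical example inside preprocess/preprocess_features.py:
dataset_root = '/path/to/your/datasets/tgif-qa/'   # default: /ceph-g/lethao/datasets/tgif-qa/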

Experiments with TGIF-QA

Depending on the task, choose question_type from four options: action, transition, count, or frameqa.

Preprocessing visual features

  1. To extract appearance features:
python preprocess/preprocess_features.py --gpu_id 2 --dataset tgif-qa --model resnet101 --question_type {question_type}
  2. To extract motion features:

    Download the ResNeXt-101 pretrained model (resnext-101-kinetics.pth) and place it in data/preprocess/pretrained/. The 112x112 image size below matches the input resolution this Kinetics-pretrained model expects.

python preprocess/preprocess_features.py --dataset tgif-qa --model resnext101 --image_height 112 --image_width 112 --question_type {question_type}

Note: Extracting visual features takes a long time. You can download our pre-extracted features from here and save them in data/tgif-qa/{question_type}/. Please use the following command to join the split files:

cat tgif-qa_{question_type}_appearance_feat.h5.part* > tgif-qa_{question_type}_appearance_feat.h5
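
As an optional sanity check, the joined file should open cleanly with h5py (assuming h5py is installed; the dataset key names depend on the preprocessing script and are not assumed here):

import h5py

# Replace frameqa with the question type you downloaded.
with h5py.File('data/tgif-qa/frameqa/tgif-qa_frameqa_appearance_feat.h5', 'r') as f:
    print(list(f.keys()))   # list the datasets stored in the file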

Preprocess linguistic features

  1. Download the pretrained GloVe 300d word vectors to data/glove/ and process them into a pickle file (a sketch of this conversion follows the list):
python txt2pickle.py
  2. Preprocess train/val/test questions:
python preprocess/preprocess_questions.py --dataset tgif-qa --question_type {question_type} --glove_pt data/glove/glove.840.300d.pkl --mode train

python preprocess/preprocess_questions.py --dataset tgif-qa --question_type {question_type} --mode test
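
For reference, here is a minimal sketch of the kind of conversion a script like txt2pickle.py performs; this is an assumption about its behavior, not the repo's actual code. It parses the GloVe text file (each line is a word followed by 300 floats) into a {word: vector} dict and pickles it under the name the commands above expect:

import pickle
import numpy as np

glove = {}
with open('data/glove/glove.840B.300d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        word = ' '.join(parts[:-300])                     # a few GloVe tokens contain spaces
        glove[word] = np.asarray(parts[-300:], dtype=np.float32)

with open('data/glove/glove.840.300d.pkl', 'wb') as f:
    pickle.dump(glove, f)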

Training

Choose a suitable config file in configs/{task}.yml for one of the four tasks (action, transition, count, frameqa) to train the model. For example, to train on the action task, run the following command:

python train.py --cfg configs/tgif_qa_action.yml

Evaluation

To evaluate the trained model, run the following:

python validate.py --cfg configs/tgif_qa_action.yml

Note: A pretrained model for the action task is available here. Save the file in results/expTGIF-QAAction/ckpt/ for evaluation.

Experiments with MSRVTT-QA and MSVD-QA

The following commands run experiments with the MSRVTT-QA dataset; replace msrvtt-qa with msvd-qa to run with the MSVD-QA dataset.

Preprocessing visual features

  1. To extract appearance features:
python preprocess/preprocess_features.py --gpu_id 2 --dataset msrvtt-qa --model resnet101
  2. To extract motion features:
python preprocess/preprocess_features.py --dataset msrvtt-qa --model resnext101 --image_height 112 --image_width 112

Preprocess linguistic features

Preprocess train/val/test questions:

python preprocess/preprocess_questions.py --dataset msrvtt-qa --glove_pt data/glove/glove.840.300d.pkl --mode train
    
python preprocess/preprocess_questions.py --dataset msrvtt-qa --mode val
    
python preprocess/preprocess_questions.py --dataset msrvtt-qa --mode test

Training

python train.py --cfg configs/msrvtt_qa.yml

Evaluation

To evaluate the trained model, run the following:

python validate.py --cfg configs/msrvtt_qa.yml

Citations

If you make use of this repository for your research, please cite the following paper:

@article{le2020hierarchical,
  title={Hierarchical Conditional Relation Networks for Video Question Answering},
  author={Le, Thao Minh and Le, Vuong and Venkatesh, Svetha and Tran, Truyen},
  journal={arXiv preprint arXiv:2002.10698},
  year={2020}
}

Acknowledgement

  • For motion feature extraction, we adapt the ResNeXt-101 model from this repo to our code. Thanks to @kenshohara for releasing the code and the pretrained models.
  • We refer to this repo for preprocessing.
  • Our implementation of the dataloader is based on this repo.