All Projects → hengyuan-hu → Bottom Up Attention Vqa

hengyuan-hu / Bottom Up Attention Vqa

Licence: gpl-3.0
An efficient PyTorch implementation of the winning entry of the 2017 VQA Challenge.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Bottom Up Attention Vqa

Transformer-MM-Explainability
[ICCV 2021- Oral] Official PyTorch implementation for Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, a novel method to visualize any Transformer-based network. Including examples for DETR, VQA.
Stars: ✭ 484 (-27.44%)
Mutual labels:  vqa
rosita
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
Stars: ✭ 36 (-94.6%)
Mutual labels:  vqa
Oscar
Oscar and VinVL
Stars: ✭ 396 (-40.63%)
Mutual labels:  vqa
probnmn-clevr
Code for ICML 2019 paper "Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering" [long-oral]
Stars: ✭ 63 (-90.55%)
Mutual labels:  vqa
FigureQA-baseline
TensorFlow implementation of the CNN-LSTM, Relation Network and text-only baselines for the paper "FigureQA: An Annotated Figure Dataset for Visual Reasoning"
Stars: ✭ 28 (-95.8%)
Mutual labels:  vqa
MICCAI21 MMQ
Multiple Meta-model Quantifying for Medical Visual Question Answering
Stars: ✭ 16 (-97.6%)
Mutual labels:  vqa
hcrn-videoqa
Implementation for the paper "Hierarchical Conditional Relation Networks for Video Question Answering" (Le et al., CVPR 2020, Oral)
Stars: ✭ 111 (-83.36%)
Mutual labels:  vqa
Mmf
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
Stars: ✭ 4,713 (+606.6%)
Mutual labels:  vqa
vqa-soft
Accompanying code for "A Simple Loss Function for Improving the Convergence and Accuracy of Visual Question Answering Models" CVPR 2017 VQA workshop paper.
Stars: ✭ 14 (-97.9%)
Mutual labels:  vqa
Tbd Nets
PyTorch implementation of "Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning"
Stars: ✭ 345 (-48.28%)
Mutual labels:  vqa
AoA-pytorch
A Pytorch implementation of Attention on Attention module (both self and guided variants), for Visual Question Answering
Stars: ✭ 33 (-95.05%)
Mutual labels:  vqa
DVQA dataset
DVQA Dataset: A Bar chart question answering dataset presented at CVPR 2018
Stars: ✭ 20 (-97%)
Mutual labels:  vqa
Nscl Pytorch Release
PyTorch implementation for the Neuro-Symbolic Concept Learner (NS-CL).
Stars: ✭ 276 (-58.62%)
Mutual labels:  vqa
iMIX
A framework for Multimodal Intelligence research from Inspur HSSLAB.
Stars: ✭ 21 (-96.85%)
Mutual labels:  vqa
Awesome Vqa
Visual Q&A reading list
Stars: ✭ 403 (-39.58%)
Mutual labels:  vqa
mmgnn textvqa
A Pytorch implementation of CVPR 2020 paper: Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text
Stars: ✭ 41 (-93.85%)
Mutual labels:  vqa
bottom-up-features
Bottom-up features extractor implemented in PyTorch.
Stars: ✭ 62 (-90.7%)
Mutual labels:  vqa
Vqa.pytorch
Visual Question Answering in Pytorch
Stars: ✭ 602 (-9.75%)
Mutual labels:  vqa
Mac Network
Implementation for the paper "Compositional Attention Networks for Machine Reasoning" (Hudson and Manning, ICLR 2018)
Stars: ✭ 444 (-33.43%)
Mutual labels:  vqa
Awesome Visual Question Answering
A curated list of Visual Question Answering(VQA)(Image/Video Question Answering),Visual Question Generation ,Visual Dialog ,Visual Commonsense Reasoning and related area.
Stars: ✭ 295 (-55.77%)
Mutual labels:  vqa

Bottom-Up and Top-Down Attention for Visual Question Answering

An efficient PyTorch implementation of the winning entry of the 2017 VQA Challenge.

The implementation follows the VQA system described in "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering" (https://arxiv.org/abs/1707.07998) and "Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge" (https://arxiv.org/abs/1708.02711).

Results

Model Validation Accuracy Training Time
Reported Model 63.15 12 - 18 hours (Tesla K40)
Implemented Model 63.58 40 - 50 minutes (Titan Xp)

The accuracy was calculated using the VQA evaluation metric.

About

This is part of a project done at CMU for the course 11-777 Advanced Multimodal Machine Learning and a joint work between Hengyuan Hu, Alex Xiao, and Henry Huang.

As part of our project, we implemented bottom up attention as a strong VQA baseline. We were planning to integrate object detection with VQA and were very glad to see that Peter Anderson and Damien Teney et al. had already done that beautifully. We hope this clean and efficient implementation can serve as a useful baseline for future VQA explorations.

Implementation Details

Our implementation follows the overall structure of the papers but with the following simplifications:

  1. We don't use extra data from Visual Genome.
  2. We use only a fixed number of objects per image (K=36).
  3. We use a simple, single stream classifier without pre-training.
  4. We use the simple ReLU activation instead of gated tanh.

The first two points greatly reduce the training time. Our implementation takes around 200 seconds per epoch on a single Titan Xp while the one described in the paper takes 1 hour per epoch.

The third point is simply because we feel the two stream classifier and pre-training in the original paper is over-complicated and not necessary.

For the non-linear activation unit, we tried gated tanh but couldn't make it work. We also tried gated linear unit (GLU) and it works better than ReLU. Eventually we choose ReLU due to its simplicity and since the gain from using GLU is too small to justify the fact that GLU doubles the number of parameters.

With these simplifications we would expect the performance to drop. For reference, the best result on validation set reported in the paper is 63.15. The reported result without extra data from visual genome is 62.48, the result using only 36 objects per image is 62.82, the result using two steam classifier but not pre-trained is 62.28 and the result using ReLU is 61.63. These numbers are cited from the Table 1 of the paper: "Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge". With all the above simplification aggregated, our first implementation got around 59-60 on validation set.

To shrink the gap, we added some simple but powerful modifications. Including:

  1. Add dropout to alleviate overfitting
  2. Double the number of neurons
  3. Add weight normalization (BN seems not work well here)
  4. Switch to Adamax optimizer
  5. Gradient clipping

These small modifications bring the number back to ~62.80. We further change the concatenation based attention module in the original paper to a projection based module. This new attention module is inspired by the paper "Modeling Relationships in Referential Expressions with Compositional Modular Networks" (https://arxiv.org/pdf/1611.09978.pdf), but with some modifications (implemented in attention.NewAttention). With the help of this new attention, we boost the performance to ~63.58, surpassing the reported best result with no extra data and less computation cost.

Usage

Prerequisites

Make sure you are on a machine with a NVIDIA GPU and Python 2 with about 70 GB disk space.

  1. Install PyTorch v0.3 with CUDA and Python 2.7.
  2. Install h5py.

Data Setup

All data should be downloaded to a 'data/' directory in the root directory of this repository.

The easiest way to download the data is to run the provided script tools/download.sh from the repository root. The features are provided by and downloaded from the original authors' repo. If the script does not work, it should be easy to examine the script and modify the steps outlined in it according to your needs. Then run tools/process.sh from the repository root to process the data to the correct format.

Training

Simply run python main.py to start training. The training and validation scores will be printed every epoch, and the best model will be saved under the directory "saved_models". The default flags should give you the result provided in the table above.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].