yuzcccc / Vqa Mfb

Projects that are alternatives of or similar to Vqa Mfb

MICCAI21 MMQ
Multiple Meta-model Quantifying for Medical Visual Question Answering
Stars: ✭ 16 (-89.54%)
Mutual labels:  vqa
Vqa.pytorch
Visual Question Answering in Pytorch
Stars: ✭ 602 (+293.46%)
Mutual labels:  vqa
Vqa
CloudCV Visual Question Answering Demo
Stars: ✭ 57 (-62.75%)
Mutual labels:  vqa
Awesome Visual Question Answering
A curated list of Visual Question Answering (VQA) (image/video question answering), Visual Question Generation, Visual Dialog, Visual Commonsense Reasoning, and related areas.
Stars: ✭ 295 (+92.81%)
Mutual labels:  vqa
Mac Network
Implementation for the paper "Compositional Attention Networks for Machine Reasoning" (Hudson and Manning, ICLR 2018)
Stars: ✭ 444 (+190.2%)
Mutual labels:  vqa
Visual Question Answering
📷 ❓ Visual Question Answering Demo and Algorithmia API
Stars: ✭ 18 (-88.24%)
Mutual labels:  vqa
rosita
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
Stars: ✭ 36 (-76.47%)
Mutual labels:  vqa
Papers
Some computer-vision papers I have read, covering image captioning, weakly supervised segmentation, etc.
Stars: ✭ 99 (-35.29%)
Mutual labels:  vqa
Mmf
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
Stars: ✭ 4,713 (+2980.39%)
Mutual labels:  vqa
Conditional Batch Norm
Pytorch implementation of NIPS 2017 paper "Modulating early visual processing by language"
Stars: ✭ 51 (-66.67%)
Mutual labels:  vqa
Tbd Nets
PyTorch implementation of "Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning"
Stars: ✭ 345 (+125.49%)
Mutual labels:  vqa
Awesome Vqa
Visual Q&A reading list
Stars: ✭ 403 (+163.4%)
Mutual labels:  vqa
Vizwiz Vqa Pytorch
PyTorch VQA implementation that achieved top performances in the (ECCV18) VizWiz Grand Challenge: Answering Visual Questions from Blind People
Stars: ✭ 33 (-78.43%)
Mutual labels:  vqa
Nscl Pytorch Release
PyTorch implementation for the Neuro-Symbolic Concept Learner (NS-CL).
Stars: ✭ 276 (+80.39%)
Mutual labels:  vqa
Mullowbivqa
Hadamard Product for Low-rank Bilinear Pooling
Stars: ✭ 57 (-62.75%)
Mutual labels:  vqa
bottom-up-features
Bottom-up feature extractor implemented in PyTorch.
Stars: ✭ 62 (-59.48%)
Mutual labels:  vqa
Bottom Up Attention Vqa
An efficient PyTorch implementation of the winning entry of the 2017 VQA Challenge.
Stars: ✭ 667 (+335.95%)
Mutual labels:  vqa
Vqa regat
Research Code for ICCV 2019 paper "Relation-aware Graph Attention Network for Visual Question Answering"
Stars: ✭ 129 (-15.69%)
Mutual labels:  vqa
Vqa Tensorflow
Tensorflow Implementation of Deeper LSTM+ normalized CNN for Visual Question Answering
Stars: ✭ 98 (-35.95%)
Mutual labels:  vqa
Bottom Up Attention
Bottom-up attention model for image captioning and VQA, based on Faster R-CNN and Visual Genome
Stars: ✭ 989 (+546.41%)
Mutual labels:  vqa

MFB and MFH for VQA

This project is deprecated! The PyTorch implementation of MFB(MFH)+CoAtt with pre-trained models, along with several other state-of-the-art VQA models, is maintained in our OpenVQA project, which is much more convenient to use!

This project is the implementation of the papers Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering (MFB) and Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual Question Answering (MFH). Compared with existing state-of-the-art approaches such as MCB and MLB, our MFB models achieve superior performance on the large-scale VQA-1.0 and VQA-2.0 datasets. Moreover, MFH, the high-order extension of MFB, is also provided and reports even better VQA performance. The MFB(MFH)+CoAtt network architecture for VQA is illustrated in Figure 1.

Figure 1: The MFB+CoAtt network architecture for VQA.
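
For readers who just want the gist of the fusion step: MFB projects the image and question features into a high-dimensional joint space, multiplies them element-wise, sum-pools over windows of size k, and applies power and L2 normalization. Below is a minimal PyTorch sketch of that idea. Note that this repository itself is Caffe-based; the dimensions, names, and dropout rate here are illustrative placeholders, not the project's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFBFusion(nn.Module):
    """Minimal sketch of Multi-modal Factorized Bilinear (MFB) pooling.

    Illustrative only: the real project is implemented in Caffe, and the
    dimensions below are placeholders rather than the paper's exact settings.
    """
    def __init__(self, img_dim=2048, ques_dim=1024, out_dim=1000, k=5, dropout=0.1):
        super().__init__()
        self.k = k
        self.out_dim = out_dim
        # Project both modalities into the k * out_dim joint space.
        self.proj_img = nn.Linear(img_dim, k * out_dim)
        self.proj_ques = nn.Linear(ques_dim, k * out_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, img_feat, ques_feat):
        # Element-wise product in the expanded space.
        joint = self.proj_img(img_feat) * self.proj_ques(ques_feat)
        joint = self.dropout(joint)
        # Sum pooling over non-overlapping windows of size k.
        joint = joint.view(-1, self.out_dim, self.k).sum(dim=2)
        # Signed square-root (power) normalization followed by L2 normalization.
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-12)
        return F.normalize(joint, dim=1)

# Example: fuse a 2048-d image feature with a 1024-d question feature.
z = MFBFusion()(torch.randn(8, 2048), torch.randn(8, 1024))  # -> (8, 1000)
```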

Update Dec. 2nd, 2017

A third-party PyTorch implementation of MFB(MFH) is released here. Great thanks to Liam!

Update Sep. 5th, 2017

Using the Bottom-Up and Top-Down (BUTD) image features (the model with adaptive K ranging from 10 to 100) available here, our single MFH+CoAtt+GloVe model achieved an overall accuracy of 68.76% on the test-dev set of the VQA-2.0 dataset. With an ensemble of 8 models, we achieved new state-of-the-art performance on the VQA-2.0 leaderboard with an overall accuracy of 70.92%.

Update Aug. 1st, 2017

Our solution for the VQA Challenge 2017 is updated!

We proposed a high-order extension of MFB, i.e., Multi-modal Factorized High-order Pooling (MFH). See the flowchart in Figure 2 and the implementations in the mfh_baseline and mfh-coatt-glove folders. With an ensemble of 9 MFH+CoAtt+GloVe(+VG) models, we won 2nd place (tied with another team) in the VQA Challenge 2017. Detailed information can be found in our paper (the second paper in the Citation section at the bottom of this page).

Figure 2: The high-order MFH model which consists of p MFB blocks (without sharing parameters).
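
Conceptually, MFH chains p MFB blocks: the expanded (pre-pooling) features of block i modulate block i+1, and the pooled outputs of all blocks are concatenated. Below is a hedged PyTorch sketch of that cascade; as with the previous snippet, the dimensions and names are illustrative placeholders, not this repository's Caffe implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFHFusion(nn.Module):
    """Sketch of Multi-modal Factorized High-order (MFH) pooling:
    a cascade of p MFB blocks whose pooled outputs are concatenated.
    Dimensions are placeholders, not the paper's exact configuration.
    """
    def __init__(self, img_dim=2048, ques_dim=1024, out_dim=1000, k=5, p=2, dropout=0.1):
        super().__init__()
        self.k, self.out_dim, self.p = k, out_dim, p
        self.proj_img = nn.ModuleList([nn.Linear(img_dim, k * out_dim) for _ in range(p)])
        self.proj_ques = nn.ModuleList([nn.Linear(ques_dim, k * out_dim) for _ in range(p)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, img_feat, ques_feat):
        outputs, exp_prev = [], 1.0
        for i in range(self.p):
            # Each block's expanded features are modulated by the previous block's.
            exp_i = exp_prev * self.dropout(self.proj_img[i](img_feat) * self.proj_ques[i](ques_feat))
            exp_prev = exp_i
            # Same sum pooling + power/L2 normalization as in a single MFB block.
            pooled = exp_i.view(-1, self.out_dim, self.k).sum(dim=2)
            pooled = torch.sign(pooled) * torch.sqrt(torch.abs(pooled) + 1e-12)
            outputs.append(F.normalize(pooled, dim=1))
        return torch.cat(outputs, dim=1)  # (batch, p * out_dim)

# Example: two cascaded blocks give a 2000-d fused vector.
z = MFHFusion()(torch.randn(8, 2048), torch.randn(8, 1024))  # -> (8, 2000)
```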

Prerequisites

Our code is implemented based on the high-quality vqa-mcb project. The data preprocessing and other prerequisites are the same as theirs. Before running our scripts to train or test the MFB model, see the Prerequisites and Data Preprocessing sections in the README of the vqa-mcb project first.

  • The Caffe version required for our MFB is slightly different from the one for MCB. We add some layers, e.g., sum pooling, permute, and KLD loss layers, to the feature/20160617_cb_softattention branch of Caffe for MCB. Please check out our Caffe version here and compile it. Note that cuDNN is currently not compatible with the sum pooling layer; you should switch it off to run the code correctly.
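
Once this customized Caffe fork is compiled with pycaffe support, a trained MFB/MFH network can be loaded for inference with the standard pycaffe pattern. A minimal sketch, assuming hypothetical file and blob names (use the deploy prototxt, weights, and blob names that ship with the model you actually download):

```python
import caffe
import numpy as np

# Use the customized Caffe fork described above (with sum pooling, permute,
# and KLD loss layers). File names below are placeholders.
caffe.set_device(0)
caffe.set_mode_gpu()
net = caffe.Net('mfb_deploy.prototxt', 'mfb_pretrained.caffemodel', caffe.TEST)

# Feed pre-extracted image features and an encoded question, then read the
# answer scores from an output blob (blob names depend on the deploy prototxt).
net.blobs['img_feature'].data[...] = np.load('img_feature.npy')
net.blobs['data'].data[...] = np.load('question_tokens.npy')
out = net.forward()
answer_scores = out['prediction']  # placeholder output blob name
```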

Pretrained Models

We release the pretrained single models "MFB(or MFH)+CoAtt+GloVe+VG" from the papers. To the best of our knowledge, our MFH+CoAtt+GloVe+VG model reports the best single-model result (test-dev) on both the VQA-1.0 and VQA-2.0 datasets (train + val + Visual Genome). The corresponding results are shown in the table below. The results JSON files (results.zip for VQA-1.0) are also included in the model folders and can be uploaded to the evaluation servers directly. Note that the models were trained with an old version of GloVe in spaCy. If you use the latest one, the embeddings may be inconsistent, leading to inferior performance, so we suggest training the model from scratch yourself.

Datasets \ Models | MCB | MFB | MFH | MFH (BUTD img features)
------------------|-----|-----|-----|------------------------
VQA-1.0 | 65.38% | 66.87% (BaiduYun) | 67.72% (BaiduYun or Dropbox) | 69.82%
VQA-2.0 | 62.33%¹ | 65.09% (BaiduYun) | 66.12% (BaiduYun or Dropbox) | 68.76%²

¹ The MCB result on VQA-2.0 is provided by the VQA Challenge organizers and does not use the GloVe embedding.

² overall: 68.76, yes/no: 84.27, num: 49.56, other: 59.89
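
For reference, the per-category numbers above follow the standard VQA accuracy metric, which gives a predicted answer credit in proportion to the number of human annotators who provided it. A minimal sketch of that formula (the official evaluation script additionally normalizes the answer strings and averages over leave-one-out subsets of the ten annotators):

```python
def vqa_accuracy(predicted_answer, human_answers):
    """Standard VQA accuracy: min(#annotators who gave this answer / 3, 1).

    `human_answers` is the list of 10 ground-truth answers for a question.
    The official script also normalizes answer strings and averages the
    score over all 10 leave-one-out subsets of annotators.
    """
    matches = sum(a == predicted_answer for a in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators said "red" -> full credit.
print(vqa_accuracy("red", ["red"] * 4 + ["maroon"] * 6))  # 1.0
```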

Training from Scratch

We provide the scripts for training two MFB models from scratch, in the mfb-baseline and mfb-coatt-glove folders. Simply run the Python scripts train_*.py to train the models from scratch.

  • Most of the hyper-parameters and configurations, with comments, are defined in the config.py file.
  • The solver configurations are defined in the get_solver function in the train_*.py scripts.
  • A pretrained GloVe word embedding model (via the spacy library) is required to train the mfb-coatt-glove model; a sketch of fetching word vectors follows this list. The installation instructions for spacy and the GloVe model can be found here.
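
As noted above, the pretrained models depend on the spaCy/GloVe version they were trained with. The snippet below is a rough sketch of how per-token word vectors for a question can be pulled from spaCy; it assumes a spaCy 2.x vectors package such as en_vectors_web_lg, which is not necessarily the exact version the authors used.

```python
import numpy as np
import spacy

# Assumes a spaCy 2.x vectors model, e.g. `python -m spacy download en_vectors_web_lg`.
# The original project used an older spaCy/GloVe setup, so vectors may differ.
nlp = spacy.load("en_vectors_web_lg")

def question_to_glove(question, max_len=15, dim=300):
    """Return a (max_len, dim) matrix of per-token word vectors, zero-padded."""
    mat = np.zeros((max_len, dim), dtype=np.float32)
    for i, token in enumerate(nlp(question)[:max_len]):
        mat[i] = token.vector
    return mat

emb = question_to_glove("What color is the umbrella?")
print(emb.shape)  # (15, 300)
```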

Evaluation

To generate an answers JSON file in the format expected by the VQA evaluation code and the VQA test server, you can use eval/ensemble.py. This script can also ensemble multiple models. Running python ensemble.py will print a help message telling you which arguments to use.
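
eval/ensemble.py is the authoritative tool here; the snippet below is only a conceptual sketch of what a simple score-averaging ensemble plus results-file export looks like. The file names and score format are hypothetical, but the output format (a JSON list of question_id/answer pairs) is the one the VQA evaluation server expects.

```python
import json
import numpy as np

# Hypothetical inputs: each model dumps an (N, num_answers) score matrix,
# plus a shared list of question ids and the candidate answer vocabulary.
score_files = ["model1_scores.npy", "model2_scores.npy"]
question_ids = json.load(open("question_ids.json"))
answer_vocab = json.load(open("answer_vocab.json"))

# Average the per-answer scores across models, then take the argmax answer.
avg_scores = np.mean([np.load(f) for f in score_files], axis=0)
results = [
    {"question_id": int(qid), "answer": answer_vocab[int(idx)]}
    for qid, idx in zip(question_ids, avg_scores.argmax(axis=1))
]

# Write the JSON list in the format the VQA evaluation server expects.
with open("vqa_results.json", "w") as f:
    json.dump(results, f)
```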

License

This code is distributed under the MIT License. The released models are only allowed for non-commercial use.

Citation

If the code is helpful for your research, please cite:

@inproceedings{yu2017mfb,
  title={Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering},
  author={Yu, Zhou and Yu, Jun and Fan, Jianping and Tao, Dacheng},
  booktitle={IEEE International Conference on Computer Vision (ICCV)},
  pages={1839--1848},
  year={2017}
}

@article{yu2018beyond,
  title={Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering},
  author={Yu, Zhou and Yu, Jun and Xiang, Chenchao and Fan, Jianping and Tao, Dacheng},
  journal={IEEE Transactions on Neural Networks and Learning Systems},
  volume={29},
  number={12},
  pages={5947--5959},
  year={2018}
}

Contact

Zhou Yu [yuz(AT)hdu.edu.cn]
