yuzcccc / Vqa Mfb

Projects that are alternatives of or similar to Vqa Mfb

MICCAI21 MMQ
Multiple Meta-model Quantifying for Medical Visual Question Answering
Stars: ✭ 16 (-89.54%)
Mutual labels:  vqa
Vqa.pytorch
Visual Question Answering in Pytorch
Stars: ✭ 602 (+293.46%)
Mutual labels:  vqa
Vqa
CloudCV Visual Question Answering Demo
Stars: ✭ 57 (-62.75%)
Mutual labels:  vqa
Awesome Visual Question Answering
A curated list of Visual Question Answering (VQA) (image/video question answering), Visual Question Generation, Visual Dialog, Visual Commonsense Reasoning, and related areas.
Stars: ✭ 295 (+92.81%)
Mutual labels:  vqa
Mac Network
Implementation for the paper "Compositional Attention Networks for Machine Reasoning" (Hudson and Manning, ICLR 2018)
Stars: ✭ 444 (+190.2%)
Mutual labels:  vqa
Visual Question Answering
📷 ❓ Visual Question Answering Demo and Algorithmia API
Stars: ✭ 18 (-88.24%)
Mutual labels:  vqa
rosita
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
Stars: ✭ 36 (-76.47%)
Mutual labels:  vqa
Papers
Some computer-vision papers I have read, covering image captioning, weakly supervised segmentation, etc.
Stars: ✭ 99 (-35.29%)
Mutual labels:  vqa
Mmf
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
Stars: ✭ 4,713 (+2980.39%)
Mutual labels:  vqa
Conditional Batch Norm
Pytorch implementation of NIPS 2017 paper "Modulating early visual processing by language"
Stars: ✭ 51 (-66.67%)
Mutual labels:  vqa
Tbd Nets
PyTorch implementation of "Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning"
Stars: ✭ 345 (+125.49%)
Mutual labels:  vqa
Awesome Vqa
Visual Q&A reading list
Stars: ✭ 403 (+163.4%)
Mutual labels:  vqa
Vizwiz Vqa Pytorch
PyTorch VQA implementation that achieved top performances in the (ECCV18) VizWiz Grand Challenge: Answering Visual Questions from Blind People
Stars: ✭ 33 (-78.43%)
Mutual labels:  vqa
Nscl Pytorch Release
PyTorch implementation for the Neuro-Symbolic Concept Learner (NS-CL).
Stars: ✭ 276 (+80.39%)
Mutual labels:  vqa
Mullowbivqa
Hadamard Product for Low-rank Bilinear Pooling
Stars: ✭ 57 (-62.75%)
Mutual labels:  vqa
bottom-up-features
Bottom-up feature extractor implemented in PyTorch.
Stars: ✭ 62 (-59.48%)
Mutual labels:  vqa
Bottom Up Attention Vqa
An efficient PyTorch implementation of the winning entry of the 2017 VQA Challenge.
Stars: ✭ 667 (+335.95%)
Mutual labels:  vqa
Vqa regat
Research Code for ICCV 2019 paper "Relation-aware Graph Attention Network for Visual Question Answering"
Stars: ✭ 129 (-15.69%)
Mutual labels:  vqa
Vqa Tensorflow
Tensorflow Implementation of Deeper LSTM+ normalized CNN for Visual Question Answering
Stars: ✭ 98 (-35.95%)
Mutual labels:  vqa
Bottom Up Attention
Bottom-up attention model for image captioning and VQA, based on Faster R-CNN and Visual Genome
Stars: ✭ 989 (+546.41%)
Mutual labels:  vqa

MFB and MFH for VQA

This project is deprecated! The PyTorch implementation of MFB(MFH)+CoAtt with pre-trained models, along with several other state-of-the-art VQA models, is maintained in our OpenVQA project, which is much more convenient to use!

This project is the implementation of the papers Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering (MFB) and Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual Question Answering (MFH). Compared with existing state-of-the-art approaches such as MCB and MLB, our MFB models achieve superior performance on the large-scale VQA-1.0 and VQA-2.0 datasets. Moreover, MFH, the high-order extension of MFB, is also provided and reports even better VQA performance. The MFB(MFH)+CoAtt network architecture for VQA is illustrated in Figure 1.

Figure 1: The MFB+CoAtt network architecture for VQA.
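
For readers who just want the gist of the fusion step: MFB projects the image and question features into a high-dimensional joint space, multiplies them element-wise, sum-pools over windows of size k, and applies power and L2 normalization. Below is a minimal PyTorch sketch of that idea. Note that this repository itself is Caffe-based; the dimensions, names, and dropout rate here are illustrative placeholders, not the project's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFBFusion(nn.Module):
    """Minimal sketch of Multi-modal Factorized Bilinear (MFB) pooling.

    Illustrative only: the real project is implemented in Caffe, and the
    dimensions below are placeholders rather than the paper's exact settings.
    """
    def __init__(self, img_dim=2048, ques_dim=1024, out_dim=1000, k=5, dropout=0.1):
        super().__init__()
        self.k = k
        self.out_dim = out_dim
        # Project both modalities into the k * out_dim joint space.
        self.proj_img = nn.Linear(img_dim, k * out_dim)
        self.proj_ques = nn.Linear(ques_dim, k * out_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, img_feat, ques_feat):
        # Element-wise product in the expanded space.
        joint = self.proj_img(img_feat) * self.proj_ques(ques_feat)
        joint = self.dropout(joint)
        # Sum pooling over non-overlapping windows of size k.
        joint = joint.view(-1, self.out_dim, self.k).sum(dim=2)
        # Signed square-root (power) normalization followed by L2 normalization.
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-12)
        return F.normalize(joint, dim=1)

# Example: fuse a 2048-d image feature with a 1024-d question feature.
z = MFBFusion()(torch.randn(8, 2048), torch.randn(8, 1024))  # -> (8, 1000)
```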

Update Dec. 2nd, 2017

A third-party PyTorch implementation of MFB(MFH) is released here. Great thanks to Liam!

Update Sep. 5th, 2017

Using the Bottom-Up and Top-Down (BUTD) image features (the model with adaptive K ranging from 10 to 100) available here, our single MFH+CoAtt+GloVe model achieved an overall accuracy of 68.76% on the test-dev set of the VQA-2.0 dataset. With an ensemble of 8 models, we achieved new state-of-the-art performance on the VQA-2.0 leaderboard with an overall accuracy of 70.92%.

Update Aug. 1st, 2017

Our solution for the VQA Challenge 2017 is updated!

We proposed a high-order extension of MFB, i.e., Multi-modal Factorized High-order Pooling (MFH). See the flowchart in Figure 2 and the implementations in the mfh_baseline and mfh-coatt-glove folders. With an ensemble of 9 MFH+CoAtt+GloVe(+VG) models, we won 2nd place (tied with another team) in the VQA Challenge 2017. Detailed information can be found in our paper (the second paper in the Citation section at the bottom of this page).

Figure 2: The high-order MFH model which consists of p MFB blocks (without sharing parameters).
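
Conceptually, MFH chains p MFB blocks: the expanded (pre-pooling) features of block i modulate block i+1, and the pooled outputs of all blocks are concatenated. Below is a hedged PyTorch sketch of that cascade; as with the previous snippet, the dimensions and names are illustrative placeholders, not this repository's Caffe implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFHFusion(nn.Module):
    """Sketch of Multi-modal Factorized High-order (MFH) pooling:
    a cascade of p MFB blocks whose pooled outputs are concatenated.
    Dimensions are placeholders, not the paper's exact configuration.
    """
    def __init__(self, img_dim=2048, ques_dim=1024, out_dim=1000, k=5, p=2, dropout=0.1):
        super().__init__()
        self.k, self.out_dim, self.p = k, out_dim, p
        self.proj_img = nn.ModuleList([nn.Linear(img_dim, k * out_dim) for _ in range(p)])
        self.proj_ques = nn.ModuleList([nn.Linear(ques_dim, k * out_dim) for _ in range(p)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, img_feat, ques_feat):
        outputs, exp_prev = [], 1.0
        for i in range(self.p):
            # Each block's expanded features are modulated by the previous block's.
            exp_i = exp_prev * self.dropout(self.proj_img[i](img_feat) * self.proj_ques[i](ques_feat))
            exp_prev = exp_i
            # Same sum pooling + power/L2 normalization as in a single MFB block.
            pooled = exp_i.view(-1, self.out_dim, self.k).sum(dim=2)
            pooled = torch.sign(pooled) * torch.sqrt(torch.abs(pooled) + 1e-12)
            outputs.append(F.normalize(pooled, dim=1))
        return torch.cat(outputs, dim=1)  # (batch, p * out_dim)

# Example: two cascaded blocks give a 2000-d fused vector.
z = MFHFusion()(torch.randn(8, 2048), torch.randn(8, 1024))  # -> (8, 2000)
```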

Prerequisites

Our code is implemented based on the high-quality vqa-mcb project. The data preprocessing and other prerequisites are the same as theirs. Before running our scripts to train or test the MFB model, see the Prerequisites and Data Preprocessing sections in the README of the vqa-mcb project first.

  • The Caffe version required for our MFB is slightly different from the one for MCB. We add some layers, e.g., sum pooling, permute, and KLD loss layers, to the feature/20160617_cb_softattention branch of Caffe for MCB. Please check out our Caffe version here and compile it. Note that cuDNN is currently not compatible with the sum pooling layer; you should switch it off to run the code correctly.
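
Once this customized Caffe fork is compiled with pycaffe support, a trained MFB/MFH network can be loaded for inference with the standard pycaffe pattern. A minimal sketch, assuming hypothetical file and blob names (use the deploy prototxt, weights, and blob names that ship with the model you actually download):

```python
import caffe
import numpy as np

# Use the customized Caffe fork described above (with sum pooling, permute,
# and KLD loss layers). File names below are placeholders.
caffe.set_device(0)
caffe.set_mode_gpu()
net = caffe.Net('mfb_deploy.prototxt', 'mfb_pretrained.caffemodel', caffe.TEST)

# Feed pre-extracted image features and an encoded question, then read the
# answer scores from an output blob (blob names depend on the deploy prototxt).
net.blobs['img_feature'].data[...] = np.load('img_feature.npy')
net.blobs['data'].data[...] = np.load('question_tokens.npy')
out = net.forward()
answer_scores = out['prediction']  # placeholder output blob name
```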

Pretrained Models

We release the pretrained single models "MFB(or MFH)+CoAtt+GloVe+VG" from the papers. To the best of our knowledge, our MFH+CoAtt+GloVe+VG model reports the best single-model result (test-dev) on both the VQA-1.0 and VQA-2.0 datasets (train + val + Visual Genome). The corresponding results are shown in the table below. The results JSON files (results.zip for VQA-1.0) are also included in the model folders and can be uploaded to the evaluation servers directly. Note that the models were trained with an old version of GloVe in spaCy. If you use the latest one, the embeddings may be inconsistent, leading to inferior performance, so we suggest training the model from scratch yourself.

Datasets \ Models | MCB | MFB | MFH | MFH (BUTD img features)
------------------|-----|-----|-----|------------------------
VQA-1.0 | 65.38% | 66.87% (BaiduYun) | 67.72% (BaiduYun or Dropbox) | 69.82%
VQA-2.0 | 62.33%¹ | 65.09% (BaiduYun) | 66.12% (BaiduYun or Dropbox) | 68.76%²

¹ The MCB result on VQA-2.0 is provided by the VQA Challenge organizers and does not use the GloVe embedding.

² overall: 68.76, yes/no: 84.27, num: 49.56, other: 59.89
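
For reference, the per-category numbers above follow the standard VQA accuracy metric, which gives a predicted answer credit in proportion to the number of human annotators who provided it. A minimal sketch of that formula (the official evaluation script additionally normalizes the answer strings and averages over leave-one-out subsets of the ten annotators):

```python
def vqa_accuracy(predicted_answer, human_answers):
    """Standard VQA accuracy: min(#annotators who gave this answer / 3, 1).

    `human_answers` is the list of 10 ground-truth answers for a question.
    The official script also normalizes answer strings and averages the
    score over all 10 leave-one-out subsets of annotators.
    """
    matches = sum(a == predicted_answer for a in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators said "red" -> full credit.
print(vqa_accuracy("red", ["red"] * 4 + ["maroon"] * 6))  # 1.0
```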

Training from Scratch

We provide the scripts for training two MFB models from scratch, in the mfb-baseline and mfb-coatt-glove folders. Simply run the Python scripts train_*.py to train the models from scratch.

  • Most of the hyper-parameters and configurations, with comments, are defined in the config.py file.
  • The solver configurations are defined in the get_solver function in the train_*.py scripts.
  • A pretrained GloVe word embedding model (via the spacy library) is required to train the mfb-coatt-glove model; a sketch of fetching word vectors follows this list. The installation instructions for spacy and the GloVe model can be found here.
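
As noted above, the pretrained models depend on the spaCy/GloVe version they were trained with. The snippet below is a rough sketch of how per-token word vectors for a question can be pulled from spaCy; it assumes a spaCy 2.x vectors package such as en_vectors_web_lg, which is not necessarily the exact version the authors used.

```python
import numpy as np
import spacy

# Assumes a spaCy 2.x vectors model, e.g. `python -m spacy download en_vectors_web_lg`.
# The original project used an older spaCy/GloVe setup, so vectors may differ.
nlp = spacy.load("en_vectors_web_lg")

def question_to_glove(question, max_len=15, dim=300):
    """Return a (max_len, dim) matrix of per-token word vectors, zero-padded."""
    mat = np.zeros((max_len, dim), dtype=np.float32)
    for i, token in enumerate(nlp(question)[:max_len]):
        mat[i] = token.vector
    return mat

emb = question_to_glove("What color is the umbrella?")
print(emb.shape)  # (15, 300)
```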

Evaluation

To generate an answers JSON file in the format expected by the VQA evaluation code and the VQA test server, you can use eval/ensemble.py. This script can also ensemble multiple models. Running python ensemble.py will print a help message telling you which arguments to use.
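
eval/ensemble.py is the authoritative tool here; the snippet below is only a conceptual sketch of what a simple score-averaging ensemble plus results-file export looks like. The file names and score format are hypothetical, but the output format (a JSON list of question_id/answer pairs) is the one the VQA evaluation server expects.

```python
import json
import numpy as np

# Hypothetical inputs: each model dumps an (N, num_answers) score matrix,
# plus a shared list of question ids and the candidate answer vocabulary.
score_files = ["model1_scores.npy", "model2_scores.npy"]
question_ids = json.load(open("question_ids.json"))
answer_vocab = json.load(open("answer_vocab.json"))

# Average the per-answer scores across models, then take the argmax answer.
avg_scores = np.mean([np.load(f) for f in score_files], axis=0)
results = [
    {"question_id": int(qid), "answer": answer_vocab[int(idx)]}
    for qid, idx in zip(question_ids, avg_scores.argmax(axis=1))
]

# Write the JSON list in the format the VQA evaluation server expects.
with open("vqa_results.json", "w") as f:
    json.dump(results, f)
```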

License

This code is distributed under the MIT License. The released models are only allowed for non-commercial use.

Citation

If the code is helpful for your research, please cite:

@inproceedings{yu2017mfb,
  title={Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering},
  author={Yu, Zhou and Yu, Jun and Fan, Jianping and Tao, Dacheng},
  booktitle={IEEE International Conference on Computer Vision (ICCV)},
  pages={1839--1848},
  year={2017}
}

@article{yu2018beyond,
  title={Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering},
  author={Yu, Zhou and Yu, Jun and Xiang, Chenchao and Fan, Jianping and Tao, Dacheng},
  journal={IEEE Transactions on Neural Networks and Learning Systems},
  volume={29},
  number={12},
  pages={5947--5959},
  year={2018}
}

Contact

Zhou Yu [yuz(AT)hdu.edu.cn]
