
yukezhu / visual7w-toolkit

License: MIT License
Toolkit for Visual7W visual question answering dataset

Programming Languages

Python
Shell

Projects that are alternatives to or similar to visual7w-toolkit

table-evaluator
Evaluate real and synthetic datasets with each other
Stars: ✭ 44 (-25.42%)
Mutual labels:  evaluation
eval-estree-expression
Safely evaluate JavaScript (estree) expressions, sync and async.
Stars: ✭ 22 (-62.71%)
Mutual labels:  evaluation
edd
Erlang Declarative Debugger
Stars: ✭ 20 (-66.1%)
Mutual labels:  evaluation
pdq evaluation
Evaluation code for using probabilistic detection quality (PDQ) measure for probabilistic object detection tasks. Currently supports COCO and robotic vision challenge (RVC) data.
Stars: ✭ 34 (-42.37%)
Mutual labels:  evaluation
Machine-learning-toolkits-with-python
Machine learning toolkits with Python
Stars: ✭ 31 (-47.46%)
Mutual labels:  evaluation
midi degradation toolkit
A toolkit for generating datasets of midi files which have been degraded to be 'un-musical'.
Stars: ✭ 29 (-50.85%)
Mutual labels:  evaluation
texpr
Boolean evaluation and digital calculation expression engine for GO
Stars: ✭ 18 (-69.49%)
Mutual labels:  evaluation
summary-explorer
Summary Explorer is a tool to visually explore the state-of-the-art in text summarization.
Stars: ✭ 34 (-42.37%)
Mutual labels:  evaluation
travelling-salesman
Rules for Kiwi.com travelling salesman competition
Stars: ✭ 14 (-76.27%)
Mutual labels:  evaluation
WhatsMissingInGeoparsing
The accompanying code and data for the Springer 2017 publication "What's missing in geographical parsing?" in Language Resources and Evaluation.
Stars: ✭ 15 (-74.58%)
Mutual labels:  evaluation
datasets
🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
Stars: ✭ 13,870 (+23408.47%)
Mutual labels:  evaluation
CBLUE
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
Stars: ✭ 379 (+542.37%)
Mutual labels:  evaluation
AIODrive
Official Python/PyTorch Implementation for "All-In-One Drive: A Large-Scale Comprehensive Perception Dataset with High-Density Long-Range Point Clouds"
Stars: ✭ 32 (-45.76%)
Mutual labels:  evaluation
PySODEvalToolkit
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
Stars: ✭ 59 (+0%)
Mutual labels:  evaluation
webMUSHRA
a MUSHRA compliant web audio API based experiment software
Stars: ✭ 171 (+189.83%)
Mutual labels:  evaluation
go-eek
Blazingly fast and safe Go evaluation library, created on top of Go pkg/plugin package
Stars: ✭ 37 (-37.29%)
Mutual labels:  evaluation
image-matching-toolbox
This is a toolbox repository to help evaluate various methods that perform image matching from a pair of images.
Stars: ✭ 252 (+327.12%)
Mutual labels:  evaluation
booleval
Header-only C++17 library for evaluating logical expressions.
Stars: ✭ 54 (-8.47%)
Mutual labels:  evaluation
NLP-tools
Useful python NLP tools (evaluation, GUI interface, tokenization)
Stars: ✭ 39 (-33.9%)
Mutual labels:  evaluation
expression-evaluator
A math expression evaluator built with JavaScript
Stars: ✭ 25 (-57.63%)
Mutual labels:  evaluation

Visual7W Toolkit

[Figure: telling QA examples (top row) and pointing QA examples (bottom row) from the Visual7W dataset]

Introduction

Visual7W is a large-scale visual question answering (QA) dataset with object-level groundings and multimodal answers. Each question starts with one of the seven Ws: what, where, when, who, why, how, and which. Please check out our CVPR'16 paper for more details. This toolkit is used for parsing the dataset files and evaluating model performance. Please contact Yuke Zhu for questions, comments, or bug reports.

Dataset Overview

The Visual7W dataset was collected on 47,300 COCO images. In total, it has 327,939 QA pairs, together with 1,311,756 human-generated multiple-choice answers and 561,459 object groundings from 36,579 categories. In addition, we provide complete grounding annotations that link the object mentions in the QA sentences to their bounding boxes in the images, and therefore introduce a new QA type with image regions as the visually grounded answers. We refer to questions with textual answers as telling QA and to those with visual answers as pointing QA. The figure above shows examples from the Visual7W dataset: the first row shows telling QA examples and the second row shows pointing QA examples.

Visual7W constitutes a part of the Visual Genome project. Visual Genome contains 1.7 million QA pairs of the 7W question types, which offers the largest visual QA collection to date for training models. The QA pairs in Visual7W are a subset of the 1.7 million QA pairs from Visual Genome. Moreover, Visual7W includes extra annotations such as object groundings, multiple choices and human experiments, making it a clean and complete benchmark for evaluation and analysis.

Evaluation Methods

We use two evaluation methods to measure performance. Multiple-choice evaluation aims at selecting the correct option from a predefined pool of candidate answers. Open-ended evaluation aims at predicting a freeform textual answer given a question and the image. This toolkit provides utility functions to evaluate performance with both methods. We explain the details of the two methods below.

  1. Multiple-choice QA: We provide four human-generated multiple-choice answers for each question, one of which is the ground truth. We say the model is correct on a question if it selects the correct answer candidate. Accuracy is used to measure the performance. This is the default (and recommended) evaluation method for Visual7W.

  2. Open-ended QA: similar to the top-5 criterion used in the ImageNet challenges, we let the model make k different freeform predictions. We say the model is correct on a question if one of the k predictions exactly matches the ground truth. Accuracy is used to measure the performance. This evaluation method only applies to telling QA tasks with textual answers. A sketch of both metrics follows this list.
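
A minimal sketch of the two metrics is shown below. It is illustrative only and is not the toolkit's evaluate.py; the function names and argument layout are assumptions made for this example.

def multiple_choice_accuracy(selected, ground_truth):
    # Fraction of questions where the selected candidate equals the ground-truth answer.
    correct = sum(1 for s, gt in zip(selected, ground_truth) if s == gt)
    return correct / len(ground_truth)

def open_ended_accuracy(predictions, ground_truth):
    # One-of-k metric: a question counts as correct if any of its k freeform
    # predictions matches the ground-truth answer exactly.
    correct = sum(1 for preds, gt in zip(predictions, ground_truth) if gt in preds)
    return correct / len(ground_truth)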

How to Use

Before using this toolkit, make sure that you have downloaded the Visual7W dataset. You can use our download script in datasets/[dataset-name]/download_dataset.sh to fetch the dataset JSON file to the local disk.
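
For example, to fetch the telling QA data (assuming the directory name matches the --dataset values used later in this README), run something like:

bash datasets/visual7w-telling/download_dataset.sh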

Telling QA

We implement a most-frequent-answer (MFA) baseline in predict_baseline.py. For open-ended evaluation, we use the top-k most frequent training set answers as the predictions for all test questions. For multiple-choice evaluation, we select the candidate answer with the highest training set frequency for each test question.
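
Conceptually, the MFA baseline boils down to counting answers in the training set. The sketch below illustrates the idea under assumed variable names (train_answers, test_candidates); it is not the code in predict_baseline.py.

from collections import Counter

def mfa_open_ended(train_answers, k):
    # Use the top-k most frequent training-set answers as the predictions
    # for every open-ended test question.
    counts = Counter(train_answers)
    return [answer for answer, _ in counts.most_common(k)]

def mfa_multiple_choice(train_answers, test_candidates):
    # For each test question, pick the candidate answer seen most often in training.
    counts = Counter(train_answers)
    return [max(candidates, key=lambda c: counts.get(c, 0))
            for candidates in test_candidates]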

In this demo, we perform open-ended evaluation for telling QA. To run the MFA baseline on the validation set, use the following command:

python predict_baseline.py --dataset visual7w-telling \
                           --mode open \
                           --topk 100 \
                           --split val \
                           --result_path results

It will generate a prediction file result_visual7w-telling_open.json in the results folder. Type python predict_baseline.py -h to learn more about the input arguments.

The script below shows how to use the evaluation script evaluate.py to check the performance of the open-ended predictions in the result_visual7w-telling_open.json file. Type python evaluate.py -h to learn more about the input arguments.

python evaluate.py --dataset visual7w-telling \
                   --mode open \
                   --topk 100 \
                   --split val \
                   --results results/result_visual7w-telling_open.json \
                   --verbose 1

You should see results similar to the following:

2015-11-16 19:18:48,834 Evaluated on 28,020 QA pairs with top-100 predictions.
2015-11-16 19:18:48,834 Overall accuracy = 0.370
2015-11-16 19:18:48,835 Question type "what" accuracy = 0.377 (5011 / 13296)
2015-11-16 19:18:48,835 Question type "who" accuracy = 0.377 (1086 / 2879)
2015-11-16 19:18:48,835 Question type "when" accuracy = 0.529 (668 / 1262)
2015-11-16 19:18:48,835 Question type "how" accuracy = 0.726 (3056 / 4211)
2015-11-16 19:18:48,835 Question type "where" accuracy = 0.100 (459 / 4590)
2015-11-16 19:18:48,835 Question type "why" accuracy = 0.051 (91 / 1782)

Similarly, we can perform multiple-choice evaluation on the MFA baseline.
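
If you have not yet produced the multiple-choice predictions, first run the MFA baseline in mc mode (this mirrors the open-ended command above and should write results/result_visual7w-telling_mc.json, following the same naming pattern):

python predict_baseline.py --dataset visual7w-telling \
                           --mode mc \
                           --split val \
                           --result_path results

Then evaluate the result file: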

python evaluate.py --dataset visual7w-telling \
                   --mode mc \
                   --split val \
                   --results results/result_visual7w-telling_mc.json \
                   --verbose 1

Change the mode parameter to mc when performing multiple-choice evaluation.

Pointing QA

Similarly, we can use the toolkit to evaluate pointing QA. For demonstration purposes, we implement a very simple baseline, which picks a random answer out of the four multiple-choice candidates. You can run the baseline as follows. Please make sure that you have downloaded the dataset JSON before running the code.

python predict_baseline.py --dataset visual7w-pointing \
                           --mode mc \
                           --split val \
                           --result_path results
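
For reference, the random-choice idea amounts to no more than the following sketch (with assumed variable names; not the toolkit's implementation):

import random

def random_pointing_baseline(test_candidates, seed=123):
    # Pick one of the four multiple-choice candidates uniformly at random
    # for each pointing question.
    rng = random.Random(seed)
    return [rng.choice(candidates) for candidates in test_candidates]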

In this case, we can still use evaluate.py to evaluate the performance. You should expect something very close to chance performance (25%). Let's see if that is true.

python evaluate.py --dataset visual7w-pointing \
                   --mode mc \
                   --split val \
                   --results results/result_visual7w-pointing_mc.json \
                   --verbose 1

Here is what I got.

2015-11-13 14:45:56,363 Evaluated on 36,990 QA pairs with top-1 predictions.
2015-11-13 14:45:56,363 Overall accuracy = 0.249
2015-11-13 14:45:56,363 Question type "which" accuracy = 0.249 (9209 / 36990)

Evaluating Your Own Models

In order to evaluate your own model, please check the format of the sample outputs produced by the baseline script. In short, a prediction file contains, for each question, a list of predicted answers in its candidates array. For multiple-choice QA, the candidates array contains exactly one element: the selected multiple-choice option. For open-ended QA, the candidates array can contain up to k predictions, which are scored with the one-of-k metric.

More specifically, the result format is as follows:

[
	{
		"candidates": [
			<answer-id for pointing QAs> OR <answer-string for telling QAs>,
			...
		],
		"question": <question-string>,
		"qa_id": <qa-id>
	},
	...
]
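
As a concrete illustration, a hypothetical sketch for dumping your own model's predictions into this format could look like the following; my_model_predict and test_questions are placeholders for your model and your parsed dataset entries, and only the output schema follows the toolkit.

import json

def write_results(test_questions, my_model_predict, out_path, k=1):
    # For multiple-choice QA, my_model_predict should return a single-element list
    # (the selected option); for open-ended QA it may return up to k answer strings.
    results = []
    for qa in test_questions:
        results.append({
            "candidates": my_model_predict(qa)[:k],
            "question": qa["question"],
            "qa_id": qa["qa_id"],
        })
    with open(out_path, "w") as f:
        json.dump(results, f)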

Reference

Please acknowledge our CVPR'16 paper if you use this code.

@InProceedings{zhu2016cvpr,
  title = {{Visual7W: Grounded Question Answering in Images}},
  author = {Yuke Zhu and Oliver Groth and Michael Bernstein and Li Fei-Fei},
  booktitle = {{IEEE Conference on Computer Vision and Pattern Recognition}},
  year = 2016,
}