MAttNet: Modular Attention Network for Referring Expression Comprehension


PyTorch Implementation of MAttNet

Introduction

This repository is the PyTorch implementation of MAttNet: Modular Attention Network for Referring Expression Comprehension (CVPR 2018). Referring expressions are natural language utterances that indicate particular objects within a scene, e.g., "the woman in the red sweater" or "the man on the right". For robots or other intelligent agents communicating with people in the world, the ability to accurately comprehend such expressions is a necessary component of natural interaction. In this project, we address referring expression comprehension: localizing an image region described by a natural language expression. Check our paper and online demo for more details and examples.
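MAttNet decomposes each expression into subject, location, and relationship components, scores every candidate region with a corresponding visual module, and combines the three module scores using weights predicted from the language. The snippet below is only a simplified sketch of that weighted fusion, not the code in this repository; the function name, tensor shapes, and the way the weights are produced are assumptions for illustration.

import torch

def combine_module_scores(weights, subj_score, loc_score, rel_score):
    # weights:  (batch, 3) softmax weights for the subject, location,
    #           and relationship modules, predicted from the expression.
    # *_score:  (batch, num_regions) matching scores from each module.
    w_subj, w_loc, w_rel = weights[:, 0:1], weights[:, 1:2], weights[:, 2:3]
    return w_subj * subj_score + w_loc * loc_score + w_rel * rel_score

# Toy usage: one expression, five candidate regions, random scores.
weights = torch.softmax(torch.randn(1, 3), dim=1)
overall = combine_module_scores(weights,
                                torch.randn(1, 5),
                                torch.randn(1, 5),
                                torch.randn(1, 5))
predicted_region = overall.argmax(dim=1)  # region with the highest overall score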

Prerequisites

  • Python 2.7
  • PyTorch 0.2 (may not work with 1.0 or higher)
  • CUDA 8.0
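If you are unsure whether your environment matches these requirements, a quick check along these lines may help (a generic snippet, not part of this repository):

import sys
import torch

print("Python:", sys.version.split()[0])      # expect 2.7.x
print("PyTorch:", torch.__version__)          # expect 0.2.x
print("CUDA available:", torch.cuda.is_available())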

Installation

  1. Clone the MAttNet repository
git clone --recursive https://github.com/lichengunc/MAttNet
  2. Prepare the submodules and associated data
  • Mask R-CNN: Follow the instructions in my mask-faster-rcnn repo to prepare everything needed for pyutils/mask-faster-rcnn. You can use cv/mrcn_detection.ipynb to check that Mask R-CNN is set up correctly.

  • REFER API and data: Download the data via the links in REFER, go to its folder, and run make (see the shell sketch after this list). Follow data/README.md to prepare the images and the refcoco/refcoco+/refcocog annotations.

  • refer-parser2: Follow the instructions in refer-parser2 to extract the parsed expressions using Vicente's R1-R7 attributes. Note this sub-module is only needed if you want to train the models yourself.
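As a rough sketch of the REFER setup step, assuming the submodule is checked out under pyutils/refer (the exact path and build target come from the REFER repo, not this README):

cd pyutils/refer
make        # builds the mask API that REFER depends on
cd ../..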

Training

  1. Prepare the training and evaluation data by running tools/prepro.py:
python tools/prepro.py --dataset refcoco --splitBy unc
  2. Extract features using Mask R-CNN, where the head_feats are used in subject module training and the ann_feats are used in relationship module training.
CUDA_VISIBLE_DEVICES=gpu_id python tools/extract_mrcn_head_feats.py --dataset refcoco --splitBy unc
CUDA_VISIBLE_DEVICES=gpu_id python tools/extract_mrcn_ann_feats.py --dataset refcoco --splitBy unc
  3. Detect objects/masks and extract their features (only needed if you want to evaluate automatic comprehension). We empirically set the confidence threshold of Mask R-CNN to 0.65; a small filtering sketch follows this list.
CUDA_VISIBLE_DEVICES=gpu_id python tools/run_detect.py --dataset refcoco --splitBy unc --conf_thresh 0.65
CUDA_VISIBLE_DEVICES=gpu_id python tools/run_detect_to_mask.py --dataset refcoco --splitBy unc
CUDA_VISIBLE_DEVICES=gpu_id python tools/extract_mrcn_det_feats.py --dataset refcoco --splitBy unc
  4. Train MAttNet with ground-truth annotation:
./experiments/scripts/train_mattnet.sh GPU_ID refcoco unc
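For intuition, the confidence threshold in step 3 simply keeps detections whose score is at least 0.65. A generic sketch, not the repository's detection code (the array layout is an assumption):

import numpy as np

def filter_detections(boxes, scores, conf_thresh=0.65):
    # boxes:  (N, 4) array of [x1, y1, x2, y2] detections.
    # scores: (N,) array of detector confidences.
    keep = scores >= conf_thresh
    return boxes[keep], scores[keep]

# Toy usage: only the first two detections clear the threshold.
boxes = np.array([[10, 10, 50, 80], [30, 40, 90, 120], [5, 5, 20, 20]], dtype=np.float32)
scores = np.array([0.95, 0.70, 0.40], dtype=np.float32)
kept_boxes, kept_scores = filter_detections(boxes, scores)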

During training, you may want to use cv/inpect_cv.ipynb to check the training/validation curves and do cross validation.

Evaluation

Evaluate MAttNet with ground-truth annotation:

./experiments/scripts/eval_easy.sh GPU_ID refcoco unc

If you have already detected/extracted the Mask R-CNN results (Training Step 3 above), you can now evaluate the automatic comprehension accuracy using Mask R-CNN detections and segmentations (a sketch of the IoU-based metric follows the commands):

./experiments/scripts/eval_dets.sh GPU_ID refcoco unc
./experiments/scripts/eval_masks.sh GPU_ID refcoco unc
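Comprehension accuracy for detected boxes is conventionally measured as the fraction of expressions whose predicted region overlaps the ground-truth box with IoU of at least 0.5. A minimal sketch of that metric, not the evaluation script itself:

def box_iou(box_a, box_b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def comprehension_accuracy(pred_boxes, gt_boxes, iou_thresh=0.5):
    # Fraction of expressions whose predicted box matches the ground truth.
    hits = [box_iou(p, g) >= iou_thresh for p, g in zip(pred_boxes, gt_boxes)]
    return sum(hits) / float(len(hits))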

Pre-trained Models

In order to reproduce the results in our paper, follow Training Steps 1-3 for data and feature preparation, then run the evaluation commands above. We provide pre-trained models for RefCOCO, RefCOCO+, and RefCOCOg. Download them and put them under the ./output folder (an unpacking example follows the tables below).

  1. RefCOCO: Pre-trained model (56M)

|                           | val    | test A | test B |
| ------------------------- | ------ | ------ | ------ |
| Localization (gt-box)     | 85.57% | 85.95% | 84.36% |
| Localization (Mask R-CNN) | 76.65% | 81.14% | 69.99% |
| Segmentation (Mask R-CNN) | 75.16% | 79.55% | 68.87% |
  2. RefCOCO+: Pre-trained model (56M)

|                           | val    | test A | test B |
| ------------------------- | ------ | ------ | ------ |
| Localization (gt-box)     | 71.71% | 74.28% | 66.27% |
| Localization (Mask R-CNN) | 65.33% | 71.62% | 56.02% |
| Segmentation (Mask R-CNN) | 64.11% | 70.12% | 54.82% |
  3. RefCOCOg: Pre-trained model (58M)

|                           | val    | test   |
| ------------------------- | ------ | ------ |
| Localization (gt-box)     | 78.96% | 78.51% |
| Localization (Mask R-CNN) | 66.58% | 67.27% |
| Segmentation (Mask R-CNN) | 64.48% | 65.60% |
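As an example of placing a downloaded model under ./output (the archive name and format below are placeholders; use whatever the download link actually provides):

mkdir -p output
tar -xzvf <downloaded_model_archive>.tar.gz -C output/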

Pre-computed detections/masks

We provide the detected boxes/masks for those who are interested in automatic comprehension; they were produced by Training Step 3 above. Note that our Mask R-CNN is trained on COCO's training images, excluding those in the RefCOCO, RefCOCO+, and RefCOCOg validation and test sets. For that reason, it would be unfair to use other off-the-shelf detectors trained on the whole COCO set for this task.

Demo

Run cv/example_demo.ipynb for a demo example. You can also check out our Online Demo.

Citation

@inproceedings{yu2018mattnet,
  title={MAttNet: Modular Attention Network for Referring Expression Comprehension},
  author={Yu, Licheng and Lin, Zhe and Shen, Xiaohui and Yang, Jimei and Lu, Xin and Bansal, Mohit and Berg, Tamara L},
  booktitle={CVPR},
  year={2018}
}

License

MAttNet is released under the MIT License (refer to the LICENSE file for details).

A few notes

I'd like to share several thoughts after working on Referring Expressions for 3 years (since 2015):

  • Model Improvement: I'm satisfied with this model architecture, but I still feel the context information is not fully exploited. We tried the context of visual comparison in our ECCV 2016 work; it worked well but relied too much on the detector, which is why I removed the appearance difference in this paper. (Location comparison remains, as it is too important to drop.) I'm looking forward to seeing more robust and interesting forms of context proposed in the future. Another direction is end-to-end multi-task training. The current model loses some concepts after going through Mask R-CNN; for example, Mask R-CNN can perfectly detect a (big) sports ball in an image, but MAttNet can no longer recognize it. The reason is that we train the two models separately and our RefCOCO dataset does not have ball-related expressions.

  • Borrowing External Concepts: Current datasets (RefCOCO, RefCOCO+, RefCOCOg) are biased toward the person category; around half of the expressions refer to people. However, in real life people may also want to refer to other common objects (cup, bottle, book) or even stuff (sky, tree, building). As RefCOCO already provides common referring expression structure, the (only) missing piece is universal object/stuff concepts, which could be borrowed from external datasets/tasks.

  • Referring Expression Generation (REG): Surprisingly few papers work on the referring expression generation task so far! Dialogue is important, and referring to things is always the first step in computer-to-human interaction. (I don't think people would love to use a passive computer or robot that cannot talk.) In our CVPR 2017 work, we collected more testing expressions for better REG evaluation. (Check REFER2 for the data; the only difference from REFER is that it contains more testing expressions for RefCOCO and RefCOCO+.) While we achieved state-of-the-art results in that paper, there is still plenty of room for improvement. Our speaker model can only utter "boring" and "safe" expressions, and thus cannot specify every object in an image well. A GAN or a modular speaker might be an effective weapon for future work.

  • Data Collection: A larger referring expressions dataset is apparently the most straightforward way to improve the performance of any model. You might have two questions: 1) What data should we collect? 2) How do we collect it? A larger referring expression dataset covering the whole of MS COCO is expected (of course); this would also make end-to-end learning possible in the future. Task-specific datasets are also interesting: since ReferIt Game, there have been several datasets in different domains, e.g., video, dialogue, and spoken language. Note that you should be careful about the problem setting; randomly fitting referring expressions into a task (just for paper publication) is boring. As for the collection method, I prefer the approach used in our early work ReferIt Game. The collected expressions may be slightly short (compared with image captioning datasets), but that is how we refer to things naturally in daily life.

Authorship

This project is maintained by Licheng Yu.
