Dual Attention Network

This repository contains the code (using TensorFlow) and models for the following CVPR 2017 paper (image-to-text and text-to-image retrieval tasks):

Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim.
"Dual Attention Networks for Multimodal Reasoning and Matching."
In Proc. CVPR 2017.

Thanks to instructions from the author (Hyeonseob Nam), I was able to reproduce the numbers reported in the paper on Flickr30k:

                      Image-to-Text                 Text-to-Image
Method                R@1    R@5    R@10   MR       R@1    R@5    R@10   MR
DAN Paper             55.0   81.8   89.0   1        39.4   69.2   79.1   2
This Implementation   54.4   82.4   89.9   1.0      39.8   71.4   80.9   2

(R@K: recall at K, higher is better; MR: median rank, lower is better.)
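
For reference, R@K and MR can be computed from a query-candidate similarity matrix as in the generic sketch below (plain NumPy, not code from this repository; the random scores are placeholders):

import numpy as np

def retrieval_metrics(sim, gt_index):
    # sim:      (num_queries, num_candidates) similarity scores
    # gt_index: (num_queries,) index of the correct candidate for each query
    order = np.argsort(-sim, axis=1)                 # candidates sorted best-first
    ranks = np.array([int(np.where(order[i] == gt_index[i])[0][0]) + 1
                      for i in range(sim.shape[0])]) # rank of ground truth (1 = best)
    recall = {k: 100.0 * float(np.mean(ranks <= k)) for k in (1, 5, 10)}
    return recall, float(np.median(ranks))           # R@1/5/10 and median rank

# Toy usage with random placeholder scores.
recall, med_rank = retrieval_metrics(np.random.rand(1000, 1000), np.arange(1000))
print(recall, med_rank)

On Flickr30k each image has five captions, so the image-to-text direction is one-to-many; the sketch above only covers the simpler one-to-one case.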

Dependencies

  • Python 2.7; TensorFlow >= 1.4.0; tqdm and nltk (for preprocessing)
  • Flickr30k Images and Text
  • Dataset splits from here (the same split as used in m-RNN)
  • Pretrained ResNet-152 model from Tensorpack

Training

  1. Extract ResNet features
$ python resnet-extractor/extract.py flickr30k_images/ ImageNet-ResNet152.npz resnet-152 --batch_size 20 --resize 448 --depth 152
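
As a quick sanity check on the extracted features (assuming extract.py writes one .npy file per image into resnet-152/; the file name below is only a placeholder, so check the script for its actual output format):

import numpy as np

# Placeholder file name; adjust to whatever extract.py actually writes.
feat = np.load("resnet-152/1000092795.jpg.npy")
# The spatial feature map should match --feat_dim 14,14,2048 used for training below.
print(feat.shape)  # expected: (14, 14, 2048)
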
  2. Preprocess
$ python prepro_flickr30k.py splits/ results_20130124.token prepro --noword2vec --noimgfeat
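
For context, results_20130124.token stores five captions per image, one per line, in the form image_name#caption_index<TAB>caption. Below is a minimal sketch of reading and tokenizing it with nltk (an illustration of the input format only, not the repository's prepro_flickr30k.py):

import nltk  # requires the "punkt" tokenizer data: nltk.download("punkt")

captions = {}  # image file name -> list of tokenized captions
with open("results_20130124.token") as f:
    for line in f:
        img_tag, sent = line.rstrip("\n").split("\t")
        img = img_tag.split("#")[0]  # e.g. "1000092795.jpg"
        captions.setdefault(img, []).append(nltk.word_tokenize(sent.lower()))

print(len(captions))  # roughly 31k images, 5 captions each
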
  3. Training

I use a slightly different training schedule: batch size 256, a learning rate of 0.1, and 0.5 dropout for the first 60 epochs, then 0.8 dropout and a learning rate of 0.05 for the remaining epochs. I also use Adadelta as the optimizer. Training takes up to 9 GB of GPU memory and about 50 hours (with SSD storage).

(There are other options, e.g. --use_char and --concat, that I haven't tried with hard negative mining yet.)

$ python main.py prepro models dan --no_wordvec --word_emb_size 512 --num_hops 2 --word_count_thres 1 --sent_size_thres 200 --word_size_thres 20 --hidden_size 512 --keep_prob 0.5 --margin 100 --num_epochs 60 --save_period 1000 --batch_size 256 --clip_gradient_norm 0.1 --init_lr 0.1 --wd 0.0005 --featpath resnet-152/ --feat_dim 14,14,2048 --hn_num 32 --is_train
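
The --margin and --hn_num flags in the command above correspond to a max-margin ranking loss over the hardest negatives in the batch. The sketch below shows the general idea in plain NumPy; it is not the repository's loss code, and the function and variable names are made up:

import numpy as np

def ranking_loss(img_emb, txt_emb, margin=100.0, hn_num=32):
    # img_emb, txt_emb: (batch, dim) embeddings; row i of each is a matching pair.
    sim = img_emb @ txt_emb.T                   # (batch, batch) similarity scores
    pos = np.diag(sim)                          # scores of the matching pairs
    # Hinge cost for every non-matching pair, in both retrieval directions.
    cost_i2t = np.maximum(0.0, margin - pos[:, None] + sim)  # image -> text
    cost_t2i = np.maximum(0.0, margin - pos[None, :] + sim)  # text -> image
    np.fill_diagonal(cost_i2t, 0.0)
    np.fill_diagonal(cost_t2i, 0.0)
    # Keep only the hn_num hardest (largest-cost) negatives per query.
    hard_i2t = -np.sort(-cost_i2t, axis=1)[:, :hn_num]
    hard_t2i = -np.sort(-cost_t2i, axis=0)[:hn_num, :]
    return hard_i2t.sum() + hard_t2i.sum()
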
  4. Testing with the trained model. You can download my model and put it in models/00/dan/best/ to run it directly. Also put shared.p in models/00/dan/.
$ python main.py prepro models dan --no_wordvec --word_emb_size 512 --num_hops 2 --word_count_thres 1 --sent_size_thres 200 --word_size_thres 20 --hidden_size 512 --keep_prob 0.5 --margin 100 --num_epochs 60 --save_period 1000 --batch_size 256 --clip_gradient_norm 0.1 --init_lr 0.1 --wd 0.0005 --featpath resnet-152/ --feat_dim 14,14,2048 --hn_num 32 --is_test --load_best
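
For reference, the layout implied by the paths above (the checkpoint file names inside best/ depend on the downloaded model and are not listed here):

models/
  00/
    dan/
      shared.p
      best/    <- downloaded model files go here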