
kuanghuei / Scan

License: Apache-2.0
PyTorch source code for "Stacked Cross Attention for Image-Text Matching" (ECCV 2018)

Programming Languages

python

Projects that are alternatives to or similar to Scan

Image-Captioining
The objective is to generate a textual description of an image based on the objects and actions it contains, using generative models so that novel sentences are created. Pipeline-type models use two separate learning processes, one for language modelling and the other for image recognition. It first identifies objects in the image and prov…
Stars: ✭ 20 (-93.46%)
Mutual labels:  image-captioning
localized-narratives
Localized Narratives
Stars: ✭ 60 (-80.39%)
Mutual labels:  image-captioning
captioning chainer
A fast implementation of Neural Image Caption in Chainer
Stars: ✭ 17 (-94.44%)
Mutual labels:  image-captioning
MIA
Code for "Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations" (NeurIPS 2019)
Stars: ✭ 57 (-81.37%)
Mutual labels:  image-captioning
Awesome-Captioning
A curated list of multimodal captioning related research (including image captioning, video captioning, and text captioning)
Stars: ✭ 56 (-81.7%)
Mutual labels:  image-captioning
Machine-Learning
The projects I do in machine learning with PyTorch, Keras, TensorFlow, scikit-learn and Python.
Stars: ✭ 54 (-82.35%)
Mutual labels:  image-captioning
Show and Tell
Show and Tell : A Neural Image Caption Generator
Stars: ✭ 74 (-75.82%)
Mutual labels:  image-captioning
Image Captioning
Image Captioning using InceptionV3 and beam search
Stars: ✭ 290 (-5.23%)
Mutual labels:  image-captioning
RSTNet
RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words (CVPR 2021)
Stars: ✭ 71 (-76.8%)
Mutual labels:  image-captioning
stylenet
A PyTorch implementation of "StyleNet: Generating Attractive Visual Captions with Styles"
Stars: ✭ 58 (-81.05%)
Mutual labels:  image-captioning
Adaptive
PyTorch implementation of "Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning"
Stars: ✭ 97 (-68.3%)
Mutual labels:  image-captioning
Image-Captioning
Image Captioning with Keras
Stars: ✭ 60 (-80.39%)
Mutual labels:  image-captioning
image-captioning-DLCT
Official PyTorch implementation of the paper "Dual-Level Collaborative Transformer for Image Captioning" (AAAI 2021).
Stars: ✭ 134 (-56.21%)
Mutual labels:  image-captioning
gramtion
Twitter bot for generating photo descriptions (alt text)
Stars: ✭ 21 (-93.14%)
Mutual labels:  image-captioning
im2p
TensorFlow implementation of the paper "A Hierarchical Approach for Generating Descriptive Image Paragraphs"
Stars: ✭ 43 (-85.95%)
Mutual labels:  image-captioning
pix2code-pytorch
PyTorch implementation of pix2code. 🔥
Stars: ✭ 24 (-92.16%)
Mutual labels:  image-captioning
Image-Caption
Using an LSTM or a Transformer to solve image captioning in PyTorch
Stars: ✭ 36 (-88.24%)
Mutual labels:  image-captioning
Adaptiveattention
Implementation of "Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning"
Stars: ✭ 303 (-0.98%)
Mutual labels:  image-captioning
cvpr18-caption-eval
Learning to Evaluate Image Captioning. CVPR 2018
Stars: ✭ 79 (-74.18%)
Mutual labels:  image-captioning
CS231n
My solutions for Assignments of CS231n: Convolutional Neural Networks for Visual Recognition
Stars: ✭ 30 (-90.2%)
Mutual labels:  image-captioning

Introduction

This is the Stacked Cross Attention Network, the source code of "Stacked Cross Attention for Image-Text Matching" (project page) from Microsoft AI and Research. The paper appeared in ECCV 2018. It is built on top of VSE++ in PyTorch.

Requirements and Installation

We recommend the following dependencies. In particular, NLTK's Punkt sentence tokenizer is required and can be downloaded with:

import nltk
nltk.download('punkt')  # non-interactive equivalent of entering "d punkt" at the downloader prompt

Download data

Download the dataset files and pre-trained models. We use the splits produced by Andrej Karpathy. The raw images can be downloaded from their original sources here, here, and here.

The precomputed image features of MS-COCO are from here. The precomputed image features of Flickr30K are extracted from the raw Flickr30K images using the bottom-up attention model from here. All the data needed for reproducing the experiments in the paper, including image features and vocabularies, can be downloaded from:

wget https://iudata.blob.core.windows.net/scan/data.zip
wget https://iudata.blob.core.windows.net/scan/vocab.zip
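
After downloading, extract both archives. For example (the target directories below are illustrative, not prescribed by the repository):

unzip data.zip -d data    # the extracted directory becomes $DATA_PATH
unzip vocab.zip           # vocabulary files are expected under ./vocab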

We refer to the directory of files extracted from data.zip as $DATA_PATH and expect the files from vocab.zip under the ./vocab directory. Alternatively, you can run vocab.py to produce the vocabulary files yourself. For example,

python vocab.py --data_path data --data_name f30k_precomp
python vocab.py --data_path data --data_name coco_precomp
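
Either way, the result is a pickled Vocabulary object per dataset. A minimal loading sketch, assuming the <data_name>_vocab.pkl naming convention inherited from VSE++:

import pickle
from vocab import Vocabulary  # imported so pickle can resolve the class

# Assumed location and naming: ./vocab/<data_name>_vocab.pkl
with open('./vocab/coco_precomp_vocab.pkl', 'rb') as f:
    vocab = pickle.load(f)
print(len(vocab))  # vocabulary size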

Data pre-processing (Optional)

The image features of Flickr30K and MS-COCO are available in numpy array format and can be used for training directly. However, if you wish to test on another dataset, you will need to start from scratch:

  1. Use bottom-up-attention/tools/generate_tsv.py with the bottom-up attention model to extract features of image regions. The output format is a TSV whose columns are ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features'] (a sketch for reading this format back follows the list).
  2. Use util/convert_data.py to convert the above output to a numpy array.
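
For reference, here is a minimal sketch of parsing that TSV, assuming the bottom-up-attention convention that the boxes and features columns are base64-encoded float32 buffers (the input filename is illustrative):

import base64
import csv
import sys

import numpy as np

csv.field_size_limit(sys.maxsize)  # the feature fields are far longer than the default limit
FIELDNAMES = ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features']

with open('regions.tsv') as f:  # illustrative filename
    for item in csv.DictReader(f, delimiter='\t', fieldnames=FIELDNAMES):
        num_boxes = int(item['num_boxes'])
        # Decode the base64 payloads into per-region arrays:
        # features has shape (num_boxes, feature_dim), boxes has shape (num_boxes, 4)
        features = np.frombuffer(base64.b64decode(item['features']),
                                 dtype=np.float32).reshape(num_boxes, -1)
        boxes = np.frombuffer(base64.b64decode(item['boxes']),
                              dtype=np.float32).reshape(num_boxes, 4)

util/convert_data.py performs essentially this conversion to produce the numpy arrays used for training.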

Training new models

Run train.py:

python train.py --data_path "$DATA_PATH" --data_name coco_precomp --vocab_path "$VOCAB_PATH" --logger_name runs/coco_scan/log --model_name runs/coco_scan/log --max_violation --bi_gru

Arguments used to train Flickr30K models:

Method       | Arguments
SCAN t-i LSE | --max_violation --bi_gru --agg_func=LogSumExp --cross_attn=t2i --lambda_lse=6 --lambda_softmax=9
SCAN t-i AVG | --max_violation --bi_gru --agg_func=Mean --cross_attn=t2i --lambda_softmax=9
SCAN i-t LSE | --max_violation --bi_gru --agg_func=LogSumExp --cross_attn=i2t --lambda_lse=5 --lambda_softmax=4
SCAN i-t AVG | --max_violation --bi_gru --agg_func=Mean --cross_attn=i2t --lambda_softmax=4
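
For example, combining the base train.py invocation above with the SCAN t-i AVG row gives (the run directory names are illustrative):

python train.py --data_path "$DATA_PATH" --data_name f30k_precomp --vocab_path "$VOCAB_PATH" --logger_name runs/f30k_scan/log --model_name runs/f30k_scan/log --max_violation --bi_gru --agg_func=Mean --cross_attn=t2i --lambda_softmax=9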

Arguments used to train MS-COCO models:

Method       | Arguments
SCAN t-i LSE | --max_violation --bi_gru --agg_func=LogSumExp --cross_attn=t2i --lambda_lse=6 --lambda_softmax=9 --num_epochs=20 --lr_update=10 --learning_rate=.0005
SCAN t-i AVG | --max_violation --bi_gru --agg_func=Mean --cross_attn=t2i --lambda_softmax=9 --num_epochs=20 --lr_update=10 --learning_rate=.0005
SCAN i-t LSE | --max_violation --bi_gru --agg_func=LogSumExp --cross_attn=i2t --lambda_lse=20 --lambda_softmax=4 --num_epochs=20 --lr_update=10 --learning_rate=.0005
SCAN i-t AVG | --max_violation --bi_gru --agg_func=Mean --cross_attn=i2t --lambda_softmax=4 --num_epochs=20 --lr_update=10 --learning_rate=.0005
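
The MS-COCO settings additionally pin the training schedule with --num_epochs, --lr_update, and --learning_rate. For example, SCAN i-t AVG (run directory names again illustrative):

python train.py --data_path "$DATA_PATH" --data_name coco_precomp --vocab_path "$VOCAB_PATH" --logger_name runs/coco_scan/log --model_name runs/coco_scan/log --max_violation --bi_gru --agg_func=Mean --cross_attn=i2t --lambda_softmax=4 --num_epochs=20 --lr_update=10 --learning_rate=.0005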

Evaluate trained models

Run the following in Python, substituting the real checkpoint and data paths:

from vocab import Vocabulary
import evaluation
# Replace $RUN_PATH and $DATA_PATH with the actual paths on your machine.
evaluation.evalrank("$RUN_PATH/coco_scan/model_best.pth.tar", data_path="$DATA_PATH", split="test")

To do cross-validation on MS-COCO, pass fold5=True with a model trained using --data_name coco_precomp, as sketched below.
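
A minimal sketch of the 5-fold call, assuming the testall split name for the full 5K test set from the VSE++ code this repository builds on:

from vocab import Vocabulary
import evaluation
# Averages results over five 1,000-image folds of the 5K MS-COCO test set
evaluation.evalrank("$RUN_PATH/coco_scan/model_best.pth.tar", data_path="$DATA_PATH", split="testall", fold5=True)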

Reference

If you find this code useful, please cite the following paper:

@article{lee2018stacked,
  title={Stacked Cross Attention for Image-Text Matching},
  author={Lee, Kuang-Huei and Chen, Xi and Hua, Gang and Hu, Houdong and He, Xiaodong},
  journal={arXiv preprint arXiv:1803.08024},
  year={2018}
}

License

Apache License 2.0

Acknowledgments

The authors would like to thank Po-Sen Huang and Yokesh Kumar for their help with the manuscript. We also thank Li Huang, Arun Sacheti, and the Bing Multimedia team for supporting this work.
