
Show, Control and Tell

This repository contains the reference code for the paper Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions (CVPR 2019).

Please cite with the following BibTeX:

@inproceedings{cornia2019show,
  title={{Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions}},
  author={Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2019}
}

sample results

Environment setup

Clone the repository and create the sct conda environment using the conda.yml file:

conda env create -f conda.yml
conda activate sct

Our code is based on SpeakSee, a Python package developed by us that provides utilities for working with visual-semantic data. The conda environment we provide already includes a beta version of this package.
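
As a quick, unofficial sanity check, you can verify that the environment resolves correctly. This is a minimal sketch assuming conda.yml installs PyTorch and that SpeakSee is importable as speaksee:

# Hypothetical sanity check for the sct environment (not part of the official setup).
# Assumes conda.yml installs PyTorch and a beta of SpeakSee importable as 'speaksee'.
import torch
import speaksee

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("speaksee:", getattr(speaksee, "__version__", "beta version"))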

Data preparation

COCO Entities

Download the annotations and metadata file dataset_coco.tgz (~85.6 MB) and extract it in the code folder using tar -xzvf dataset_coco.tgz.

Download the pre-computed features file coco_detections.hdf5 (~53.5 GB) and place it under the datasets/coco folder, which gets created after decompressing the annotation file.
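
Before running the evaluation scripts, you may want to check that the features file downloaded correctly and is readable. This is a minimal sketch assuming the file sits under datasets/coco; the internal key layout of the HDF5 file is not documented here, so the snippet only lists a few dataset keys:

# Minimal sketch to check coco_detections.hdf5 (path assumed to be datasets/coco/).
# The internal key layout is an assumption; inspect the printed keys yourself.
import h5py

with h5py.File("datasets/coco/coco_detections.hdf5", "r") as f:
    keys = list(f.keys())
    print("number of datasets:", len(keys))
    print("first keys:", keys[:5])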

Flickr30k Entities

As before, download the annotations and metadata file dataset_flickr.tgz (~32.8 MB) and extract it in the code folder using tar -xzvf dataset_flickr.tgz.

Download the pre-computed features file flickr30k_detections.hdf5 (~13.1 GB) and place it under the datasets/flickr folder, which gets created after decompressing the annotation file.

Download from Google Drive

A copy of all files is also available at this Google Drive folder.

Evaluation

To reproduce the results in the paper, download the pretrained model file saved_models.tgz (~4 GB) and extract it in the code folder with tar -xzvf saved_models.tgz.

Sequence controllability

Run python test_region_sequence.py using the following arguments:

Argument Possible values
--dataset coco, flickr
--exp_name ours, ours_without_visual_sentinel, ours_with_single_sentinel
--sample_rl If used, tests the model with CIDEr optimization
--sample_rl_nw If used, tests the model with CIDEr + NW optimization
--batch_size Batch size (default: 16)
--nb_workers Number of workers (default: 0)

For example, to reproduce the results of our full model trained on COCO-Entities with CIDEr+NW optimization (Table 2, bottom right), use:

python test_region_sequence.py --dataset coco --exp_name ours --sample_rl_nw  

Set controllability

Run python test_region_set.py using the following arguments:

Argument Possible values
--dataset coco, flickr
--exp_name ours, ours_without_visual_sentinel, ours_with_single_sentinel
--sample_rl If used, tests the model with CIDEr optimization
--sample_rl_nw If used, tests the model with CIDEr + NW optimization
--batch_size Batch size (default: 16)
--nb_workers Number of workers (default: 0)

For example, to reproduce the results of our full model trained on COCO-Entities with CIDEr+NW optimization (Table 4, bottom row), use:

python test_region_set.py --dataset coco --exp_name ours --sample_rl_nw  

Expected output

Under logs/, you may also find the expected output of all experiments.

Training procedure

Run python train.py using the following arguments:

Argument Possible values
--exp_name Experiment name
--batch_size Batch size (default: 100)
--lr Initial learning rate (default: 5e-4)
--nb_workers Number of workers (default: 0)
--sample_rl If used, the model will be trained with CIDEr optimization
--sample_rl_nw If used, the model will be trained with CIDEr + NW optimization

For example, to train the model with cross entropy, use:

python train.py --exp_name show_control_and_tell --batch_size 100 --lr 5e-4 

To train the model with CIDEr optimization (after training the model with cross entropy), use:

python train.py --exp_name show_control_and_tell --batch_size 100 --lr 5e-5 --sample_rl

To train the model with CIDEr + NW optimization (after training the model with cross entropy), use:

python train.py --exp_name show_control_and_tell --batch_size 100 --lr 5e-5 --sample_rl_nw

Note: the current training code only supports the use of the COCO Entities dataset.

model

COCO Entities

If you want to use only the annotations of our COCO Entities dataset, you can download the annotation file coco_entities_release.json (~403 MB).

The annotation file contains a Python dictionary structured as follows:

coco_entities_release.json
 └── <id_image>
      └── <caption>
           └── 'det_sequences'
           └── 'noun_chunks'
           └── 'detections'
           └── 'split'

In detail, for each image-caption pair we provide the following information:

  • det_sequences, which contains a list of detection classes associated with each word of the caption (for an exact match with caption words, split the caption by spaces). None marks words that are not part of a noun chunk, while _ marks noun-chunk words for which no association with a detection in the image could be found.
  • noun_chunks, a list of tuples representing the noun chunks of the caption that are associated with a detection in the image. Each tuple contains two elements: the noun chunk as it appears in the caption and the detection class associated with it.
  • detections, a dictionary with one entry for each detection class associated with at least one noun chunk in the caption. For each detection class, it provides a list of tuples representing the image regions detected by a Faster R-CNN re-trained on Visual Genome [1] and corresponding to that class. Each tuple is composed of the detection id and the corresponding bounding box in the form [x1, y1, x2, y2]. The detection id can be used to recover the detection feature vector from the pre-computed features file coco_detections.hdf5 (~53.5 GB). See the demo section below for more details.
  • split, which indicates the dataset split of that sample (train, val or test), following the COCO splits provided by [2].

Note that this annotation file includes all image-caption pairs for which at least one noun chunk-detection association has been found. However, during the validation and testing phases of our controllable captioning model, we dropped all captions with empty region sets (i.e. those captions with at least one _ in the det_sequences field).
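
As a minimal sketch of how these fields can be read (the official example is in the demo notebook below), the following loads the annotation file and prints the fields of one arbitrary image-caption pair:

# Minimal sketch for reading coco_entities_release.json; see the demo notebook for the official example.
import json

with open("coco_entities_release.json", "r") as f:
    entities = json.load(f)

# Pick an arbitrary image id and one of its captions.
image_id = next(iter(entities))
caption, ann = next(iter(entities[image_id].items()))

print("image id:", image_id)
print("caption:", caption)
print("split:", ann["split"])
print("noun chunks:", ann["noun_chunks"])      # [[chunk text, detection class], ...]
print("det sequence:", ann["det_sequences"])   # one class, None or '_' per caption word
for det_class, regions in ann["detections"].items():
    # Each region is [detection id, [x1, y1, x2, y2]]; the id indexes the pre-computed features.
    print(det_class, "->", len(regions), "region(s), first:", regions[0])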

coco entities

By downloading the dataset, you declare that you will use it for research and educational purposes only; any commercial use is prohibited.

Demo

An example of how to use the COCO Entities annotations can be found in the coco_entities_demo.ipynb file.

References

[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[2] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

Contact

If you have any questions about our work, please use the public issues section of this GitHub repo. Alternatively, drop us an e-mail at marcella.cornia [at] unimore.it or lorenzo.baraldi [at] unimore.it.