
njchoma / Transformer_image_caption

Licence: MIT
Image Captioning based on Bottom-Up and Top-Down Attention model

Projects that are alternatives to, or similar to, Transformer_image_caption

Linear Attention Recurrent Neural Network
A recurrent attention module consisting of an LSTM cell which can query its own past cell states by means of windowed multi-head attention. The formulas are derived from the BN-LSTM and the Transformer Network. The LARNN cell with attention can be easily used inside a loop on the cell state, just like any other RNN. (LARNN)
Stars: ✭ 119 (+26.6%)
Mutual labels:  jupyter-notebook, attention-model
Up Down Captioner
Automatic image captioning model based on Caffe, using features from bottom-up attention.
Stars: ✭ 195 (+107.45%)
Mutual labels:  jupyter-notebook, image-captioning
Image Caption Generator
A neural network to generate captions for an image using CNN and RNN with BEAM Search.
Stars: ✭ 126 (+34.04%)
Mutual labels:  image-captioning, attention-model
Image Captioning
Image Captioning using InceptionV3 and beam search
Stars: ✭ 290 (+208.51%)
Mutual labels:  jupyter-notebook, image-captioning
Show Attend And Tell
TensorFlow Implementation of "Show, Attend and Tell"
Stars: ✭ 869 (+824.47%)
Mutual labels:  jupyter-notebook, image-captioning
Bottom Up Attention
Bottom-up attention model for image captioning and VQA, based on Faster R-CNN and Visual Genome
Stars: ✭ 989 (+952.13%)
Mutual labels:  jupyter-notebook, image-captioning
Image Caption Generator
[DEPRECATED] A Neural Network based generative model for captioning images using Tensorflow
Stars: ✭ 141 (+50%)
Mutual labels:  jupyter-notebook, image-captioning
Adaptiveattention
Implementation of "Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning"
Stars: ✭ 303 (+222.34%)
Mutual labels:  jupyter-notebook, image-captioning
Cs231
Complete Assignments for CS231n: Convolutional Neural Networks for Visual Recognition
Stars: ✭ 317 (+237.23%)
Mutual labels:  jupyter-notebook, image-captioning
Neural Image Captioning
Implementation of Neural Image Captioning model using Keras with Theano backend
Stars: ✭ 12 (-87.23%)
Mutual labels:  jupyter-notebook, image-captioning
Automatic Image Captioning
Generating Captions for images using Deep Learning
Stars: ✭ 84 (-10.64%)
Mutual labels:  jupyter-notebook, image-captioning
Biggansarewatching
Authors official implementation of "Big GANs Are Watching You" pre-print
Stars: ✭ 94 (+0%)
Mutual labels:  jupyter-notebook
Tutorials
Tutorials on optimization and coding skills
Stars: ✭ 93 (-1.06%)
Mutual labels:  jupyter-notebook
Classification Of Hyperspectral Image
Classification of the Hyperspectral Image Indian Pines with Convolutional Neural Network
Stars: ✭ 93 (-1.06%)
Mutual labels:  jupyter-notebook
Satellite imagery python
Sample scripts and notebooks on processing satellite imagery
Stars: ✭ 93 (-1.06%)
Mutual labels:  jupyter-notebook
Nf Jax
Normalizing Flows in Jax
Stars: ✭ 94 (+0%)
Mutual labels:  jupyter-notebook
Story2hallucination
Stars: ✭ 91 (-3.19%)
Mutual labels:  jupyter-notebook
Doc Browser
A documentation browser with support for DevDocs, Dash and Hoogle, written in Haskell and QML
Stars: ✭ 93 (-1.06%)
Mutual labels:  jupyter-notebook
Cirrus
Serverless ML Framework
Stars: ✭ 93 (-1.06%)
Mutual labels:  jupyter-notebook
Prob mbrl
A library of probabilistic model based RL algorithms in pytorch
Stars: ✭ 93 (-1.06%)
Mutual labels:  jupyter-notebook

Image Captioning based on Bottom-Up and Top-Down Attention model

Our overall approach centers on the Bottom-Up and Top-Down Attention model designed by Anderson et al. We used this framework as a starting point for further experimentation, implementing, in addition to various hyperparameter tunings, two additional model architectures. First, we reduced the complexity of Bottom-Up and Top-Down by considering only a simple LSTM architecture. Then, taking inspiration from the Transformer architecture, we implemented a non-recurrent model which does not need to keep track of an internal state across time. Our results are comparable to the authors' implementation of the Bottom-Up and Top-Down Attention model. Our code, written in the PyTorch framework, serves as a baseline for future experiments.
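
For concreteness, the sketch below shows roughly what one decoding step of such a two-LSTM, top-down attention decoder can look like in PyTorch: an attention LSTM attends over the pre-extracted region features, and a language LSTM predicts the next word. The layer sizes, names, and exact wiring are illustrative assumptions, not the code in this repository.

    # Illustrative single decoding step of a Bottom-Up/Top-Down style decoder.
    # Assumes pre-extracted region features of shape (batch, 36, 2048);
    # all dimensions and module names below are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopDownDecoderStep(nn.Module):
        def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
            super(TopDownDecoderStep, self).__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            # Attention LSTM sees the language LSTM state, the mean image
            # feature and the previous word embedding.
            self.att_lstm = nn.LSTMCell(hidden_dim + feat_dim + embed_dim, hidden_dim)
            # Additive (soft) attention over the region features.
            self.att_v = nn.Linear(feat_dim, hidden_dim)
            self.att_h = nn.Linear(hidden_dim, hidden_dim)
            self.att_out = nn.Linear(hidden_dim, 1)
            # Language LSTM sees the attended feature and the attention LSTM state.
            self.lang_lstm = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
            self.logits = nn.Linear(hidden_dim, vocab_size)

        def forward(self, feats, prev_word, state):
            (h_att, c_att), (h_lang, c_lang) = state
            mean_feat = feats.mean(dim=1)
            x_att = torch.cat([h_lang, mean_feat, self.embed(prev_word)], dim=1)
            h_att, c_att = self.att_lstm(x_att, (h_att, c_att))
            scores = self.att_out(torch.tanh(self.att_v(feats) + self.att_h(h_att).unsqueeze(1)))
            alpha = F.softmax(scores, dim=1)            # (batch, 36, 1) attention weights
            attended = (alpha * feats).sum(dim=1)       # (batch, feat_dim) attended feature
            h_lang, c_lang = self.lang_lstm(torch.cat([attended, h_att], dim=1), (h_lang, c_lang))
            return self.logits(h_lang), ((h_att, c_att), (h_lang, c_lang))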

Results

Model                   | BLEU-1 | BLEU-4 | METEOR | ROUGE-L | CIDEr
Authors' implementation | 77.2   | 36.2   | 27.0   | 56.4    | 113.5
Our implementation      | 73.8   | 32.9   | 26.0   | 53.7    | 103.8
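
These scores were computed with the coco-caption toolkit (see References). The sketch below shows one way to score a set of generated captions with pycocoevalcap, assuming the captions are already tokenized and lower-cased; the example ids and captions are placeholders.

    # Hedged sketch: scoring captions with the coco-caption toolkit
    # (https://github.com/tylin/coco-caption). Both dicts map an image id to a
    # list of caption strings. METEOR additionally requires Java.
    from pycocoevalcap.bleu.bleu import Bleu
    from pycocoevalcap.meteor.meteor import Meteor
    from pycocoevalcap.rouge.rouge import Rouge
    from pycocoevalcap.cider.cider import Cider

    def score_captions(gts, res):
        scorers = [(Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
                   (Meteor(), "METEOR"),
                   (Rouge(), "ROUGE-L"),
                   (Cider(), "CIDEr")]
        results = {}
        for scorer, names in scorers:
            score, _ = scorer.compute_score(gts, res)
            if isinstance(names, list):
                results.update(dict(zip(names, score)))
            else:
                results[names] = score
        return results

    # Example with placeholder captions:
    # gts = {1: ["a dog runs on the grass", "a brown dog running"]}   # references
    # res = {1: ["a dog running on grass"]}                           # generated
    # print(score_captions(gts, res))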

Our implementation could likely be improved with further hyperparameter tuning and with tricks such as gradient clipping or replacing tanh with ReLU. We hope our model serves as a baseline for future experiments.

We train with the Adam optimizer (learning rate 0.0001), teacher forcing, and a batch size of 100. Training completed on Nvidia P40 GPUs in approximately 8 GPU hours, significantly less than the authors' 18 GPU hours on Titan X GPUs. During testing, a beam width of 5 was found to be the most effective.
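
As an illustration of this setup, the sketch below shows a single training step with Adam (learning rate 1e-4) and teacher forcing, assuming a decoder step like the one sketched earlier; variable names and shapes are assumptions, not the repository's training loop.

    # Illustrative training step: teacher forcing feeds the ground-truth
    # previous word to the decoder at every time step.
    import torch
    import torch.nn as nn

    def train_step(decoder, optimizer, feats, captions, init_state):
        # feats: (batch, 36, 2048) region features; captions: (batch, T) word ids.
        criterion = nn.CrossEntropyLoss()
        state = init_state
        loss = 0.0
        for t in range(captions.size(1) - 1):
            logits, state = decoder(feats, captions[:, t], state)  # teacher forcing
            loss = loss + criterion(logits, captions[:, t + 1])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)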

Ablation study:

Model                       | BLEU-1 | BLEU-4 | METEOR | ROUGE-L | CIDEr
Bottom-Up Top-Down          | 73.8   | 32.9   | 26.0   | 53.7    | 103.8
w/o attention (Simple LSTM) | 67.0   | 24.9   | 21.9   | 49.4    | 77.6
w/o Beam search             | 74.2   | 31.3   | 25.9   | 54.0    | 102.4
Teacher forcing (p=0.5)     | 74.0   | 31.4   | 25.1   | 53.5    | 100.0
Transformer-inspired        | 57.4   | 9.7    | 15.2   | 41.2    | 42.5
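
Beam search (width 5) is what the full model uses at test time; the "w/o Beam search" row drops it. The sketch below is a simplified beam search for one image, assuming a step_fn that returns a 1-D tensor of log-probabilities over the vocabulary plus a new decoder state; it omits details such as length normalisation and is not the repository's src/evaluate_test.py.

    # Simplified beam search decoding for one image. Each beam is a tuple of
    # (cumulative log-prob, word id sequence, decoder state).
    def beam_search(step_fn, init_state, bos, eos, beam_width=5, max_len=20):
        beams = [(0.0, [bos], init_state)]
        for _ in range(max_len):
            candidates = []
            for logp, seq, state in beams:
                if seq[-1] == eos:                     # finished caption: keep as-is
                    candidates.append((logp, seq, state))
                    continue
                log_probs, new_state = step_fn(seq[-1], state)   # (vocab_size,)
                top_lp, top_ix = log_probs.topk(beam_width)
                for lp, ix in zip(top_lp.tolist(), top_ix.tolist()):
                    candidates.append((logp + lp, seq + [ix], new_state))
            # Keep only the best beam_width partial captions.
            beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:beam_width]
        return max(beams, key=lambda b: b[0])[1]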

Getting Started

Machine configuration used for testing: an Nvidia P40 GPU with 24 GB of memory (though a machine with less memory should work just fine).

We use the Karpathy splits as described in "Deep Visual-Semantic Alignments for Generating Image Descriptions". The bottom-up image features are used directly from here; please refer to this repo for clarifications. The annotations are downloaded from the COCO website (2014 train/val annotations). All models have been trained from scratch.
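
The bottom-up features are distributed as TSV files and converted to HDF5 (see References). The sketch below shows one way to read such a file with h5py; the file name and the "image_features" dataset key are assumptions and may not match the keys used in this repository.

    # Hedged sketch: reading pre-extracted bottom-up region features from HDF5.
    # File name and dataset key are placeholders.
    import h5py
    import numpy as np

    with h5py.File("train36.hdf5", "r") as f:
        features = f["image_features"]       # e.g. shape (num_images, 36, 2048)
        first = np.array(features[0])        # features for the first image
        print(first.shape)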

The code takes around 8 hours to train on the Karpathy train split.

Prerequisites

What you need to install and how to install it:

Software used:

  1. PyTorch 0.4.1
  2. Python 2.7

Dependencies: Create a conda environment using the captioning_env.yml file. Use: conda env create -f captioning_env.yml

If you are not using conda as a package manager, refer to the yml file and install the libraries manually.

Running the code

  1. Data
  2. Training
  • Edit the main.sh file by changing the path variables and activating the appropriate conda environment, then run the script with the appropriate arguments. The arguments are listed in the src/utils_experiment.py file.
  3. Evaluation
  • After the model has been trained, run src/evaluate_test.py.

License

This project is licensed under the MIT License - see the LICENSE file for details

Contributors

  1. Nicholas Choma (New York University)
  2. Omkar Damle (New York University)

Note: Equal contribution from both contributors.

This code was produced as part of my course project at New York University. I would like to thank Prof. Fergus for his guidance and for providing access to the GPUs.

References:

  1. Code for metrics evaluation was borrowed from https://github.com/tylin/coco-caption
  2. Code for converting the image features from tsv to hdf5 is based on https://github.com/hengyuan-hu/bottom-up-attention-vqa