
zhangxuying1004 / RSTNet

License: BSD-3-Clause
RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words (CVPR 2021)

Programming Languages

python

Projects that are alternatives to or similar to RSTNet

Sightseq
Computer vision tools for fairseq, containing PyTorch implementation of text recognition and object detection
Stars: ✭ 116 (+63.38%)
Mutual labels:  transformer, image-captioning
Image-Caption
Using LSTM or Transformer to solve Image Captioning in Pytorch
Stars: ✭ 36 (-49.3%)
Mutual labels:  transformer, image-captioning
Awesome-low-level-vision-resources
A curated list of resources for Low-level Vision Tasks
Stars: ✭ 35 (-50.7%)
Mutual labels:  transformer, cvpr2021
Omninet
Official Pytorch implementation of "OmniNet: A unified architecture for multi-modal multi-task learning" | Authors: Subhojeet Pramanik, Priyanka Agrawal, Aman Hussain
Stars: ✭ 448 (+530.99%)
Mutual labels:  transformer, image-captioning
Fairseq Image Captioning
Transformer-based image captioning extension for pytorch/fairseq
Stars: ✭ 180 (+153.52%)
Mutual labels:  transformer, image-captioning
Meshed Memory Transformer
Meshed-Memory Transformer for Image Captioning. CVPR 2020
Stars: ✭ 230 (+223.94%)
Mutual labels:  transformer, image-captioning
catr
Image Captioning Using Transformer
Stars: ✭ 206 (+190.14%)
Mutual labels:  transformer, image-captioning
Show-Attend-and-Tell
A PyTorch implementation of the paper Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Stars: ✭ 58 (-18.31%)
Mutual labels:  image-captioning
AODA
Official implementation of "Adversarial Open Domain Adaptation for Sketch-to-Photo Synthesis"(WACV 2022/CVPRW 2021)
Stars: ✭ 44 (-38.03%)
Mutual labels:  cvpr2021
Adaptive
Pytorch Implementation of Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning
Stars: ✭ 97 (+36.62%)
Mutual labels:  image-captioning
MiVOS
[CVPR 2021] Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion. Semi-supervised VOS as well!
Stars: ✭ 302 (+325.35%)
Mutual labels:  cvpr2021
OverlapPredator
[CVPR 2021, Oral] PREDATOR: Registration of 3D Point Clouds with Low Overlap.
Stars: ✭ 293 (+312.68%)
Mutual labels:  transformer
Awesome-Captioning
A curated list of multimodal captioning related research (including image captioning, video captioning, and text captioning)
Stars: ✭ 56 (-21.13%)
Mutual labels:  image-captioning
visualization
a collection of visualization function
Stars: ✭ 189 (+166.2%)
Mutual labels:  transformer
transformer-models
Deep Learning Transformer models in MATLAB
Stars: ✭ 90 (+26.76%)
Mutual labels:  transformer
laravel5-hal-json
Laravel 5 HAL+JSON API Transformer Package
Stars: ✭ 15 (-78.87%)
Mutual labels:  transformer
deformer
[ACL 2020] DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering
Stars: ✭ 111 (+56.34%)
Mutual labels:  transformer
text-style-transfer-benchmark
Text style transfer benchmark
Stars: ✭ 56 (-21.13%)
Mutual labels:  transformer
LaTeX-OCR
pix2tex: Using a ViT to convert images of equations into LaTeX code.
Stars: ✭ 1,566 (+2105.63%)
Mutual labels:  transformer
Image-Captioning
Image Captioning with Keras
Stars: ✭ 60 (-15.49%)
Mutual labels:  image-captioning

RSTNet: Relationship-Sensitive Transformer Network

This repository contains the reference code for the paper RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words (CVPR 2021).

Relationship-Sensitive Transformer

Tips

Sometimes I may not be able to answer issues in time.
If you are in a hurry, you can add my WeChat account zcclovelr with the remark 'RSTNet'.

Environment setup

Clone the repository and create the m2release conda environment using the environment.yml file:

conda env create -f environment.yml
conda activate m2release

Then download the spaCy English data by executing the following command:

python -m spacy download en

Note: Python 3.6 is required to run our code.

Data preparation

To run the code, annotations and visual features for the COCO dataset are needed.

First, most annotations have been prepared by [1]. Please download annotations.zip and rename the extracted folder to m2_annotations; then download image_info_test2014.json and put it into m2_annotations.

Then, visual features are computed with the code provided by [2]. To reproduce our results, please download the COCO features file X-101-features.tgz and rename the extracted folder to X101-features. Note that these visual features are large; to save storage space, you can optionally convert them to float16 by executing the following command:

python switch_datatype.py
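For reference, the conversion itself is conceptually simple. Below is a minimal sketch of the idea, assuming the extracted features are stored as one .npy file per image; the actual switch_datatype.py in this repository may use a different file layout or format.

import os
import numpy as np

src_dir = '/path/to/X101-features'        # folder of extracted float32 feature files (assumed layout)
dst_dir = '/path/to/X101-features-fp16'   # output folder for the float16 copies
os.makedirs(dst_dir, exist_ok=True)

for name in os.listdir(src_dir):
    if not name.endswith('.npy'):
        continue
    feat = np.load(os.path.join(src_dir, name))                      # float32 grid features
    np.save(os.path.join(dst_dir, name), feat.astype(np.float16))    # roughly halves the storage size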

Finally, to resolve the shape difference and match the grid feature shape with the region feature shape (50 regions), please execute the following command to reshape each grid feature map to 49 (7x7) vectors and save all visual features into a single h5py file.

python feats_process.py
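The processing step boils down to flattening each 7x7 grid into 49 feature vectors and collecting everything into one h5py file. A minimal sketch under the same assumption as above (one .npy file per image, named by image id); the real feats_process.py may differ in file naming and dataset keys:

import os
import numpy as np
import h5py

src_dir = '/path/to/X101-features-fp16'      # per-image feature files (assumed layout)
out_path = '/path/to/X101_grid_feats.hdf5'   # single output file used for training

with h5py.File(out_path, 'w') as f:
    for name in os.listdir(src_dir):
        if not name.endswith('.npy'):
            continue
        image_id = os.path.splitext(name)[0]
        feat = np.load(os.path.join(src_dir, name))            # e.g. shape (2048, 7, 7)
        feat = feat.reshape(feat.shape[0], -1).T               # -> (49, 2048): one vector per grid cell
        f.create_dataset('%s_grids' % image_id, data=feat)     # dataset naming here is illustrative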

For convenience, you can also use my processed offline image features COCO-X-101-grid.hdf5 (extraction code: wsvg) and my processed online image features X101_grid_feats_coco_test.hdf5 (extraction code: qzwm).
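If you download a processed file, you can quickly sanity-check its contents with h5py before training, for example:

import h5py

with h5py.File('/path/to/COCO-X-101-grid.hdf5', 'r') as f:
    keys = list(f.keys())
    print('number of datasets:', len(keys))
    print('example:', keys[0], f[keys[0]].shape, f[keys[0]].dtype)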

In addition, if you want to extract grid features for your own image dataset, you can refer to the grid-feats-vqa code.

Training procedure

Run python train_language.py and python train_transformer.py in sequence using the following arguments:

Argument Possible values
--exp_name Experiment name
--batch_size Batch size (default: 10)
--workers Number of workers; a larger value accelerates model training in the XE stage.
--head Number of heads (default: 8)
--resume_last If used, the training will be resumed from the last checkpoint.
--resume_best If used, the training will be resumed from the best checkpoint.
--features_path Path to visual features file (h5py)
--annotation_folder Path to m2_annotations

For example, to train our BERT-based language model with the parameters used in our experiments, use

python train_language.py --exp_name bert_language --batch_size 50 --features_path /path/to/features --annotation_folder /path/to/annotations

To train our RSTNet model with the parameters used in our experiments, use

python train_transformer.py --exp_name rstnet --batch_size 50 --m 40 --head 8 --features_path /path/to/features --annotation_folder /path/to/annotations

The figure below shows how the CIDEr score changes during the training of RSTNet. You can also inspect the training details by loading the TensorBoard files in tensorboard_logs.

[Figure: CIDEr score changes during the training of RSTNet]
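To browse those curves yourself, you can point TensorBoard at the log directory (assuming TensorBoard is installed in the m2release environment):

tensorboard --logdir tensorboard_logs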

Evaluation

Run python test_transformer.py to evaluate RSTNet, or python test_language.py to evaluate the language model, using the following arguments (an example invocation is shown after the argument list):

Argument Possible values
--batch_size Batch size (default: 10)
--workers Number of workers (default: 0)
--features_path Path to visual features file (h5py)
--annotation_folder Path to m2_annotations
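
For example, an evaluation run with the same feature and annotation paths as in training might look like the following; adjust the paths and arguments to your setup:

python test_transformer.py --batch_size 10 --features_path /path/to/features --annotation_folder /path/to/annotations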

Note that you can also download our pretrained model files from the Pre-trained_Models folder to reproduce our reported results. The results of offline evaluation (on the Karpathy test split of MS COCO) are as follows:

[Table: offline evaluation results on the Karpathy test split of MS COCO]

References

[1] Cornia, M., Stefanini, M., Baraldi, L., & Cucchiara, R. (2020). Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[2] Jiang, H., Misra, I., Rohrbach, M., Learned-Miller, E., & Chen, X. (2020). In defense of grid features for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Acknowledgements

Thanks to Cornia et al. for their open-source code meshed-memory-transformer, on which our implementation is based.
Thanks to Jiang et al. for their significant findings on visual representations [2], which gave us a lot of inspiration.
