
MILVLG / rosita

License: Apache-2.0
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration

Programming Languages

python

Projects that are alternatives to or similar to rosita

just-ask
[TPAMI Special Issue on ICCV 2021 Best Papers, Oral] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
Stars: ✭ 57 (+58.33%)
Mutual labels:  vqa, vision-and-language, pre-training
iMIX
A framework for Multimodal Intelligence research from Inspur HSSLAB.
Stars: ✭ 21 (-41.67%)
Mutual labels:  vqa, vision-and-language
pytorch violet
A PyTorch implementation of VIOLET
Stars: ✭ 119 (+230.56%)
Mutual labels:  vision-and-language, pre-training
calvin
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Stars: ✭ 105 (+191.67%)
Mutual labels:  vision-and-language
awesome-graph-self-supervised-learning
Awesome Graph Self-Supervised Learning
Stars: ✭ 805 (+2136.11%)
Mutual labels:  pre-training
VarCLR
VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning
Stars: ✭ 30 (-16.67%)
Mutual labels:  pre-training
FigureQA-baseline
TensorFlow implementation of the CNN-LSTM, Relation Network and text-only baselines for the paper "FigureQA: An Annotated Figure Dataset for Visual Reasoning"
Stars: ✭ 28 (-22.22%)
Mutual labels:  vqa
Transformer-MM-Explainability
[ICCV 2021- Oral] Official PyTorch implementation for Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, a novel method to visualize any Transformer-based network. Including examples for DETR, VQA.
Stars: ✭ 484 (+1244.44%)
Mutual labels:  vqa
robo-vln
Pytorch code for ICRA'21 paper: "Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation"
Stars: ✭ 34 (-5.56%)
Mutual labels:  vision-and-language
stanford-cs231n-assignments-2020
This repository contains my solutions to the assignments for Stanford's CS231n "Convolutional Neural Networks for Visual Recognition" (Spring 2020).
Stars: ✭ 84 (+133.33%)
Mutual labels:  vision-and-language
AoA-pytorch
A Pytorch implementation of Attention on Attention module (both self and guided variants), for Visual Question Answering
Stars: ✭ 33 (-8.33%)
Mutual labels:  vqa
probnmn-clevr
Code for ICML 2019 paper "Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering" [long-oral]
Stars: ✭ 63 (+75%)
Mutual labels:  vqa
wikiHow paper list
A paper list of research conducted based on wikiHow
Stars: ✭ 25 (-30.56%)
Mutual labels:  vision-and-language
synse-zsl
Official PyTorch code for the ICIP 2021 paper 'Syntactically Guided Generative Embeddings For Zero Shot Skeleton Action Recognition'
Stars: ✭ 14 (-61.11%)
Mutual labels:  vision-and-language
TRAR-VQA
[ICCV 2021] TRAR: Routing the Attention Spans in Transformers for Visual Question Answering -- Official Implementation
Stars: ✭ 49 (+36.11%)
Mutual labels:  vision-and-language
X-VLM
X-VLM: Multi-Grained Vision Language Pre-Training (ICML 2022)
Stars: ✭ 283 (+686.11%)
Mutual labels:  vision-and-language
VidSitu
[CVPR21] Visual Semantic Role Labeling for Video Understanding (https://arxiv.org/abs/2104.00990)
Stars: ✭ 41 (+13.89%)
Mutual labels:  vision-and-language
clip playground
An ever-growing playground of notebooks showcasing CLIP's impressive zero-shot capabilities
Stars: ✭ 80 (+122.22%)
Mutual labels:  vision-and-language
vqa-soft
Accompanying code for "A Simple Loss Function for Improving the Convergence and Accuracy of Visual Question Answering Models" CVPR 2017 VQA workshop paper.
Stars: ✭ 14 (-61.11%)
Mutual labels:  vqa
DVQA dataset
DVQA Dataset: A Bar chart question answering dataset presented at CVPR 2018
Stars: ✭ 20 (-44.44%)
Mutual labels:  vqa

ROSITA

News & Updates

(24/08/2021)

  • Released the demo to perform fine-grained semantic alignment using the pretrained ROSITA model.

(15/08/2021)

  • Released the basic framework for ROSITA, including the pretrained base ROSITA model, as well as the scripts to run fine-tuning and evaluation on three downstream tasks (i.e., VQA, REC, ITR) over six datasets.

Introduction

This repository contains the source code necessary to reproduce the results presented in our ACM MM paper ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration, which encodes cROSs- and InTrA-modal prior knowledge in a unified scene graph to perform knowledge-guided vision-and-language pretraining. Compared with existing counterparts, ROSITA learns better fine-grained semantic alignments across different modalities, thus improving the capability of the pretrained model.
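
As a rough illustration of the idea (and only that: the snippet below is a toy sketch with made-up names, not ROSITA's actual data structure), the unified scene graph can be thought of as a single graph whose nodes are detected image regions and sentence words, with intra-modal edges linking nodes of the same modality and cross-modal edges marking region-word alignments:

# Toy sketch of a unified scene graph coupling image regions and text tokens.
# Illustrative only; ROSITA's actual implementation differs.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    modality: str   # "image" or "text"
    label: str      # object class for a region, token string for a word

@dataclass
class UnifiedSceneGraph:
    nodes: list = field(default_factory=list)
    intra_edges: list = field(default_factory=list)   # pairs within one modality
    cross_edges: list = field(default_factory=list)   # image-text alignment pairs

    def add_node(self, modality, label):
        node = Node(len(self.nodes), modality, label)
        self.nodes.append(node)
        return node.node_id

    def connect(self, a, b):
        if self.nodes[a].modality == self.nodes[b].modality:
            self.intra_edges.append((a, b))
        else:
            self.cross_edges.append((a, b))

# Example: "a dog chases a ball" with two detected regions.
g = UnifiedSceneGraph()
dog_region = g.add_node("image", "dog")
ball_region = g.add_node("image", "ball")
dog_word = g.add_node("text", "dog")
ball_word = g.add_node("text", "ball")
g.connect(dog_region, ball_region)   # intra-modal (image) relation
g.connect(dog_word, ball_word)       # intra-modal (text) relation
g.connect(dog_region, dog_word)      # cross-modal alignment
print(len(g.intra_edges), len(g.cross_edges))  # 2 1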

Performance

We compare ROSITA against existing state-of-the-art VLP methods on three downstream tasks. For a fair comparison, all methods use the base Transformer model. The trained checkpoints to reproduce these results are provided in finetune.md.

VQA (VQAv2):
             dev   | std
ROSITA       73.91 | 73.97
SoTA-base    73.59 | 73.67

REC:
             RefCOCO                  RefCOCO+                 RefCOCOg
             val   | testA | testB    val   | testA | testB    val   | test
ROSITA       84.79 | 87.99 | 78.28    76.06 | 82.01 | 67.40    78.23 | 78.25
SoTA-base    81.56 | 87.40 | 74.48    76.05 | 81.65 | 65.70    75.90 | 75.93

ITR (R@1 | R@5 | R@10):
             IR-COCO                  TR-COCO                  IR-Flickr                TR-Flickr
ROSITA       54.40 | 80.92 | 88.60    71.26 | 91.62 | 95.58    74.08 | 92.44 | 96.08    88.90 | 98.10 | 99.30
SoTA-base    54.00 | 80.80 | 88.50    70.00 | 91.10 | 95.50    74.74 | 92.86 | 95.82    86.60 | 97.90 | 99.20

Installation

Software and Hardware Requirements

We recommend a workstation with 4 GPUs (>= 24GB memory each, e.g., RTX 3090 or V100), 120GB of RAM, and 50GB of free disk space. We strongly recommend using an SSD drive to guarantee high-speed I/O. You should also install the following packages first:

  • Python >= 3.6
  • PyTorch >= 1.4 with CUDA >= 10.2
  • torchvision >= 0.5.0
  • Cython
# git clone
$ git clone https://github.com/MILVLG/rosita.git 

# build essential utils
$ cd rosita/rosita/utils/rec
$ python setup.py build
$ cp build/lib*/bbox.cpython*.so .
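
If the build completes, a quick sanity check such as the one below (an illustrative snippet, not part of the repository) can confirm that the installed versions meet the requirements listed above:

# Quick environment sanity check for the requirements listed above (illustrative only).
import sys
import torch
import torchvision
import Cython

print("Python      :", sys.version.split()[0])      # expect >= 3.6
print("PyTorch     :", torch.__version__)           # expect >= 1.4
print("CUDA (torch):", torch.version.cuda)          # expect >= 10.2
print("torchvision :", torchvision.__version__)     # expect >= 0.5.0
print("Cython      :", Cython.__version__)
print("GPUs visible:", torch.cuda.device_count())   # 4 GPUs recommended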

Dataset Setup

To download the datasets required to run this project, please check datasets.md for details.

Pretraining

Please check pretrain.md for details on ROSITA pretraining. We currently only provide the pretrained model for running finetuning on downstream tasks. The code to run pretraining will be released later.

Finetuning

Please check finetune.md for details on finetuning on the downstream tasks. Scripts to run finetuning on each task are provided, and we also provide trained models that can be directly evaluated to reproduce the reported results.
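
For context, the VQA numbers in the table above follow the standard VQA accuracy metric, where a predicted answer is weighted by how many of the ten human annotators gave it. The snippet below is a simplified, generic sketch of that metric (the official evaluation additionally averages over subsets of nine annotators); it is not the evaluation code shipped with this repository:

# Simplified VQA accuracy: an answer is weighted by min(#matching human answers / 3, 1).
# Generic sketch of the metric, not this repository's evaluation script.
def vqa_accuracy(predicted, human_answers):
    matches = sum(a == predicted for a in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 10 annotators, 3 of them answered "blue".
print(vqa_accuracy("blue", ["blue"] * 3 + ["green"] * 7))  # 1.0
print(vqa_accuracy("blue", ["blue"] * 2 + ["green"] * 8))  # ~0.667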

Demo

We provide Jupyter notebook scripts for reproducing the visualization results shown in our paper.
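
These visualizations essentially show how strongly each word aligns with each detected image region. As a loose illustration of that kind of plot (random scores and hypothetical region names, not the repository's notebook code), such a heatmap could be rendered with matplotlib:

# Illustrative word-to-region alignment heatmap with random scores;
# the actual demo notebooks use alignment scores from the pretrained ROSITA model.
import numpy as np
import matplotlib.pyplot as plt

words = ["a", "dog", "chases", "a", "ball"]
regions = [f"region_{i}" for i in range(6)]          # hypothetical detected regions
scores = np.random.rand(len(words), len(regions))    # stand-in for model scores

fig, ax = plt.subplots()
im = ax.imshow(scores, cmap="viridis")
ax.set_xticks(range(len(regions)))
ax.set_xticklabels(regions, rotation=45, ha="right")
ax.set_yticks(range(len(words)))
ax.set_yticklabels(words)
fig.colorbar(im, ax=ax, label="alignment score")
plt.tight_layout()
plt.show()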

Acknowledgment

We appreciate well-known open-source projects such as LXMERT, UNITER, OSCAR, and Huggingface, which helped us a lot when writing our code.

Yuhao Cui (@cuiyuhao1996) and Tong-An Luo (@Zoroaster97) are the main contributors to this repository. Please kindly contact them if you find any issues.

Citations

Please consider citing this paper if you use the code:

@inProceedings{cui2021rosita,
  title={ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration},
  author={Cui, Yuhao and Yu, Zhou and Wang, Chunqi and Zhao, Zhongzhou and Zhang, Ji and Wang, Meng and Yu, Jun},
  booktitle={Proceedings of the 29th ACM International Conference on Multimedia},
  year={2021}
}