
jayleicn / ClipBERT

License: MIT
[CVPR 2021 Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to ClipBERT

Awesome Visual Question Answering
A curated list of Visual Question Answering (VQA), including image/video question answering, Visual Question Generation, Visual Dialog, Visual Commonsense Reasoning, and related areas.
Stars: ✭ 295 (+75.6%)
Mutual labels:  vqa
Visual Question Answering
📷 ❓ Visual Question Answering Demo and Algorithmia API
Stars: ✭ 18 (-89.29%)
Mutual labels:  vqa
Vqa Tensorflow
Tensorflow Implementation of Deeper LSTM+ normalized CNN for Visual Question Answering
Stars: ✭ 98 (-41.67%)
Mutual labels:  vqa
Oscar
Oscar and VinVL
Stars: ✭ 396 (+135.71%)
Mutual labels:  vqa
Vqa.pytorch
Visual Question Answering in Pytorch
Stars: ✭ 602 (+258.33%)
Mutual labels:  vqa
Bottom Up Attention
Bottom-up attention model for image captioning and VQA, based on Faster R-CNN and Visual Genome
Stars: ✭ 989 (+488.69%)
Mutual labels:  vqa
MICCAI21 MMQ
Multiple Meta-model Quantifying for Medical Visual Question Answering
Stars: ✭ 16 (-90.48%)
Mutual labels:  vqa
Vqa Mfb
Stars: ✭ 153 (-8.93%)
Mutual labels:  vqa
Bottom Up Attention Vqa
An efficient PyTorch implementation of the winning entry of the 2017 VQA Challenge.
Stars: ✭ 667 (+297.02%)
Mutual labels:  vqa
Mullowbivqa
Hadamard Product for Low-rank Bilinear Pooling
Stars: ✭ 57 (-66.07%)
Mutual labels:  vqa
Awesome Vqa
Visual Q&A reading list
Stars: ✭ 403 (+139.88%)
Mutual labels:  vqa
Mmf
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
Stars: ✭ 4,713 (+2705.36%)
Mutual labels:  vqa
Conditional Batch Norm
Pytorch implementation of NIPS 2017 paper "Modulating early visual processing by language"
Stars: ✭ 51 (-69.64%)
Mutual labels:  vqa
Tbd Nets
PyTorch implementation of "Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning"
Stars: ✭ 345 (+105.36%)
Mutual labels:  vqa
Papers
Some CV papers I have read (generating text from images, weakly-supervised segmentation, etc.)
Stars: ✭ 99 (-41.07%)
Mutual labels:  vqa
Nscl Pytorch Release
PyTorch implementation for the Neuro-Symbolic Concept Learner (NS-CL).
Stars: ✭ 276 (+64.29%)
Mutual labels:  vqa
Vizwiz Vqa Pytorch
PyTorch VQA implementation that achieved top performances in the (ECCV18) VizWiz Grand Challenge: Answering Visual Questions from Blind People
Stars: ✭ 33 (-80.36%)
Mutual labels:  vqa
Pytorch Vqa
Strong baseline for visual question answering
Stars: ✭ 158 (-5.95%)
Mutual labels:  vqa
Vqa regat
Research Code for ICCV 2019 paper "Relation-aware Graph Attention Network for Visual Question Answering"
Stars: ✭ 129 (-23.21%)
Mutual labels:  vqa
Vqa
CloudCV Visual Question Answering Demo
Stars: ✭ 57 (-66.07%)
Mutual labels:  vqa

ClipBERT

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling, CVPR 2021, Oral.

Jie Lei*, Linjie Li*, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, Jingjing Liu

Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks. It takes raw videos/images and text as inputs, and outputs task predictions. ClipBERT is built on 2D CNNs and transformers, and uses a sparse sampling strategy to enable efficient end-to-end video-and-language learning. In this repository, we support end-to-end pretraining and finetuning for the following tasks:

  • Image-text pretraining on COCO and VG captions.
  • Text-to-video retrieval finetuning on MSRVTT, DiDeMo, and ActivityNet Captions.
  • Video-QA finetuning on TGIF-QA and MSRVTT-QA.
  • Image-QA finetuning on VQA 2.0.

It is also feasible and easy to add other image-text or video-text tasks for pretraining and finetuning.

ClipBERT was accepted as an oral paper at CVPR 2021 with 3 strong accepts. 😍

Requirements

We provide a Docker image for easier reproduction. Please install an NVIDIA GPU driver, Docker, and the NVIDIA Container Toolkit (nvidia-docker) so that containers can access the GPUs.

Our scripts require the user to be in the docker group so that Docker commands can be run without sudo. We only support Linux with NVIDIA GPUs; we have tested on Ubuntu 18.04 with V100 cards. We use mixed-precision training, hence GPUs with Tensor Cores are recommended.
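
As a quick sanity check (not part of the official scripts), you can verify the docker group requirement and GPU visibility from a shell:

# confirm you can reach the Docker daemon without sudo (fails if you are not in the docker group)
docker info > /dev/null && echo "Docker access OK"
# confirm the NVIDIA driver sees your GPUs
nvidia-smi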

Getting Started

General

  1. Create a folder that stores pretrained models, all the data, and results.

    PATH_TO_STORAGE=/path/to/your/data/
    mkdir -p $PATH_TO_STORAGE/txt_db  # annotations
    mkdir -p $PATH_TO_STORAGE/vis_db  # image and video 
    mkdir -p $PATH_TO_STORAGE/finetune  # finetuning results
    mkdir -p $PATH_TO_STORAGE/pretrained  # pretrained models
    
  2. Download pretrained models.

    Our end-to-end (e2e) pretrained ClipBERT model (849MB) can be downloaded with the following command.

    bash scripts/download_pretrained.sh $PATH_TO_STORAGE
    

    This pretrained model can be used for finetuning on video-text tasks and image-text tasks. For your convenience, this script will also download bert-base-uncased and grid-feat-vqa model weights, which are used as initialization for pretraining.

  3. Launch the Docker container for running the experiments.

    # docker image should be automatically pulled
    source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/vis_db \
        $PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained
    

    The launch script respects the $CUDA_VISIBLE_DEVICES environment variable (see the example below). Note that the source code is mounted into the container under /clipbert instead of being built into the image, so user modifications are reflected without re-building the image. (Data folders are mounted into the container separately for flexibility with folder structures.)
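
    For example, to expose only the first two GPUs to the container (the GPU indices here are illustrative):

    # outside the container
    export CUDA_VISIBLE_DEVICES=0,1
    source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/vis_db \
        $PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained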

Downstream Task Finetuning

Text-to-Video Retrieval

Tasks: MSRVTT retrieval, DiDeMo and ActivityNet Captions paragraph-to-video retrieval, MSRVTT MC Test.

  1. Download data.

    # outside the container  
    # download videos + annotations for $DSET
    bash scripts/download_$DSET.sh $PATH_TO_STORAGE
    

    $DSET can be one of msrvtt, didemo, anet.

  2. Finetuning.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_retrieval.py \
        --config $CONFIG_PATH \
        --output_dir $OUTPUT_DIR
    
    # for single GPU
    python src/tasks/run_video_retrieval.py \
        --config $CONFIG_PATH \
        --output_dir $OUTPUT_DIR
    

    $CONFIG_PATH should be set to one of the .json config files available under src/configs that contain the substring _ret. For example, you can use src/configs/msrvtt_ret_base_resnet50.json for MSRVTT retrieval.

  3. Run inference.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_retrieval.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db $TXT_DB \
      --inference_img_db $IMG_DB --inference_batch_size 64 \
      --inference_n_clips $INFERENCE_N_CLIPS
    

    $STEP is an integer that tells the script to use the checkpoint $OUTPUT_DIR/ckpt/model_step_$STEP.pt for inference. $TXT_DB and $IMG_DB are the paths to the annotation file and the video data. You can use TXT_DB=/txt/downstream/msrvtt_retrieval/msrvtt_retrieval_val.jsonl and IMG_DB=/img/msrvtt for inference on the MSRVTT retrieval val split (a filled-in example is given at the end of this subsection). The results will be written under $OUTPUT_DIR. You can use different $INFERENCE_N_CLIPS values for inference, such as 1 or 16; using more clips noticeably increases inference time and memory usage, so you may need a smaller batch size for larger values.

    After the MSRVTT retrieval model is trained, you can also use it for inference on the MSRVTT MC Test task, which is essentially a retrieval task in a multiple-choice setup.

    # inside the container
    horovodrun -np 4 python src/tasks/run_msrvtt_mc.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db /txt/downstream/msrvtt_retrieval_mc/msrvtt_retrieval_mc_test.jsonl \
      --inference_img_db /img/msrvtt --inference_batch_size 64 \
      --inference_n_clips $INFERENCE_N_CLIPS
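
    As a concrete illustration of step 3, the inference variables can be filled in with the MSRVTT values mentioned above; the checkpoint step below is only a placeholder, so substitute a step that actually exists under $OUTPUT_DIR/ckpt/.

    # inside the container: MSRVTT retrieval inference on the val split with 16 clips
    STEP=4000  # placeholder checkpoint step
    TXT_DB=/txt/downstream/msrvtt_retrieval/msrvtt_retrieval_val.jsonl
    IMG_DB=/img/msrvtt
    INFERENCE_N_CLIPS=16
    horovodrun -np 4 python src/tasks/run_video_retrieval.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db $TXT_DB \
      --inference_img_db $IMG_DB --inference_batch_size 64 \
      --inference_n_clips $INFERENCE_N_CLIPS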
    

Video Question Answering

Tasks: TGIF-QA action, transition, and frameQA tasks; MSRVTT-QA.

  1. Download data.

    # outside the container  
    # download MSRVTT videos, and QA + retrieval annotations
    bash scripts/download_msrvtt.sh $PATH_TO_STORAGE  
    # download TGIF-QA videos and annotations
    bash scripts/download_tgif_qa.sh $PATH_TO_STORAGE  
    
  2. Finetuning.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_qa.py \
        --config $CONFIG_PATH \
        --output_dir $OUTPUT_DIR
    

    $CONFIG_PATH should be set to one of the .json config files available under src/configs that contain the substring _qa (a quick way to list them is shown at the end of this subsection). For example, you can use src/configs/msrvtt_qa_base_resnet50.json for MSRVTT-QA.

  3. Run inference.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_qa.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db $TXT_DB \
      --inference_img_db $IMG_DB --inference_batch_size 64 \
      --inference_n_clips $INFERENCE_N_CLIPS
    

    $STEP is an integer that tells the script to use the checkpoint $OUTPUT_DIR/ckpt/model_step_$STEP.pt for inference. $TXT_DB and $IMG_DB are the paths to the annotation file and the video data. You can use TXT_DB=/txt/downstream/msrvtt_retrieval/msrvtt_qa_val.jsonl and IMG_DB=/img/msrvtt for inference on the MSRVTT-QA val split.

    The results will be written under $OUTPUT_DIR. You can use different $INFERENCE_N_CLIPS values for inference, such as 1 or 16; using more clips noticeably increases inference time and memory usage, so you may need a smaller batch size for larger values.
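
    If you are unsure which configs are available, a quick shell check inside the container (the repo is mounted at /clipbert) lists the video-QA configs:

    # inside the container
    ls /clipbert/src/configs/*_qa*.json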

Image Question Answering (VQA)

  1. Download data

    # outside the container
    # download COCO and VG data
    bash scripts/download_coco_vg.sh $PATH_TO_STORAGE
    # download VQA annotations
    bash scripts/download_vqa.sh $PATH_TO_STORAGE
    
  2. Finetuning

    # inside the container
    horovodrun -np 4 python src/tasks/run_vqa.py \
        --config src/configs/vqa_base_resnet50.json \
        --output_dir $OUTPUT_DIR
    
  3. Inference

    # inside the container
    horovodrun -np 4 python src/tasks/run_vqa.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db $TXT_DB \
      --inference_img_db $IMG_DB \
      --inference_batch_size 64
    

Pretraining

  1. Download data

    # outside the container
    bash scripts/download_coco_vg.sh $PATH_TO_STORAGE
    
  2. Pretraining

    #inside the container
    horovodrun -np 8 python src/pretrain/run_pretrain.py \
        --config src/configs/pretrain_indomain_base_resnet50_mlm_itm.json \
        --output_dir $OUTPUT_DIR 
    

Data Preprocessing

ClipBERT takes raw video and text as inputs, so there is no need for offline feature extraction. However, to improve data loading speed, we use LMDB to store the raw image and video files. You can use the following script to convert a list of videos with file extensions mp4 and avi into LMDB:

# outside the container
python src/preprocessing/file2lmdb.py \
    --data_root /path/to/videos \
    --lmdb_save_dir /path/to/save/lmdb \
    --ext avi mp4 \
    --file_type video 

For images, pass the appropriate file extensions to --ext and set --file_type image. Text annotation files are reorganized into .jsonl files; see the example preprocessed files downloaded by the scripts in scripts/.
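
For example, a folder of images could be converted like this (the jpg/png extensions are just placeholders; pass whatever extensions your image files actually use):

# outside the container
python src/preprocessing/file2lmdb.py \
    --data_root /path/to/images \
    --lmdb_save_dir /path/to/save/lmdb \
    --ext jpg png \
    --file_type image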

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{lei2021less,
  title={Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling},
  author={Lei, Jie and Li, Linjie and Zhou, Luowei and Gan, Zhe and Berg, Tamara L. and Bansal, Mohit and Liu, Jingjing},
  booktitle={CVPR},
  year={2021}
}

Acknowledgement

We thank Yen-Chun Chen, Ruotian Luo, and other members and interns at Microsoft Multimodal AI for their helpful discussions. We also thank the anonymous reviewers for their constructive feedback.

This code used resources from transformers, UNITER, HERO, grid-feats-vqa, SlowFast, Detectron2. The code is implemented using PyTorch, with multi-GPU support from Horovod and mixed precision support from apex. We thank the authors for open-sourcing their awesome projects.

License

MIT
