tsujuifu / pytorch_violet

License: other
A PyTorch implementation of VIOLET

Projects that are alternatives to or similar to pytorch_violet

just-ask
[TPAMI Special Issue on ICCV 2021 Best Papers, Oral] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
Stars: ✭ 57 (-52.1%)
Mutual labels:  vision-and-language, pre-training, video-question-answering
rosita
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
Stars: ✭ 36 (-69.75%)
Mutual labels:  vision-and-language, pre-training
X-VLM
X-VLM: Multi-Grained Vision Language Pre-Training (ICML 2022)
Stars: ✭ 283 (+137.82%)
Mutual labels:  vision-and-language
lang2seg
Referring Expression Object Segmentation with Caption-Aware Consistency, BMVC 2019
Stars: ✭ 30 (-74.79%)
Mutual labels:  vision-and-language
clip playground
An ever-growing playground of notebooks showcasing CLIP's impressive zero-shot capabilities
Stars: ✭ 80 (-32.77%)
Mutual labels:  vision-and-language
TRAR-VQA
[ICCV 2021] TRAR: Routing the Attention Spans in Transformers for Visual Question Answering -- Official Implementation
Stars: ✭ 49 (-58.82%)
Mutual labels:  vision-and-language
MIA
Code for "Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations" (NeurIPS 2019)
Stars: ✭ 57 (-52.1%)
Mutual labels:  vision-and-language
VarCLR
VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning
Stars: ✭ 30 (-74.79%)
Mutual labels:  pre-training
moment detr
[NeurIPS 2021] Moment-DETR code and QVHighlights dataset
Stars: ✭ 143 (+20.17%)
Mutual labels:  video-retrieval
stanford-cs231n-assignments-2020
This repository contains my solutions to the assignments for Stanford's CS231n "Convolutional Neural Networks for Visual Recognition" (Spring 2020).
Stars: ✭ 84 (-29.41%)
Mutual labels:  vision-and-language
iMIX
A framework for Multimodal Intelligence research from Inspur HSSLAB.
Stars: ✭ 21 (-82.35%)
Mutual labels:  vision-and-language
wikiHow paper list
A paper list of research conducted based on wikiHow
Stars: ✭ 25 (-78.99%)
Mutual labels:  vision-and-language
robo-vln
Pytorch code for ICRA'21 paper: "Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation"
Stars: ✭ 34 (-71.43%)
Mutual labels:  vision-and-language
calvin
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Stars: ✭ 105 (-11.76%)
Mutual labels:  vision-and-language
Kaleido-BERT
(CVPR2021) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain.
Stars: ✭ 252 (+111.76%)
Mutual labels:  pre-training
SIGIR2021 Conure
One Person, One Model, One World: Learning Continual User Representation without Forgetting
Stars: ✭ 23 (-80.67%)
Mutual labels:  pre-training
synse-zsl
Official PyTorch code for the ICIP 2021 paper 'Syntactically Guided Generative Embeddings For Zero Shot Skeleton Action Recognition'
Stars: ✭ 14 (-88.24%)
Mutual labels:  vision-and-language
VQ-APC
Vector Quantized Autoregressive Predictive Coding (VQ-APC)
Stars: ✭ 34 (-71.43%)
Mutual labels:  pre-training
distill-and-select
Authors official PyTorch implementation of the "DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval" [IJCV 2022]
Stars: ✭ 43 (-63.87%)
Mutual labels:  video-retrieval
VidSitu
[CVPR21] Visual Semantic Role Labeling for Video Understanding (https://arxiv.org/abs/2104.00990)
Stars: ✭ 41 (-65.55%)
Mutual labels:  vision-and-language

VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling

A PyTorch implementation of VIOLET

Overview

VIOLET is an implementation of
"VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling"
by Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu

VIOLET contains 3 components: Video Swin Transformer (VT) computes video features; Language Embedder (LE) extracts word embeddings; Cross-modal Transformer (CT) performs cross-modal fusion. To benefit from large-scale data, we incorporate 3 pretraining tasks: Masked Language Modeling (MLM) predicts the masked word tokens; Masked Visual-token Modeling (MVM) recovers the masked video patches; Visual-Text Matching (VTM) learns the alignment between the video and text modalities.
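
To make the data flow concrete, here is a minimal, self-contained sketch of the three-component forward pass. It uses simple stand-ins (a linear patch embedding for VT, a vanilla transformer encoder for CT); all names and shapes are illustrative, not the repo's actual modules:

import torch
import torch.nn as nn

class VioletSketch(nn.Module):
    # Illustrative stand-ins for VIOLET's three components (not the repo's code):
    # VT -> video features, LE -> word embeddings, CT -> cross-modal fusion.
    def __init__(self, hidden=768, vocab_size=30522, patch=32):
        super().__init__()
        # VT stand-in: linear patch embedding instead of a full Video Swin Transformer
        self.vt = nn.Conv3d(3, hidden, kernel_size=(1, patch, patch), stride=(1, patch, patch))
        # LE: word-embedding lookup
        self.le = nn.Embedding(vocab_size, hidden)
        # CT stand-in: a shallow transformer encoder over the joint sequence
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12)
        self.ct = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, video, tokens):
        # video: (B, 3, T, H, W) sparse-sampled frames; tokens: (B, L) word ids
        f_v = self.vt(video).flatten(2).transpose(1, 2)   # (B, T*h*w, hidden)
        f_t = self.le(tokens)                             # (B, L, hidden)
        x = torch.cat([f_v, f_t], dim=1).transpose(0, 1)  # (S, B, hidden)
        return self.ct(x).transpose(0, 1)                 # fused cross-modal features

m = VioletSketch()
out = m(torch.randn(2, 3, 4, 224, 224), torch.randint(0, 30522, (2, 16)))
print(out.shape)  # torch.Size([2, 212, 768]) = (B, T*49 + L, hidden)

The three pretraining heads (MLM, MVM, VTM) would then sit on top of these fused features.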

Requirements

This code is implemented under Python 3.8, PyTorch 1.7, and Torchvision 0.8.

Usage

Data preprocessing

Since we rely on external datasets that we cannot redistribute, we provide preprocessing tools to extract sparse-sampled video frames into our compressed format.

cd _tools

# We use 4 frames during pretraining and 5 frames for downstream tasks
python extract_video-frame.py --path=msrvtt --sample=5 # output: msrvtt.pkl

# We use DALL-E to extract VQ tokens for MVM pretraining
wget https://cdn.openai.com/dall-e/encoder.pkl # download trained dall-e encoder
python extract_vq.py --path=msrvtt --frame=224 # output: msrvtt_vq.pkl

# We use file.seek() instead of loading the entire file, to reduce memory cost during distributed pretraining
python extract_tsv.py --path=msrvtt # output: msrvtt.tsv, msrvtt.lineidx
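
The .lineidx file produced above stores one byte offset per line of the .tsv, so a data loader can seek straight to a single sample instead of reading the whole file. A minimal sketch of this access pattern (file names follow the outputs above; the repo's actual loader may differ):

class TsvReader:
    # Illustrative random-access reader for msrvtt.tsv via msrvtt.lineidx
    def __init__(self, tsv_path, lineidx_path):
        with open(lineidx_path) as f:                 # one byte offset per .tsv line
            self.offsets = [int(line.strip()) for line in f]
        self.fp = open(tsv_path, "r")

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        self.fp.seek(self.offsets[idx])               # jump straight to the sample
        return self.fp.readline().rstrip("\n").split("\t")

reader = TsvReader("msrvtt.tsv", "msrvtt.lineidx")
print(len(reader), reader[0][0])                      # number of samples, first field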

We provide partial examples (WebVid2.5M, CC3M, TGIF-Action, MSVD-QA, and MSRVTT-Retrieval) to illustrate the expected input format.
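
For reference, extracting VQ tokens with the downloaded encoder roughly follows the usage shown in OpenAI's DALL-E repository; the sketch below assumes the dall_e package is installed, and extract_vq.py may differ in details such as preprocessing and resolution:

import torch
from dall_e import load_model, map_pixels   # pip install DALL-E

dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
enc = load_model("encoder.pkl", dev)        # the checkpoint downloaded above

# x: a batch of frames scaled to [0, 1]; 224x224 matches --frame=224 above
x = torch.rand(5, 3, 224, 224, device=dev)
z_logits = enc(map_pixels(x))               # (B, 8192, 28, 28) codebook logits
vq = torch.argmax(z_logits, dim=1)          # one discrete visual token per 8x8 patch
print(vq.shape)                             # torch.Size([5, 28, 28])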

Pretraining

Place the pretrained VT weights in ./_snapshot. The following script pretrains on both video (WebVid2.5M) and image (CC3M) data via single-node multi-GPU distributed training.

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=7122 main_pretrain.py
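
torch.distributed.launch spawns one process per GPU and passes --local_rank to each; inside main_pretrain.py the distributed setup presumably follows the standard PyTorch 1.7 pattern sketched below (illustrative, not the repo's exact code):

import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)   # set by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")   # MASTER_PORT (7122 above) comes from the env

model = torch.nn.Linear(768, 768).cuda()  # placeholder for the actual VIOLET model
model = DDP(model, device_ids=[args.local_rank])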

Here is our best pretrained checkpoint (YT180M+WebVid2.5M+CC3M).

Downstream

CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qamc.py _data/args_tgif-action.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qaoe.py _data/args_msvd-qa.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python main_retrieval.py _data/args_msrvtt-retrieval.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval.py _data/args_msrvtt-retrieval.json

We also provide all trained downstream checkpoints.
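
For the retrieval task, evaluation typically ranks all candidate videos for each query text by cross-modal similarity and reports recall@K. A minimal sketch of that metric, assuming precomputed, paired text and video embeddings (names are illustrative; eval_retrieval.py may differ):

import torch

def recall_at_k(text_emb, video_emb, ks=(1, 5, 10)):
    # text_emb, video_emb: (N, D); row i of each side forms a matched pair
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
    video_emb = torch.nn.functional.normalize(video_emb, dim=-1)
    sim = text_emb @ video_emb.T                    # (N, N) similarity matrix
    ranks = sim.argsort(dim=-1, descending=True)    # ranked video indices per text
    gt = torch.arange(len(sim)).unsqueeze(1)
    pos = (ranks == gt).nonzero()[:, 1]             # rank of the ground-truth video
    return {f"R@{k}": (pos < k).float().mean().item() for k in ks}

print(recall_at_k(torch.randn(100, 768), torch.randn(100, 768)))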

Citation

@inproceedings{fu2022empirical-mvm, 
  author = {Tsu-Jui Fu* and Linjie Li* and Zhe Gan and Kevin Lin and William Yang Wang and Lijuan Wang and Zicheng Liu}, 
  title = {An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling}, 
  booktitle = {arXiv:2209.01540}, 
  year = {2022} 
}
@inproceedings{fu2021violet, 
  author = {Tsu-Jui Fu and Linjie Li and Zhe Gan and Kevin Lin and William Yang Wang and Lijuan Wang and Zicheng Liu}, 
  title = {VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling}, 
  booktitle = {arXiv:2111.12681}, 
  year = {2021} 
}