tsujuifu / pytorch_violet

License: other
A PyTorch implementation of VIOLET

Projects that are alternatives to or similar to pytorch_violet

just-ask
[TPAMI Special Issue on ICCV 2021 Best Papers, Oral] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
Stars: ✭ 57 (-52.1%)
Mutual labels:  vision-and-language, pre-training, video-question-answering
rosita
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
Stars: ✭ 36 (-69.75%)
Mutual labels:  vision-and-language, pre-training
X-VLM
X-VLM: Multi-Grained Vision Language Pre-Training (ICML 2022)
Stars: ✭ 283 (+137.82%)
Mutual labels:  vision-and-language
lang2seg
Referring Expression Object Segmentation with Caption-Aware Consistency, BMVC 2019
Stars: ✭ 30 (-74.79%)
Mutual labels:  vision-and-language
clip playground
An ever-growing playground of notebooks showcasing CLIP's impressive zero-shot capabilities
Stars: ✭ 80 (-32.77%)
Mutual labels:  vision-and-language
TRAR-VQA
[ICCV 2021] TRAR: Routing the Attention Spans in Transformers for Visual Question Answering -- Official Implementation
Stars: ✭ 49 (-58.82%)
Mutual labels:  vision-and-language
MIA
Code for "Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations" (NeurIPS 2019)
Stars: ✭ 57 (-52.1%)
Mutual labels:  vision-and-language
VarCLR
VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning
Stars: ✭ 30 (-74.79%)
Mutual labels:  pre-training
moment detr
[NeurIPS 2021] Moment-DETR code and QVHighlights dataset
Stars: ✭ 143 (+20.17%)
Mutual labels:  video-retrieval
stanford-cs231n-assignments-2020
This repository contains my solutions to the assignments for Stanford's CS231n "Convolutional Neural Networks for Visual Recognition" (Spring 2020).
Stars: ✭ 84 (-29.41%)
Mutual labels:  vision-and-language
iMIX
A framework for Multimodal Intelligence research from Inspur HSSLAB.
Stars: ✭ 21 (-82.35%)
Mutual labels:  vision-and-language
wikiHow paper list
A paper list of research conducted based on wikiHow
Stars: ✭ 25 (-78.99%)
Mutual labels:  vision-and-language
robo-vln
Pytorch code for ICRA'21 paper: "Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation"
Stars: ✭ 34 (-71.43%)
Mutual labels:  vision-and-language
calvin
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Stars: ✭ 105 (-11.76%)
Mutual labels:  vision-and-language
Kaleido-BERT
(CVPR2021) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain.
Stars: ✭ 252 (+111.76%)
Mutual labels:  pre-training
SIGIR2021 Conure
One Person, One Model, One World: Learning Continual User Representation without Forgetting
Stars: ✭ 23 (-80.67%)
Mutual labels:  pre-training
synse-zsl
Official PyTorch code for the ICIP 2021 paper 'Syntactically Guided Generative Embeddings For Zero Shot Skeleton Action Recognition'
Stars: ✭ 14 (-88.24%)
Mutual labels:  vision-and-language
VQ-APC
Vector Quantized Autoregressive Predictive Coding (VQ-APC)
Stars: ✭ 34 (-71.43%)
Mutual labels:  pre-training
distill-and-select
Authors official PyTorch implementation of the "DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval" [IJCV 2022]
Stars: ✭ 43 (-63.87%)
Mutual labels:  video-retrieval
VidSitu
[CVPR21] Visual Semantic Role Labeling for Video Understanding (https://arxiv.org/abs/2104.00990)
Stars: ✭ 41 (-65.55%)
Mutual labels:  vision-and-language

VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling

A PyTorch implementation of VIOLET

Overview

VIOLET is an implementation of
"VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling"
by Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu

VIOLET contains 3 components: Video Swin Transformer (VT) computes video features; Language Embedder (LE) extracts word embeddings; Cross-modal Transformer (CT) performs cross-modal fusion. To benefit from large-scale data, we incorporate 3 pretraining tasks: Masked Language Modeling (MLM) predicts the masked word tokens; Masked Visual-token Modeling (MVM) recovers the masked video patches; Visual-Text Matching (VTM) learns the alignment between the video and text modalities.
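
To make the data flow concrete, here is a minimal, self-contained sketch of the three-component forward pass. It uses simple stand-ins (a linear patch embedding for VT, a vanilla transformer encoder for CT); all names and shapes are illustrative, not the repo's actual modules:

import torch
import torch.nn as nn

class VioletSketch(nn.Module):
    # Illustrative stand-ins for VIOLET's three components (not the repo's code):
    # VT -> video features, LE -> word embeddings, CT -> cross-modal fusion.
    def __init__(self, hidden=768, vocab_size=30522, patch=32):
        super().__init__()
        # VT stand-in: linear patch embedding instead of a full Video Swin Transformer
        self.vt = nn.Conv3d(3, hidden, kernel_size=(1, patch, patch), stride=(1, patch, patch))
        # LE: word-embedding lookup
        self.le = nn.Embedding(vocab_size, hidden)
        # CT stand-in: a shallow transformer encoder over the joint sequence
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12)
        self.ct = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, video, tokens):
        # video: (B, 3, T, H, W) sparse-sampled frames; tokens: (B, L) word ids
        f_v = self.vt(video).flatten(2).transpose(1, 2)   # (B, T*h*w, hidden)
        f_t = self.le(tokens)                             # (B, L, hidden)
        x = torch.cat([f_v, f_t], dim=1).transpose(0, 1)  # (S, B, hidden)
        return self.ct(x).transpose(0, 1)                 # fused cross-modal features

m = VioletSketch()
out = m(torch.randn(2, 3, 4, 224, 224), torch.randint(0, 30522, (2, 16)))
print(out.shape)  # torch.Size([2, 212, 768]) = (B, T*49 + L, hidden)

The three pretraining heads (MLM, MVM, VTM) would then sit on top of these fused features.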

Requirements

This code is implemented under Python 3.8, PyTorch 1.7, and Torchvision 0.8.

Usage

Data preprocessing

Since we rely on external datasets that we cannot redistribute, we provide preprocessing tools to extract sparse-sampled video frames into our compressed format.

cd _tools

# We use 4 frames during pretraining and 5 frames for downstream tasks
python extract_video-frame.py --path=msrvtt --sample=5 # output: msrvtt.pkl

# We use DALL-E to extract VQ tokens for MVM pretraining
wget https://cdn.openai.com/dall-e/encoder.pkl # download trained dall-e encoder
python extract_vq.py --path=msrvtt --frame=224 # output: msrvtt_vq.pkl

# We use file.seek() instead of loading the entire file, to reduce memory cost during distributed pretraining
python extract_tsv.py --path=msrvtt # output: msrvtt.tsv, msrvtt.lineidx
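
The .lineidx file produced above stores one byte offset per line of the .tsv, so a data loader can seek straight to a single sample instead of reading the whole file. A minimal sketch of this access pattern (file names follow the outputs above; the repo's actual loader may differ):

class TsvReader:
    # Illustrative random-access reader for msrvtt.tsv via msrvtt.lineidx
    def __init__(self, tsv_path, lineidx_path):
        with open(lineidx_path) as f:                 # one byte offset per .tsv line
            self.offsets = [int(line.strip()) for line in f]
        self.fp = open(tsv_path, "r")

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        self.fp.seek(self.offsets[idx])               # jump straight to the sample
        return self.fp.readline().rstrip("\n").split("\t")

reader = TsvReader("msrvtt.tsv", "msrvtt.lineidx")
print(len(reader), reader[0][0])                      # number of samples, first field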

We provide partial examples (WebVid2.5M, CC3M, TGIF-Action, MSVD-QA, and MSRVTT-Retrieval) to illustrate the expected input format.
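
For reference, extracting VQ tokens with the downloaded encoder roughly follows the usage shown in OpenAI's DALL-E repository; the sketch below assumes the dall_e package is installed, and extract_vq.py may differ in details such as preprocessing and resolution:

import torch
from dall_e import load_model, map_pixels   # pip install DALL-E

dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
enc = load_model("encoder.pkl", dev)        # the checkpoint downloaded above

# x: a batch of frames scaled to [0, 1]; 224x224 matches --frame=224 above
x = torch.rand(5, 3, 224, 224, device=dev)
z_logits = enc(map_pixels(x))               # (B, 8192, 28, 28) codebook logits
vq = torch.argmax(z_logits, dim=1)          # one discrete visual token per 8x8 patch
print(vq.shape)                             # torch.Size([5, 28, 28])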

Pretraining

Place the pretrained VT weights in ./_snapshot. The following script pretrains on both video (WebVid2.5M) and image (CC3M) data via single-node multi-GPU distributed training.

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=7122 main_pretrain.py
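
torch.distributed.launch spawns one process per GPU and passes --local_rank to each; inside main_pretrain.py the distributed setup presumably follows the standard PyTorch 1.7 pattern sketched below (illustrative, not the repo's exact code):

import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)   # set by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")   # MASTER_PORT (7122 above) comes from the env

model = torch.nn.Linear(768, 768).cuda()  # placeholder for the actual VIOLET model
model = DDP(model, device_ids=[args.local_rank])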

Here is our best pretrained checkpoint (YT180M+WebVid2.5M+CC3M).

Downstream

CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qamc.py _data/args_tgif-action.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qaoe.py _data/args_msvd-qa.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python main_retrieval.py _data/args_msrvtt-retrieval.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval.py _data/args_msrvtt-retrieval.json

We also provide all trained downstream checkpoints.
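
For the retrieval task, evaluation typically ranks all candidate videos for each query text by cross-modal similarity and reports recall@K. A minimal sketch of that metric, assuming precomputed, paired text and video embeddings (names are illustrative; eval_retrieval.py may differ):

import torch

def recall_at_k(text_emb, video_emb, ks=(1, 5, 10)):
    # text_emb, video_emb: (N, D); row i of each side forms a matched pair
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
    video_emb = torch.nn.functional.normalize(video_emb, dim=-1)
    sim = text_emb @ video_emb.T                    # (N, N) similarity matrix
    ranks = sim.argsort(dim=-1, descending=True)    # ranked video indices per text
    gt = torch.arange(len(sim)).unsqueeze(1)
    pos = (ranks == gt).nonzero()[:, 1]             # rank of the ground-truth video
    return {f"R@{k}": (pos < k).float().mean().item() for k in ks}

print(recall_at_k(torch.randn(100, 768), torch.randn(100, 768)))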

Citation

@inproceedings{fu2022empirical-mvm, 
  author = {Tsu-Jui Fu* and Linjie Li* and Zhe Gan and Kevin Lin and William Yang Wang and Lijuan Wang and Zicheng Liu}, 
  title = {An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling}, 
  booktitle = {arXiv:2209.01540}, 
  year = {2022} 
}
@inproceedings{fu2021violet, 
  author = {Tsu-Jui Fu and Linjie Li and Zhe Gan and Kevin Lin and William Yang Wang and Lijuan Wang and Zicheng Liu}, 
  title = {VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling}, 
  booktitle = {arXiv:2111.12681}, 
  year = {2021} 
}