
zengyan-97 / X-VLM

License: BSD-3-Clause
X-VLM: Multi-Grained Vision Language Pre-Training (ICML 2022)

Programming Languages

python

Projects that are alternatives of or similar to X-VLM

wikiHow paper list
A paper list of research conducted based on wikiHow
Stars: ✭ 25 (-91.17%)
Mutual labels:  vision-and-language
robo-vln
Pytorch code for ICRA'21 paper: "Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation"
Stars: ✭ 34 (-87.99%)
Mutual labels:  vision-and-language
MVGL
TCyb 2018: Graph learning for multiview clustering
Stars: ✭ 26 (-90.81%)
Mutual labels:  multimodality
VidSitu
[CVPR21] Visual Semantic Role Labeling for Video Understanding (https://arxiv.org/abs/2104.00990)
Stars: ✭ 41 (-85.51%)
Mutual labels:  vision-and-language
synse-zsl
Official PyTorch code for the ICIP 2021 paper 'Syntactically Guided Generative Embeddings For Zero Shot Skeleton Action Recognition'
Stars: ✭ 14 (-95.05%)
Mutual labels:  vision-and-language
rosita
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
Stars: ✭ 36 (-87.28%)
Mutual labels:  vision-and-language
calvin
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Stars: ✭ 105 (-62.9%)
Mutual labels:  vision-and-language
lang2seg
Referring Expression Object Segmentation with Caption-Aware Consistency, BMVC 2019
Stars: ✭ 30 (-89.4%)
Mutual labels:  vision-and-language
MIA
Code for "Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations" (NeurIPS 2019)
Stars: ✭ 57 (-79.86%)
Mutual labels:  vision-and-language
emmental
A deep learning framework for building multimodal multi-task learning systems.
Stars: ✭ 93 (-67.14%)
Mutual labels:  multimodality
iMIX
A framework for Multimodal Intelligence research from Inspur HSSLAB.
Stars: ✭ 21 (-92.58%)
Mutual labels:  vision-and-language
clip playground
An ever-growing playground of notebooks showcasing CLIP's impressive zero-shot capabilities
Stars: ✭ 80 (-71.73%)
Mutual labels:  vision-and-language
fuse-med-ml
A python framework accelerating ML based discovery in the medical field by encouraging code reuse. Batteries included :)
Stars: ✭ 66 (-76.68%)
Mutual labels:  multimodality
CBP
Official Tensorflow Implementation of the AAAI-2020 paper "Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction"
Stars: ✭ 52 (-81.63%)
Mutual labels:  vision-and-language
FEDOT
Automated modeling and machine learning framework FEDOT
Stars: ✭ 312 (+10.25%)
Mutual labels:  multimodality
TRAR-VQA
[ICCV 2021] TRAR: Routing the Attention Spans in Transformers for Visual Question Answering -- Official Implementation
Stars: ✭ 49 (-82.69%)
Mutual labels:  vision-and-language
just-ask
[TPAMI Special Issue on ICCV 2021 Best Papers, Oral] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
Stars: ✭ 57 (-79.86%)
Mutual labels:  vision-and-language
pytorch violet
A PyTorch implementation of VIOLET
Stars: ✭ 119 (-57.95%)
Mutual labels:  vision-and-language
clip-guided-diffusion
A CLI tool/python module for generating images from text using guided diffusion and CLIP from OpenAI.
Stars: ✭ 260 (-8.13%)
Mutual labels:  multimodality
VectorNet
Pytorch implementation of CVPR2020 paper “VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation”
Stars: ✭ 88 (-68.9%)
Mutual labels:  multimodality

X-VLM: learning multi-grained vision language alignments

Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts. Yan Zeng, Xinsong Zhang, Hang Li. arXiv 2021.

  • Nov 2022: Release X2-VLM: All-In-One for Vision Language Tasks; All-In-One == Image + Video + Transfer to Other Languages / Domains
  • May 2022: The paper has been accepted by ICML 2022
  • Jan 2022: Release official PyTorch implementation and X-VLM checkpoints
  • Nov 2021: Release preprint in arXiv

X-VLM (216M parameters: swin-base + 6L text + 6L cross)

Hiring

We are looking for interns / FTEs at ByteDance AI-LAB (in Beijing / Shanghai)! If you are interested in working with us on vision language models, please send your resume to [email protected].

Features

  • Support several backbones
    • vision encoder: deit / clip-vit / swin-transformer
    • text encoder: bert / roberta
  • Support apex O1 / O2 for pre-training
  • Read from and write to HDFS
  • Distributed training across nodes for both pre-training and fine-tuning

Please read the code for more details.
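
For context, the Apex O1 / O2 options above refer to NVIDIA Apex mixed-precision levels. The snippet below is a minimal, standalone sketch of how Apex amp is typically wired in; it is not taken from this repo and assumes Apex and a GPU are installed.

# Minimal sketch, not from this repo: standard Apex amp usage for O1/O2 mixed precision.
import torch
from apex import amp  # requires NVIDIA Apex

model = torch.nn.Linear(768, 768).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# "O1" patches ops to autocast; "O2" keeps the model almost entirely in FP16.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

loss = model(torch.randn(8, 768, device="cuda")).mean()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()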

Requirements

  • Install a python3 environment:
pip3 install -r requirements.txt
  • Download the raw images from the corresponding websites
  • Download the json files we provide, which contain image read paths and captions and/or bbox annotations
  • If running pre-training scripts, also prepare the pre-training files marked with % below
  • Organize these files like this (% is for pre-training only); a quick layout check is sketched after the tree:
X-VLM/
    data/
        finetune/
            refcoco+/*.json
            *.json
        
        %pretrain_4m/*.json
        %swin_base_patch4_window7_224_22k.pth
        %bert-base-uncased/
            config.json
            pytorch_model.bin
            tokenizer_config.json
            tokenizer.json
            vocab.txt

    images/
        coco/
            train2014/*.jpg
            val2014/*.jpg
            test2015/*.jpg
        
        visualgenome/
            image/*.jpg
        
        nlvr2/
            images/
                train/0-99/*.png
            dev/*.png
            test1/*.png
        
        %sbu/*.jpg
        %cc-3m/*.jpg
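
Below is a hypothetical helper (not part of the repo) that double-checks the layout above before launching any script; the path list mirrors the tree and can be extended as needed.

# Hypothetical layout check, mirroring the tree above; extend the lists as needed.
from pathlib import Path

REQUIRED = [
    "data/finetune",
    "images/coco/train2014",
    "images/coco/val2014",
    "images/coco/test2015",
    "images/visualgenome/image",
    "images/nlvr2/images/train",
]
PRETRAIN_ONLY = [  # the % entries above
    "data/pretrain_4m",
    "data/swin_base_patch4_window7_224_22k.pth",
    "data/bert-base-uncased",
]

def check_layout(root=".", pretraining=False):
    paths = REQUIRED + (PRETRAIN_ONLY if pretraining else [])
    missing = [p for p in paths if not (Path(root) / p).exists()]
    for p in missing:
        print(f"missing: {p}")
    return not missing

if __name__ == "__main__":
    check_layout(pretraining=True)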

Pretrain

python3 run.py --task "pretrain_4m_base" --dist "1" --output_dir "output/pretrain_4m_base"

For distributed training across nodes, see run.py for details. To allow a fair comparison with some recent works, we pre-trained X-VLM (4M/16M) for 200K steps.

Data


🌟 UPDATE: our multi-lingual multi-modal project Cross-View Language Modeling has released the text of COCO+VG+SBU+CC3M and the object and region annotations in six languages. You can use the English text for X-VLM pre-training.


All datasets we used are publicly available, but we cannot redistribute them, so please prepare the pre-training data yourself. We provide some data examples below; see ImageTextJsonDataset and RegionTextJsonDataset in dataset/pretrain_dataset.py for details.

# image-caption pairs, providing 'binary' or 'image_rpath'
{'caption': 'dog on bike in harajuku', 
 'binary': binary_encoding_of_the_image, 
 'image_rpath': local_rpath_of_the_image
}


# object/region annotations, providing 'binary' or 'image_rpath' 
{'elems': [{'caption': 'lady sitting at table that has pizza on it',  # str or list of str  
            'bb': [155, 0, 205, 131]   # (x, y, w, h)
            }, 
           {'caption': 'window',  
            'attributes': 'closed',  # str or list of str 
            'bb': [20, 130, 335, 185]
            },
          ],
 'caption': if_exist,  # str or list of str 
 'binary': binary_encoding_of_the_image, 
 'image_rpath': local_rpath_of_the_image
}
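
A hypothetical script for assembling records in the shape shown above. Only 'image_rpath' is used here, and the paths and one-record-per-line format are assumptions; the exact binary encoding and file format should be confirmed against dataset/pretrain_dataset.py.

# Hypothetical example, not from the repo: write records shaped like the examples above.
import json

caption_record = {
    "caption": "dog on bike in harajuku",
    "image_rpath": "images/cc-3m/000001.jpg",  # hypothetical local path
}

region_record = {
    "elems": [
        {"caption": "lady sitting at table that has pizza on it",
         "bb": [155, 0, 205, 131]},  # (x, y, w, h)
        {"caption": "window",
         "attributes": "closed",
         "bb": [20, 130, 335, 185]},
    ],
    "image_rpath": "images/visualgenome/image/1.jpg",  # hypothetical local path
}

with open("data/pretrain_4m/example.json", "w") as f:
    for record in (caption_record, region_record):
        f.write(json.dumps(record) + "\n")  # one record per line (assumption)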

Checkpoints

X-VLM (4M, 200K steps)
X-VLM (16M, 200K steps)
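
The released .th files can be inspected with plain PyTorch before fine-tuning. In the sketch below, the 'model' key is an assumption, since the top-level layout depends on how the training script saved the state.

# Sketch only: peek into a released checkpoint; the 'model' key is an assumption.
import torch

state = torch.load("4m_base_model_state_step_199999.th", map_location="cpu")
weights = state.get("model", state) if isinstance(state, dict) else state
print(f"{len(weights)} entries; sample keys: {list(weights)[:5]}")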

Finetune

Datasets for fine-tuning and checkpoints of X-VLM (4M/16M) can be downloaded via the following links.

Data

download json files

Checkpoints and Logs (16M)

retrieval-mscoco
retrieval-flickr
vqa
nlvr2
refcoco
refcoco-weak
captioning-coco

Checkpoints and Logs (4M)

4m-all-ft-ckpts.tar

Examples

# train
python3 run.py --task "vqa" --dist "1" --output_dir "output/vqa" --checkpoint "4m_base_model_state_step_199999.th"

# train: if using >2 nodes for fine-tuning, specify --output_hdfs to save some tmp results; it is only required by vqa & refcoco 
python3 run.py --task "vqa" --dist "all" --output_dir "output/vqa" --output_hdfs "hdfs://xxx/vqa_tmp" --checkpoint "4m_base_model_state_step_199999.th"  

# evaluate
python3 run.py --task "vqa" --dist "1" --evaluate --output_dir "output/vqa_eval" --checkpoint "4m_base_finetune/vqa/model_state_epoch_9.th"

Specify "--task" to finetune on image-text retrieval, nlvr2, visual grounding, or image captioning. See run.py for details.

More Examples of Captioning:

# adapt cross-modal encoder + MLM head -> lm decoder; subsequent fine-tuning is included   
python3 run.py --task "coco_capt_domain" --dist "1" --output_dir "output/coco_capt_domain" --checkpoint "4m_base_model_state_step_199999.th"

# fine-tune only; evaluate is included 
python3 run.py --task "coco_captioning" --dist "1" --output_dir "output/coco_captioning" --checkpoint "4m_base_finetune/coco_caption/lm_domain_pretrain.th"
# evaluate only
python3 run.py --task "coco_captioning" --dist "1" --output_dir "output/coco_captioning" --evaluate --checkpoint "4m_base_finetune/coco_caption/coco_capt_ft_epoch_4.th"

# further CIDEr optimization; evaluate is included 
python3 run.py --task "coco_captioning_scst" --dist "1" --output_dir "output/coco_captioning_scst" --checkpoint "4m_base_finetune/coco_caption/coco_capt_ft_epoch_4.th"
# evaluate only
python3 run.py --task "coco_captioning" --dist "1" --output_dir "output/coco_captioning_scst" --evaluate --checkpoint "4m_base_finetune/coco_caption/coco_capt_cider_step_41000.th"

To make a fair comparison, we follow previous works for fine-tuning, so some scripts are based on ALBEF, OSCAR, and BLIP. We thank the authors for open-sourcing their code.

Evaluation on VLUE

VLUE is a new out-of-distribution (OOD) benchmark for evaluating vision-language models; it has been accepted by ICML 2022.

python3 run.py --task "eval_vlue_itr" --dist "1" --evaluate  --output_dir "output/" --checkpoint "itr_coco/checkpoint_9.pth"

python3 run.py --task "eval_vlue_vqa" --dist "1" --evaluate  --output_dir "output/" --checkpoint "vqa/model_state_epoch_9.th"

python3 run.py --task "eval_vlue_nlvr" --dist "1" --evaluate  --output_dir "output/" --checkpoint "nlvr/nlvr_ft/checkpoint_best.pth"

python3 run.py --task "eval_vlue_refcoco" --dist "1" --evaluate  --output_dir "output/" --checkpoint "refcoco_bbox/checkpoint_best.pth"

python3 run.py --task "eval_vlue_refcoco_weakly" --dist "1" --evaluate  --output_dir "output/" --checkpoint "refcoco/checkpoint_best.pth"

Citation

If you find this repository useful, please consider giving it a star or citing:

@article{xvlm,
  title={Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts},
  author={Zeng, Yan and Zhang, Xinsong and Li, Hang},
  journal={arXiv preprint arXiv:2111.08276},
  year={2021}
}

Contact

For issues using this code, please submit a GitHub issue.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].