
zichengsaber / LAVT-pytorch

License: MIT
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Programming Languages

Jupyter Notebook
Python

Projects that are alternatives to or similar to LAVT-pytorch

SCINet
Forecast time series and stock prices with SCINet
Stars: ✭ 28 (+75%)
Mutual labels:  state-of-the-art
RSTNet
RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words (CVPR 2021)
Stars: ✭ 71 (+343.75%)
Mutual labels:  multimodal
hardware-attacks-state-of-the-art
Microarchitectural exploitation and other hardware attacks.
Stars: ✭ 29 (+81.25%)
Mutual labels:  state-of-the-art
strollr2d icassp2017
Image Denoising Codes using STROLLR learning, the Matlab implementation of the paper in ICASSP2017
Stars: ✭ 22 (+37.5%)
Mutual labels:  state-of-the-art
nemar
[CVPR2020] Unsupervised Multi-Modal Image Registration via Geometry Preserving Image-to-Image Translation
Stars: ✭ 120 (+650%)
Mutual labels:  multimodal
WearableSensorData
This repository provides the codes and data used in our paper "Human Activity Recognition Based on Wearable Sensor Data: A Standardization of the State-of-the-Art", where we implement and evaluate several state-of-the-art approaches, ranging from handcrafted-based methods to convolutional neural networks.
Stars: ✭ 65 (+306.25%)
Mutual labels:  state-of-the-art
fairytale
encode.ru community archiver
Stars: ✭ 29 (+81.25%)
Mutual labels:  state-of-the-art
NER-Multimodal-pytorch
Pytorch Implementation of "Adaptive Co-attention Network for Named Entity Recognition in Tweets" (AAAI 2018)
Stars: ✭ 42 (+162.5%)
Mutual labels:  multimodal
best AI papers 2021
A curated list of the latest breakthroughs in AI (in 2021) by release date with a clear video explanation, link to a more in-depth article, and code.
Stars: ✭ 2,740 (+17025%)
Mutual labels:  state-of-the-art
Diverse-Structure-Inpainting
CVPR 2021: "Generating Diverse Structure for Image Inpainting With Hierarchical VQ-VAE"
Stars: ✭ 131 (+718.75%)
Mutual labels:  multimodal
iMIX
A framework for Multimodal Intelligence research from Inspur HSSLAB.
Stars: ✭ 21 (+31.25%)
Mutual labels:  multimodal
Deep-multimodal-subspace-clustering-networks
Tensorflow implementation of "Deep Multimodal Subspace Clustering Networks"
Stars: ✭ 62 (+287.5%)
Mutual labels:  multimodal
CompareModels TRECQA
Compare six baseline deep learning models on TrecQA
Stars: ✭ 61 (+281.25%)
Mutual labels:  state-of-the-art
lipnet
LipNet with gluon
Stars: ✭ 16 (+0%)
Mutual labels:  multimodal
docarray
The data structure for unstructured data
Stars: ✭ 561 (+3406.25%)
Mutual labels:  multimodal
img2dataset
Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20h on one machine.
Stars: ✭ 1,173 (+7231.25%)
Mutual labels:  multimodal
delving-deeper-into-the-decoder-for-video-captioning
Source code for Delving Deeper into the Decoder for Video Captioning
Stars: ✭ 36 (+125%)
Mutual labels:  state-of-the-art
MVGL
TCyb 2018: Graph learning for multiview clustering
Stars: ✭ 26 (+62.5%)
Mutual labels:  multimodal
HugsVision
HugsVision is an easy-to-use HuggingFace wrapper for state-of-the-art computer vision
Stars: ✭ 154 (+862.5%)
Mutual labels:  state-of-the-art
Recommender-Systems-with-Collaborative-Filtering-and-Deep-Learning-Techniques
Implemented User Based and Item based Recommendation System along with state of the art Deep Learning Techniques
Stars: ✭ 41 (+156.25%)
Mutual labels:  state-of-the-art

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Where are we?

12.27: There is still a gap of about 1% from the original paper, but the results already beat many SOTA methods.

| ckpt__448_epoch_25.pth | mIoU   | Overall IoU | Prec@0.5 |
| ---------------------- | ------ | ----------- | -------- |
| RefCOCO val            | 70.743 | 71.671      | 82.26    |
| RefCOCO testA          | 73.679 | 74.772      | -        |
| RefCOCO testB          | 67.582 | 67.339      | -        |
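
For reference, here is a minimal sketch of how these three metrics are typically computed for referring segmentation, assuming binary (bool) prediction and ground-truth masks; it is an illustration, not the exact evaluation code in this repo.

```python
import torch

def iou(pred: torch.Tensor, gt: torch.Tensor) -> float:
    """IoU between two binary (bool) masks of shape (H, W)."""
    inter = (pred & gt).sum().item()
    union = (pred | gt).sum().item()
    return inter / union if union > 0 else 0.0

def evaluate(preds, gts, thr=0.5):
    """preds/gts: lists of binary masks, one pair per referring expression."""
    ious, total_inter, total_union = [], 0, 0
    for p, g in zip(preds, gts):
        ious.append(iou(p, g))
        total_inter += (p & g).sum().item()
        total_union += (p | g).sum().item()
    mean_iou = sum(ious) / len(ious)                      # mIoU: average of per-sample IoUs
    overall_iou = total_inter / total_union               # Overall IoU: dataset-level intersection / union
    prec_at_thr = sum(i > thr for i in ious) / len(ious)  # Prec@0.5: fraction of samples with IoU > 0.5
    return mean_iou, overall_iou, prec_at_thr
```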

12.29: The epoch-45 results improved by roughly another 1%.

| ckpt__448_epoch_45.pth | mIoU   | Overall IoU |
| ---------------------- | ------ | ----------- |
| RefCOCO val            | 71.949 | 72.246      |
| RefCOCO testA          | 74.533 | 75.467      |
| RefCOCO testB          | 67.849 | 68.123      |

The pretrained model will be released soon.

A reproduction of the original paper.

Paper: https://arxiv.org/abs/2112.02244

Official implementation: https://github.com/yz93/LAVT-RIS

Architecture

Features

  • Moves the fusion of the two modalities' features forward into the image-encoder stages (see the hedged sketch after this list)

  • The approach borrows many ideas from these two papers:

    • Vision-Language Transformer and Query Generation for Referring Segmentation

    • Locate then Segment: A Strong Pipeline for Referring Image Segmentation

  • Uses the relatively new Swin Transformer backbone
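
As a rough illustration of the first point, here is a minimal, hypothetical sketch of language–vision fusion inserted after an encoder stage, in the spirit of LAVT's attention-plus-gate design; the module names, shapes, and gate form are assumptions, not this repo's actual code.

```python
import torch
import torch.nn as nn

class LanguageGatedFusion(nn.Module):
    """Hypothetical fusion block: attend visual tokens to language tokens,
    then mix the attended result back into the visual features via a learned gate."""
    def __init__(self, vis_dim: int, lang_dim: int, heads: int = 8):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, vis_dim)
        self.cross_attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(vis_dim, vis_dim), nn.Tanh())

    def forward(self, vis, lang):
        # vis: (B, N_visual_tokens, vis_dim); lang: (B, N_words, lang_dim)
        lang = self.lang_proj(lang)
        attended, _ = self.cross_attn(query=vis, key=lang, value=lang)
        return vis + self.gate(attended) * attended  # gated residual fusion

# Conceptually, each Swin stage i then becomes:
#   x = swin_stage_i(x)              # visual features of this stage
#   x = fusion_i(x, bert_features)   # inject language before the next stage
```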

Usage

See args.py for the detailed argument settings.

For training:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 main.py --batch_size 2 --cfg_file configs/swin_base_patch4_window7_224.yaml --size 448
```

For evaluation:

```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch --nproc_per_node 4 --master_port 23458 main.py --size 448 --batch_size 1 --resume --eval --type val --eval_mode cat --pretrain ckpt_448_epoch_20.pth --cfg_file configs/swin_base_patch4_window7_224.yaml
```

All *.pth checkpoints are placed in ./checkpoint.

To resume from a checkpoint:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12346 main.py --batch_size 2 --cfg_file configs/swin_base_patch4_window7_224.yaml --size 448 --resume --pretrain ckpt_448_epoch_10.pth
```

For dataset preparation:

See ./data/readme.md for details.

To be finished

Because the official code had not been released when I wrote this reproduction, some implementation details may differ from the official code.

  • For the Swin Transformer backbone I chose swin_base_patch4_window12_384_22k.pth; see the official code at https://github.com/microsoft/Swin-Transformer/blob/main/get_started.md. The original paper resizes images to 480*480, but whenever I tried that size with the official code I got errors; after looking into it, the object-detection variant of the Swin Transformer code seems to be the better starting point.

    12.27: This issue has now been largely resolved; training currently uses swin_base_patch4_window7_224_22k.pth with input images resized to 448*448.

    For the fix, see:

    https://github.com/microsoft/Swin-Transformer/issues/155

  • The original paper uses a polynomial learning-rate decay scheduler but does not give its exact parameters, so I set them by hand (a hedged sketch follows this list).

    12.21: So far my hand-picked settings do not seem very good.

    12.27: After adjusting the settings, the initial learning rate clearly matters a lot, especially scaling the initial learning rate with the batch size.

  • The original paper uses batch_size=32; based on my experiments I suspect this means 8 GPUs with a per-GPU batch size of 4. This is my first time writing DDP code, and during training the program always allocates what looks like shared GPU memory on RANK 0 for the other ranks, so I cannot yet match the paper's configuration; this needs improvement.

  • Looking closely at the RefCOCO dataset, one target often corresponds to several sentences. During training I randomly pick one sentence; at evaluation time it seems better to use all of them, and I came up with two evaluation modes (sketched after this list).

    Currently eval only supports batch_size=1.

    • Concatenate all sentences into one sentence and feed it to BERT; the input has the form (Image, cat(sent_1, sent_2, sent_3)) => model => pred.

    Experiments show that this eval_mode gives a noticeably better mean IoU and a slightly better overall IoU.

    • Run the model on the same image once per sentence and average the results; the input has the form ((Image, sent_1), (Image, sent_2), (Image, sent_3)) => model => average(pred_1, pred_2, pred_3).
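
Regarding the scheduler item above, here is a minimal sketch of polynomial learning-rate decay via LambdaLR together with linear scaling of the initial learning rate with the global batch size; the base LR, decay power, reference batch size, and optimizer are assumptions for illustration, not values taken from the paper or this repo.

```python
import torch

# Assumed values, for illustration only.
base_lr = 5e-5           # reference LR at the reference batch size
ref_batch_size = 32      # batch size the reference LR is tuned for
global_batch_size = 8    # e.g. 4 GPUs * per-GPU batch_size 2
power = 0.9              # polynomial decay exponent
max_iters = 100_000      # total training iterations

# Linear scaling rule: scale the initial LR with the global batch size.
init_lr = base_lr * global_batch_size / ref_batch_size

model = torch.nn.Linear(10, 10)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=init_lr, weight_decay=1e-2)

# Polynomial decay: lr(t) = init_lr * (1 - t / max_iters) ** power
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda it: (1 - it / max_iters) ** power)
# Call scheduler.step() once per iteration, after optimizer.step().
```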
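
And for the two evaluation modes, a schematic sketch; the model call, tokenizer, and threshold are hypothetical placeholders rather than this repo's actual interfaces.

```python
import torch

@torch.no_grad()
def eval_concat(model, tokenizer, image, sentences):
    """Mode 'cat': concatenate all sentences into a single text input."""
    tokens = tokenizer(" ".join(sentences), return_tensors="pt")
    return model(image, tokens) > 0.5            # single prediction for the target

@torch.no_grad()
def eval_average(model, tokenizer, image, sentences):
    """Mode 'avg': run the model once per sentence and average the predictions."""
    probs = [model(image, tokenizer(s, return_tensors="pt")) for s in sentences]
    return torch.stack(probs).mean(dim=0) > 0.5  # averaged, then thresholded
```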

Visualization

See inference.ipynb for details.

Input sentences

  1. right girl
  2. closest girl on right

Results

Failure case study

AnalysisFailure.ipynb provides one way to study where the model does not work: it filters out the cases with IoU < 0.5 and, among those, looks closely at the examples with IoU < 0.1 and with 0.4 < IoU < 0.5.
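
A minimal sketch of that filtering step, assuming a list of per-sample IoUs has already been computed; the variable names and values are illustrative, not taken from the notebook.

```python
# Illustrative per-sample IoUs on the validation split: (sample_id, iou) pairs.
per_sample = [("ref_001", 0.07), ("ref_002", 0.46), ("ref_003", 0.83)]

failures   = [(i, v) for i, v in per_sample if v < 0.5]        # all failure cases
severe     = [(i, v) for i, v in failures if v < 0.1]          # almost no overlap
borderline = [(i, v) for i, v in failures if 0.4 < v < 0.5]    # just below the threshold
```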

So far I have only looked at a limited number of failure cases; my observations are:

  • Under language guidance, the model localizes similar, densely packed objects imprecisely.
  • The model does not distinguish the primary from the secondary parts of the language expression.
  • Some of RefCOCO's own annotations are problematic.