mmaaz60 / mvits_for_class_agnostic_od

License: MIT
Multi-modal Transformers Excel at Class-agnostic Object Detection

Programming Languages

Python
139335 projects - #7 most used programming language
CUDA
1817 projects
Shell
77523 projects
C++
36643 projects - #6 most used programming language

Projects that are alternatives to or similar to mvits_for_class_agnostic_od

TopicNet
Interface for easier topic modelling.
Stars: ✭ 127 (+7.63%)
Mutual labels:  multimodal-learning
MSAF
Official implementation of the paper "MSAF: Multimodal Split Attention Fusion"
Stars: ✭ 47 (-60.17%)
Mutual labels:  multimodal-learning
pykale
Knowledge-Aware machine LEarning (KALE): accessible machine learning from multiple sources for interdisciplinary research, part of the 🔥PyTorch ecosystem
Stars: ✭ 381 (+222.88%)
Mutual labels:  multimodal-learning
slp
Utils and modules for Speech Language and Multimodal processing using pytorch and pytorch lightning
Stars: ✭ 17 (-85.59%)
Mutual labels:  multimodal-learning
vista-net
Code for the paper "VistaNet: Visual Aspect Attention Network for Multimodal Sentiment Analysis", AAAI'19
Stars: ✭ 67 (-43.22%)
Mutual labels:  multimodal-learning
factorized
[ICLR 2019] Learning Factorized Multimodal Representations
Stars: ✭ 49 (-58.47%)
Mutual labels:  multimodal-learning
awesome-multimodal-ml
Reading list for research topics in multimodal machine learning
Stars: ✭ 3,125 (+2548.31%)
Mutual labels:  multimodal-learning
CoVA-Web-Object-Detection
A Context-aware Visual Attention-based training pipeline for Object Detection from a Webpage screenshot!
Stars: ✭ 18 (-84.75%)
Mutual labels:  multimodal-learning
just-ask
[TPAMI Special Issue on ICCV 2021 Best Papers, Oral] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
Stars: ✭ 57 (-51.69%)
Mutual labels:  multimodal-learning
multimodal-vae-public
A PyTorch implementation of "Multimodal Generative Models for Scalable Weakly-Supervised Learning" (https://arxiv.org/abs/1802.05335)
Stars: ✭ 98 (-16.95%)
Mutual labels:  multimodal-learning
TFLite-Mobile-Generic-Object-Localizer
Python TFLite scripts for detecting objects of any class in an image without knowing their label.
Stars: ✭ 42 (-64.41%)
Mutual labels:  class-agnostic-detection
ONNX-ImageNet-1K-Object-Detector
Python scripts for performing object detection with the 1000 labels of the ImageNet dataset in ONNX. The repository combines a class-agnostic object localizer to first detect the objects in the image; a ResNet50 model trained on ImageNet then labels each box.
Stars: ✭ 18 (-84.75%)
Mutual labels:  class-agnostic-detection

MViTs Excel at Class-agnostic Object Detection


Multi-modal Vision Transformers Excel at Class-agnostic Object Detection

Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer and Ming-Hsuan Yang

Paper: https://arxiv.org/abs/2111.11430

🚀 News

  • (Feb 01, 2022)
    • Training code for the MDef-DETR and MDef-DETR minus Language models is released -> training/README.md
    • Instructions for using the class-agnostic object detection behavior of MDef-DETR in different applications are released -> applications/README.md
    • All pretrained models (MDef-DETR, Def-DETR, MDETR, DETReg, Faster-RCNN, RetinaNet, ORE, and others), along with instructions to reproduce the results, are released -> this link
  • (Nov 25, 2021) Evaluation code, along with pre-trained models and pre-computed predictions, is released -> evaluation/README.md


Abstract: What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and for unseen objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. To bridge this gap, we explore recent Multi-modal Vision Transformers (MViT) that have been trained with aligned image-text pairs. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs to localize generic objects in images. Based on these findings, we develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention that can adaptively generate proposals given a specific language query. We show the significance of MViT proposals in a diverse range of applications including open-world object detection, salient and camouflage object detection, supervised and self-supervised detection tasks. Further, MViTs offer enhanced interactability with intelligible text queries.


Architecture overview of MViTs used in this work



Installation

The code is tested with PyTorch 1.8.0 and CUDA 11.1. After cloning the repository, follow the steps below to install:

  1. Install PyTorch and torchvision
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
  2. Install the other dependencies
pip install -r requirements.txt
  3. Compile the Deformable Attention modules
cd models/ops
sh make.sh
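
After compilation, a quick environment check (a minimal sketch, not part of the repository) can confirm that the expected PyTorch/CUDA stack is in place:

    # check_env.py: sanity-check the tested PyTorch 1.8.0 / CUDA 11.1 stack
    import torch
    import torchvision

    print(f"torch {torch.__version__}, torchvision {torchvision.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"CUDA build: {torch.version.cuda}")  # expected: 11.1
        print(f"device: {torch.cuda.get_device_name(0)}")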

Results


Class-agnostic OD performance of MViTs in comparison with a uni-modal detector (RetinaNet) on several datasets. MViTs show consistently good results on all datasets.

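For reference, class-agnostic evaluation ignores category labels entirely: every predicted and every ground-truth box counts as a single "object" class. Below is a minimal recall sketch under that convention; the function and its names are illustrative, not part of this repository:

    import torch
    from torchvision.ops import box_iou

    def class_agnostic_recall(pred_boxes, gt_boxes, iou_thresh=0.5):
        """Fraction of ground-truth boxes covered by at least one prediction.

        Both inputs are (N, 4) tensors in (x1, y1, x2, y2) format; class
        labels are deliberately ignored (illustrative sketch).
        """
        if len(gt_boxes) == 0:
            return 1.0
        if len(pred_boxes) == 0:
            return 0.0
        ious = box_iou(pred_boxes, gt_boxes)       # (num_pred, num_gt)
        matched = (ious >= iou_thresh).any(dim=0)  # is each GT box matched?
        return matched.float().mean().item()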


Enhanced Interactability: Effect of using different intuitive text queries on the MDef-DETR class-agnostic OD performance. Combining detections from multiple queries captures varying aspects of objectness.

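This multi-query behavior can be emulated by running the detector once per text query, pooling the boxes, and removing duplicates with NMS. A minimal sketch follows; the detect callable and the query list are illustrative assumptions, not this repository's API:

    import torch
    from torchvision.ops import nms

    QUERIES = ["all objects", "all entities", "all visible objects and entities"]  # illustrative

    def detect_with_queries(detect, image, queries=QUERIES, iou_thresh=0.5):
        """Pool detections from several intuitive text queries, then de-duplicate.

        Assumes detect(image, query) returns (boxes, scores), with boxes as an
        (N, 4) tensor and scores as an (N,) tensor.
        """
        all_boxes, all_scores = [], []
        for query in queries:
            boxes, scores = detect(image, query)
            all_boxes.append(boxes)
            all_scores.append(scores)
        boxes, scores = torch.cat(all_boxes), torch.cat(all_scores)
        keep = nms(boxes, scores, iou_thresh)  # drop near-duplicate boxes across queries
        return boxes[keep], scores[keep]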


Language Skeleton/Structure: Experimental analysis exploring the contribution of language by removing all textual inputs while maintaining the structure introduced by captions. All experiments are performed on Def-DETR. In setting 1, annotations corresponding to the same images are combined. Setting 2 additionally applies NMS to remove duplicate boxes. In setting 3, four to eight boxes are randomly grouped in each iteration. In setting 4, the same model is trained longer. In setting 5, the dataloader structure corresponding to captions is kept intact. Results from setting 5 demonstrate the importance of the structure introduced by language.



Generalization to Rare/Novel Classes: MDef-DETR class-agnostic OD performance on rarely and frequently occurring categories in the pretraining captions. The numbers on top of the bars indicate the occurrences of the corresponding category in the training dataset. The MViT achieves good recall even for classes with few or no occurrences.



Open-world Object Detection: Effect of using class-agnostic OD proposals from MDef-DETR for pseudo-labelling of unknowns in the Open World Detector (ORE).

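In spirit, the pseudo-labelling step keeps high-scoring class-agnostic proposals that do not overlap any known-class ground-truth box and marks them as "unknown". A minimal sketch with illustrative names and thresholds (not taken from the ORE codebase):

    import torch
    from torchvision.ops import box_iou

    def pseudo_label_unknowns(proposals, scores, known_gt_boxes,
                              iou_thresh=0.5, top_k=5):
        """Pick top-scoring proposals that no known-class box explains."""
        if len(known_gt_boxes) > 0:
            max_overlap = box_iou(proposals, known_gt_boxes).max(dim=1).values
            keep = max_overlap < iou_thresh  # unexplained by any known object
            proposals, scores = proposals[keep], scores[keep]
        order = scores.argsort(descending=True)[:top_k]
        return proposals[order]  # boxes to pseudo-label as "unknown"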


Pretraining for Class-aware Object Detection: Effect of using MDef-DETR proposals instead of Selective Search proposals for pre-training DETReg.



Evaluation

Please refer to evaluation/class_agnostic_od/README.md.


Training

Please refer to training/README.md.

Applications

Please refer to applications/README.md.

Citation

If you use our work, please consider citing:

    @article{Maaz2021Multimodal,
        title={Multi-modal Transformers Excel at Class-agnostic Object Detection},
        author={Muhammad Maaz and Hanoona Rasheed and Salman Khan and Fahad Shahbaz Khan and Rao Muhammad Anwer and Ming-Hsuan Yang},
        journal={arXiv preprint arXiv:2111.11430},
        year={2021}
    }

Contact

Should you have any questions, please create an issue on this repository or contact us at [email protected] or [email protected].
