
DRSY / MoTIS

Licence: other
Mobile (iOS) Text-to-Image search powered by multimodal semantic representation models (e.g., OpenAI's CLIP). Accepted at NAACL 2022.

Programming Languages

Swift, C++, Objective-C++, Objective-C, Ruby

Projects that are alternatives of or similar to MoTIS

cherche
📑 Neural Search
Stars: ✭ 196 (+226.67%)
Mutual labels:  retrieval, semantic-search, vector-search
skmeans
Super fast simple k-means implementation for unidimensional and multidimensional data.
Stars: ✭ 59 (-1.67%)
Mutual labels:  k-means, k-means-clustering
MachineLearningSeries
Videos and code from Universo Discreto teaching the fundamentals of Machine Learning in Python. For more details, follow the listed playlist.
Stars: ✭ 20 (-66.67%)
Mutual labels:  k-means, k-means-clustering
text-cluster
🍡 Text clustering: the k-means algorithm in practice
Stars: ✭ 40 (-33.33%)
Mutual labels:  k-means, k-means-clustering
natural-language-joint-query-search
Search photos on Unsplash based on OpenAI's CLIP model, support search with joint image+text queries and attention visualization.
Stars: ✭ 143 (+138.33%)
Mutual labels:  image-search, clip
Jina
Cloud-native neural search framework for 𝙖𝙣𝙮 kind of data
Stars: ✭ 12,618 (+20930%)
Mutual labels:  image-search, semantic-search
ClusterAnalysis.jl
Cluster Algorithms from Scratch with Julia Lang. (K-Means and DBSCAN)
Stars: ✭ 22 (-63.33%)
Mutual labels:  k-means, k-means-clustering
Milvus
An open-source vector database for embedding similarity search and AI applications.
Stars: ✭ 9,015 (+14925%)
Mutual labels:  image-search, vector-search
pqlite
⚡ A fast embedded library for approximate nearest neighbor search
Stars: ✭ 141 (+135%)
Mutual labels:  image-search, vector-search
img classification deep learning
No description or website provided.
Stars: ✭ 19 (-68.33%)
Mutual labels:  image-search, knn
ALPR-Indonesia
Automatic license plate recognition for Indonesian plate (White on black)
Stars: ✭ 40 (-33.33%)
Mutual labels:  knn
UDLF
An Unsupervised Distance Learning Framework for Multimedia Retrieval
Stars: ✭ 40 (-33.33%)
Mutual labels:  retrieval
The-Supervised-Learning-Workshop
An Interactive Approach to Understanding Supervised Learning Algorithms
Stars: ✭ 24 (-60%)
Mutual labels:  knn
Fall-Detection-Dataset
FUKinect-Fall dataset was created using Kinect V1. The dataset includes walking, bending, sitting, squatting, lying and falling actions performed by 21 subjects between 19-72 years of age.
Stars: ✭ 16 (-73.33%)
Mutual labels:  knn
neural-compressor
Intel® Neural Compressor (formerly Intel® Low Precision Optimization Tool) aims to provide unified APIs for network compression techniques, such as low-precision quantization, sparsity, pruning, and knowledge distillation, across different deep learning frameworks in pursuit of optimal inference performance.
Stars: ✭ 666 (+1010%)
Mutual labels:  knowledge-distillation
spot price machine learning
Machine Learning for Spot Prices
Stars: ✭ 25 (-58.33%)
Mutual labels:  knn
fauxClip
Clipboard support for Vim without +clipboard
Stars: ✭ 32 (-46.67%)
Mutual labels:  clip
RETRO-pytorch
Implementation of RETRO, DeepMind's retrieval-based attention net, in PyTorch
Stars: ✭ 473 (+688.33%)
Mutual labels:  retrieval
MHCLN
Deep Metric and Hash Code Learning Network for Content Based Retrieval of Remote Sensing Images
Stars: ✭ 30 (-50%)
Mutual labels:  retrieval
MutualGuide
Localize to Classify and Classify to Localize: Mutual Guidance in Object Detection
Stars: ✭ 97 (+61.67%)
Mutual labels:  knowledge-distillation

Mobile Text-to-Image Search (MoTIS)

MoTIS is a minimal demo of semantic multimodal text-to-image search using pretrained vision-language models. Semantic search represents each sample (text or image) as a vector in a shared semantic embedding space. The relevance score is then measured as the similarity (e.g., cosine similarity or distance) between vectors.
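
For illustration, here is a minimal Swift sketch of that scoring step, assuming the dual encoder has already produced the embeddings; the function names are hypothetical, not the project's actual API:

```swift
import Foundation

/// Cosine similarity between two embedding vectors.
/// Higher values mean the text and image are more semantically similar.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "Embeddings must have the same dimensionality")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (normA.squareRoot() * normB.squareRoot() + 1e-8)
}

/// Rank a gallery of image embeddings against a text-query embedding
/// and return the indices of the top-k most relevant images.
func topKImages(queryEmbedding: [Float], imageEmbeddings: [[Float]], k: Int) -> [Int] {
    let scored = imageEmbeddings.enumerated().map { pair in
        (pair.offset, cosineSimilarity(queryEmbedding, pair.element))
    }
    return scored.sorted { $0.1 > $1.1 }.prefix(k).map { $0.0 }
}
```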

Recent Updates:

  • The paper describing the underlying compression algorithm of MoTIS has been accepted to the NAACL 2022 main conference!
  • The Android APK is available here.
  • 2-layer text encoder is released.
  • Android version is coming soon.
  • 4-layer text encoder is released.
  • We distilled the text encoder into a 6-layer counterpart of the original 12-layer Transformer; the resulting dual-encoder achieves even better performance than the one built with the 12-layer Transformer!
  • We use a pretrained ViT-Small (85MB) as initialization for the student model. Using the same distillation pipeline, it achieves even better results (2 points higher Hit@1) than the previous DeiT-small-distilled model. The JIT script checkpoint is available here.
  • A more effective distilled image encoder (84MB, compared to the original 350MB ViT-B/32 in CLIP) is available here. This image encoder is initialized with DeiT-base-distilled's pretrained weights, which leads to more robust image representations and hence better retrieval performance (higher Hit@1/5/10 than the original CLIP on the MSCOCO validation set). It is further trained through supervised learning and knowledge distillation.
  • Transplanted Spotify's Annoy approximate nearest neighbor search into this project (annoylib.h).
  • Before searching, all images in the gallery are displayed at a relatively low resolution to save memory. Meanwhile, in the background, the high-resolution versions of all images are used for encoding and indexing. When users actually start to search, the retrieved images are displayed at high resolution, since only the top-K search results are shown (see the sketch below).
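
This sketch uses the Photos framework; the target sizes and function names are illustrative assumptions, not the project's exact code:

```swift
import Photos
import UIKit

let imageManager = PHImageManager.default()

/// Request a small thumbnail for display in the gallery grid.
func requestThumbnail(for asset: PHAsset, completion: @escaping (UIImage?) -> Void) {
    let options = PHImageRequestOptions()
    options.deliveryMode = .fastFormat          // low resolution, low memory
    options.isNetworkAccessAllowed = true
    imageManager.requestImage(for: asset,
                              targetSize: CGSize(width: 128, height: 128),
                              contentMode: .aspectFill,
                              options: options) { image, _ in
        completion(image)
    }
}

/// Request a high-resolution version in the background for encoding and indexing.
func requestFullImageForEncoding(for asset: PHAsset, completion: @escaping (UIImage?) -> Void) {
    let options = PHImageRequestOptions()
    options.deliveryMode = .highQualityFormat   // full quality for the image encoder
    options.isSynchronous = false
    options.isNetworkAccessAllowed = true
    imageManager.requestImage(for: asset,
                              targetSize: CGSize(width: 224, height: 224), // assumed encoder input size
                              contentMode: .aspectFill,
                              options: options) { image, _ in
        completion(image)
    }
}
```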

Current Best Dual-Encoder TorchScript Files

Performance: These two combined achieve 40.4/68.5/78.4 R@1/R@5/R@10 on the MS COCO 2014 5K test set, matching a CLIP model (40.9/67.6/77.9) fine-tuned with a contrastive loss. On the 1K test split, our current best compressed dual-encoder achieves 61.2/87.6/94.2 R@1/R@5/R@10, while CLIP obtains 61.0/87.9/94.7.

Inference Speed: The image encoder is approximately 1.6 times faster than CLIP's ViT-B/32, and the text encoder is about 2.9 times faster than CLIP's text encoder.

Distilled Text Encoder Checkpoints

Model | Disk Space | Google Drive | R@10 on MS COCO 2014 5K test set
original CLIP | 224MB | https://drive.google.com/file/d/1583IT_K9cCkeHfrmuTpMbImbS5qB8SA1/view?usp=sharing | 64.5
fine-tuned CLIP | 224MB | - | 77.9
6-Layer Transformer with hard negatives | 170MB | https://drive.google.com/file/d/1isMy64zuWnggd9K63RMHG4fx6U4O-izE/view?usp=sharing | 79.4
4-Layer Transformer with hard negatives | 146MB | https://drive.google.com/file/d/1c83gD8NGT8v8RcE_E_rCrkqWN2RIzHEg/view?usp=sharing | 79.0
2-Layer Transformer with hard negatives | 121MB | https://drive.google.com/file/d/1QdWJw_29MWQnb9SgClwbM_9iZquB9QKT/view?usp=sharing | 78.4

Distilled Image Encoder Checkpoints

Model | Disk Space | Google Drive | R@10 on MS COCO 2014 5K test set
original CLIP | 336MB | https://drive.google.com/file/d/1K2wIyTuSWLTKBXzUlyTEsa4xXLNDuI7P/view?usp=sharing | 64.5
fine-tuned CLIP | 336MB | - | 77.9
ViT-small-patch16-224 | 85MB | https://drive.google.com/file/d/1s_oX0-HIELpjjrBXsjlofIbTGZ_Wllo0/view?usp=sharing | 68.9
ViT-small-patch16-224 (larger batch size) | 85MB | https://drive.google.com/file/d/1h_w9msJMB4F-dR6uNwp-BHeguS5QIrnE/view?usp=sharing | 68.3
ViT-small-patch16-224 (larger batch size and hard negatives sampled from the training set) | 85MB | https://drive.google.com/file/d/14AqCaORjxePrscdwUTGprII8siJ7ik8X/view?usp=sharing | 69.4
ViT-small-patch16-224 (larger batch size, bigger image corpus, and hard negatives sampled from the training set) | 85MB | https://drive.google.com/file/d/1q3dllreyVTofWh5JZywzWYHQlNgcRacq/view?usp=sharing | 69.9
ViT-small-patch16-224-ImageNet21K (larger batch size, bigger image corpus, and hard negatives sampled from the training set) | 85MB | https://drive.google.com/file/d/1Whacd4qeFuP_sair3yNGUeQTm4bshDYh/view?usp=sharing | 75.3

Note that these checkpoints are not raw state_dict() files, but TorchScript modules produced by the torch.jit.script operation. The same original CLIP text encoder is used for all of the image encoders above.

Features

  1. Text-to-image retrieval using semantic similarity search.
  2. Support for different vector indexing strategies (linear scan, KMeans, and random projection); a minimal KMeans sketch follows below.
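
The sketch below only illustrates the KMeans idea, reusing the cosineSimilarity helper from the earlier example: image embeddings are bucketed by their nearest centroid, and a query scans just the closest bucket. The type and method names are hypothetical, and the actual implementation (including persistence) may differ:

```swift
import Foundation

/// A minimal KMeans-style inverted index: embeddings are grouped by their
/// nearest centroid, and a query only scans the closest cluster.
struct KMeansIndex {
    let centroids: [[Float]]   // k cluster centers (learned offline)
    var clusters: [[Int]]      // image indices assigned to each centroid

    init(centroids: [[Float]]) {
        self.centroids = centroids
        self.clusters = Array(repeating: [], count: centroids.count)
    }

    mutating func add(imageIndex: Int, embedding: [Float]) {
        clusters[nearestCentroid(to: embedding)].append(imageIndex)
    }

    /// Candidate image indices for a query: the members of the closest cluster.
    func candidates(for queryEmbedding: [Float]) -> [Int] {
        clusters[nearestCentroid(to: queryEmbedding)]
    }

    private func nearestCentroid(to embedding: [Float]) -> Int {
        var best = 0
        var bestScore = -Float.greatestFiniteMagnitude
        for (i, centroid) in centroids.enumerated() {
            let score = cosineSimilarity(embedding, centroid)  // helper from the earlier sketch
            if score > bestScore { bestScore = score; best = i }
        }
        return best
    }
}
```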

Screenshot

  • Before searching: all images in the gallery (left)  =>  after searching with the query "Three cats" (right).

Installation

  1. Download the two TorchScript model files (text encoder, image encoder) into the models folder and add them to the Xcode project.
  2. Required dependencies are defined in the Podfile. We use CocoaPods to manage these dependencies. Simply run 'pod install' and then open the generated .xcworkspace project file in Xcode.
pod install
  3. By default, this demo loads all images in the local photo gallery on your real phone or the simulator. You can switch to a specific album by setting the albumName variable in the getPhotos method and replacing assetResults on line 117 of GalleryInteractor.swift with photoAssets (a rough sketch of the album fetch follows below).
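
This sketch uses the Photos framework and assumes photo-library authorization has already been granted; albumName and photoAssets mirror the identifiers mentioned above, while the helper itself is hypothetical and the surrounding code in GalleryInteractor.swift may differ:

```swift
import Photos

/// Fetch all image assets from the album whose title matches `albumName`,
/// or return nil if no such album exists.
func fetchAssets(inAlbumNamed albumName: String) -> PHFetchResult<PHAsset>? {
    // Find the user album with the given title.
    let collectionOptions = PHFetchOptions()
    collectionOptions.predicate = NSPredicate(format: "localizedTitle == %@", albumName)
    let collections = PHAssetCollection.fetchAssetCollections(with: .album,
                                                              subtype: .any,
                                                              options: collectionOptions)
    guard let album = collections.firstObject else { return nil }

    // Fetch its assets, newest first.
    let assetOptions = PHFetchOptions()
    assetOptions.sortDescriptors = [NSSortDescriptor(key: "creationDate", ascending: false)]
    let photoAssets = PHAsset.fetchAssets(in: album, options: assetOptions)
    return photoAssets
}
```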

Usage

Just type any keyword to search for relevant images. Type "reset" to return to the default view.

Todos

  • Basic features
  • Access to specified album or all photos
  • Asynchronous model loading and vectors computation
  • Export pretrained CLIP into TorchScript format using torch.jit.script and optimize_for_mobile provided by PyTorch
  • Transplant the original PIL-based image preprocessing into an OpenCV-based procedure (about 1% retrieval performance degradation observed)
  • Transplant the CLIP tokenizer from Python into Swift (Tokenizer.swift)
  • Indexing strategies
  • Linear indexing (persisted to file via the built-in Data type; see the sketch after this list)
  • KMeans indexing (persisted to file via NSMutableDictionary; the number of clusters is hard-coded but can be changed to any value)
  • Spotify's Annoy library with random projection indexing; the index file is 41MB for 2,200 images
  • Choices of semantic representation models
  • OpenAI's CLIP model
  • Integration of other multimodal retrieval models
  • Efficiency
  • Reducing the memory consumption of models: runtime memory reduced from 1GB to 490MB via a smaller yet effective distilled ViT model.
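
Below is a rough sketch of the linear-index persistence mentioned above, using the built-in Data type; the raw Float32 layout and helper names are illustrative assumptions, not necessarily the project's exact serialization format:

```swift
import Foundation

/// Serialize a list of Float embeddings to Data and write it to disk.
/// The layout is simply all Float32 values concatenated back to back.
func saveEmbeddings(_ embeddings: [[Float]], to url: URL) throws {
    let flat = embeddings.flatMap { $0 }
    let data = flat.withUnsafeBufferPointer { Data(buffer: $0) }
    try data.write(to: url, options: .atomic)
}

/// Read the raw buffer back and regroup it into one vector per image.
func loadEmbeddings(from url: URL, dimension: Int) throws -> [[Float]] {
    let data = try Data(contentsOf: url)
    let flat: [Float] = data.withUnsafeBytes { Array($0.bindMemory(to: Float.self)) }
    return stride(from: 0, to: flat.count, by: dimension).map {
        Array(flat[$0..<Swift.min($0 + dimension, flat.count)])
    }
}
```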