
matsui528 / pqtable

License: MIT
Fast search algorithm for product-quantized codes via hash-tables

Programming Languages

C++
Python
CMake
Shell
C

Projects that are alternatives of or similar to pqtable

quick-adc
Quick ADC
Stars: ✭ 20 (-58.33%)
Mutual labels:  nearest-neighbor-search, product-quantization
Scanns
A scalable nearest neighbor search library in Apache Spark
Stars: ✭ 190 (+295.83%)
Mutual labels:  nearest-neighbor-search
Fast Near Duplicate Image Search
Fast Near-Duplicate Image Search and Delete using pHash, t-SNE and KDTree.
Stars: ✭ 54 (+12.5%)
Mutual labels:  nearest-neighbor-search
Elastiknn
Elasticsearch plugin for nearest neighbor search. Store vectors and run similarity search using exact and approximate algorithms.
Stars: ✭ 139 (+189.58%)
Mutual labels:  nearest-neighbor-search
Ggnn
GGNN: State of the Art Graph-based GPU Nearest Neighbor Search
Stars: ✭ 63 (+31.25%)
Mutual labels:  nearest-neighbor-search
Pgann
Fast Approximate Nearest Neighbor (ANN) searches with a PostgreSQL database.
Stars: ✭ 156 (+225%)
Mutual labels:  nearest-neighbor-search
Falconn
FAst Lookups of Cosine and Other Nearest Neighbors (based on fast locality-sensitive hashing)
Stars: ✭ 919 (+1814.58%)
Mutual labels:  nearest-neighbor-search
pqlite
⚡ A fast embedded library for approximate nearest neighbor search
Stars: ✭ 141 (+193.75%)
Mutual labels:  product-quantization
Tarsoslsh
A Java library implementing practical nearest neighbour search algorithm for multidimensional vectors that operates in sublinear time. It implements Locality-sensitive Hashing (LSH) and multi index hashing for hamming space.
Stars: ✭ 179 (+272.92%)
Mutual labels:  nearest-neighbor-search
Knn Matting
Source Code for KNN Matting, CVPR 2012 / TPAMI 2013. MATLAB code ready to run. Simple and robust implementation under 40 lines.
Stars: ✭ 130 (+170.83%)
Mutual labels:  nearest-neighbor-search
Neighbor
Nearest neighbor search for Rails and Postgres
Stars: ✭ 114 (+137.5%)
Mutual labels:  nearest-neighbor-search
Annoy
Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk
Stars: ✭ 9,262 (+19195.83%)
Mutual labels:  nearest-neighbor-search
Vald
Vald. A Highly Scalable Distributed Vector Search Engine
Stars: ✭ 158 (+229.17%)
Mutual labels:  nearest-neighbor-search
Awesome Cbir Papers
📝Awesome and classical image retrieval papers
Stars: ✭ 1,114 (+2220.83%)
Mutual labels:  nearest-neighbor-search
Mrpt
Fast and lightweight header-only C++ library (with Python bindings) for approximate nearest neighbor search
Stars: ✭ 210 (+337.5%)
Mutual labels:  nearest-neighbor-search
Deep Mihash
Code for papers "Hashing with Mutual Information" (TPAMI 2019) and "Hashing with Binary Matrix Pursuit" (ECCV 2018)
Stars: ✭ 13 (-72.92%)
Mutual labels:  nearest-neighbor-search
Nanopq
Pure python implementation of product quantization for nearest neighbor search
Stars: ✭ 145 (+202.08%)
Mutual labels:  nearest-neighbor-search
scikit-hubness
A Python package for hubness analysis and high-dimensional data mining
Stars: ✭ 41 (-14.58%)
Mutual labels:  nearest-neighbor-search
Rii
Fast and memory-efficient ANN with a subset-search functionality
Stars: ✭ 96 (+100%)
Mutual labels:  nearest-neighbor-search
Faiss tips
Some useful tips for faiss
Stars: ✭ 170 (+254.17%)
Mutual labels:  nearest-neighbor-search

PQTable

Product Quantization Table (PQTable) is a fast nearest neighbor search method for product-quantized codes that uses hash tables. PQTable achieves one of the fastest search speeds on a single CPU to date (2017) with highly efficient memory usage: 0.059 ms per query over 10^9 data points using only 5.5 GB of memory.

References:

  • Project page: http://yusukematsui.me/project/pqtable/pqtable.html

  • Y. Matsui, T. Yamasaki, and K. Aizawa, "PQTable: Nonexhaustive Fast Search for Product-Quantized Codes Using Hash Tables", IEEE Transactions on Multimedia 2018. [paper] (extended version of ICCV paper)

  • Y. Matsui, T. Yamasaki, and K. Aizawa, "PQTable: Fast Exact Asymmetric Distance Neighbor Search for Product Quantization using Hash Tables", ICCV 2015. [paper][supplementary]

Building

Requisites:

  • C++11
  • CMake
  • OpenCV 3.X
  • TCMalloc (optional, but strongly recommended). On Ubuntu, you can install TCMalloc with sudo apt-get install libgoogle-perftools-dev, then link it via the "-ltcmalloc" linker flag. This makes the search roughly 2x faster.

Build as usual with CMake:

$ git clone https://github.com/matsui528/pqtable.git
$ cd pqtable
$ mkdir build && cd build && cmake ..
$ make 

Testing

Demo using the siftsmall dataset

You can try a small demo using the siftsmall data. It does not take long. First, cd to the top directory of this project, then download the siftsmall vectors into data/.

$ bash scripts/download_siftsmall.sh 

Go to the bin directory and run the demo:

$ cd build/bin
$ ./demo_siftsmall

The program trains a product quantizer, encodes the vectors, builds a PQTable, and runs the search (a conceptual sketch of this pipeline follows the output below). The final results will be something like:

...
93th query: nearest_id=6950, dist=49826.5
94th query: nearest_id=4053, dist=59649.2
95th query: nearest_id=7271, dist=76137.4
96th query: nearest_id=7512, dist=71723.1
97th query: nearest_id=7927, dist=68105
98th query: nearest_id=6289, dist=15346.3
99th query: nearest_id=8082, dist=54443.8
0.113859 [msec/query]

Note that these results (the nearest_ids and the distances) might be slightly different from yours because the training step includes a random process.
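
For intuition, here is a self-contained sketch of the two core operations in this pipeline: PQ encoding and asymmetric distance computation (ADC). All names, sizes, and the zero-initialized codewords are assumptions made for illustration; this is not PQTable's actual API.

// Conceptual sketch of product quantization (PQ) encoding and asymmetric
// distance computation (ADC), the two operations the demo pipeline is built
// on. All names, sizes, and the zero-initialized codewords are illustrative;
// this is not PQTable's actual API.
#include <array>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <limits>
#include <vector>

constexpr int D  = 128;     // vector dimensionality (e.g., SIFT)
constexpr int M  = 4;       // number of subspaces -> 4-byte codes
constexpr int Ds = D / M;   // dimensions per subspace
constexpr int Ks = 256;     // centroids per subspace (one byte per sub-code)

// codewords[m][k] is the k-th centroid of the m-th subspace (length Ds).
using Codewords = std::vector<std::array<std::array<float, Ds>, Ks>>;

// Encode a D-dimensional vector into M bytes: nearest centroid per subspace.
std::array<uint8_t, M> encode(const Codewords &cw, const std::vector<float> &x) {
    std::array<uint8_t, M> code{};
    for (int m = 0; m < M; ++m) {
        float best = std::numeric_limits<float>::max();
        for (int k = 0; k < Ks; ++k) {
            float d = 0.0f;
            for (int j = 0; j < Ds; ++j) {
                const float diff = x[m * Ds + j] - cw[m][k][j];
                d += diff * diff;
            }
            if (d < best) { best = d; code[m] = static_cast<uint8_t>(k); }
        }
    }
    return code;
}

// Asymmetric distance between a raw query and an encoded vector. In practice
// one precomputes an M x Ks lookup table of query-to-centroid distances so
// that the distance to any code costs only M table lookups.
float adc(const Codewords &cw, const std::vector<float> &q,
          const std::array<uint8_t, M> &code) {
    float dist = 0.0f;
    for (int m = 0; m < M; ++m) {
        const auto &c = cw[m][code[m]];
        for (int j = 0; j < Ds; ++j) {
            const float diff = q[m * Ds + j] - c[j];
            dist += diff * diff;
        }
    }
    return dist;
}

int main() {
    Codewords cw(M);  // all-zero centroids, just for illustration
    std::vector<float> x(D), q(D);
    for (int i = 0; i < D; ++i) { x[i] = float(std::rand() % 100); q[i] = float(std::rand() % 100); }
    const auto code = encode(cw, x);
    std::cout << "ADC distance to query: " << adc(cw, q, code) << std::endl;
    return 0;
}

In the real pipeline the codewords are learned by k-means on training vectors (whose random initialization is why your nearest_ids and distances may differ), and the search is accelerated by the hash-table mechanism described in the papers instead of scanning every code.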

Demo using the sift1b dataset

You can try a large-scale demo using the sift1b data. This demo reproduces the results of Table 3 in the paper. First, cd to the top directory of this project, then download the sift1b vectors into data/.

$ bash scripts/download_sift1b.sh 

This would take several hours, and since the data is large, about 250 GB of disk space is required. Next, go to the bin directory and run the program that trains a product quantizer:

$ cd build/bin
$ ./demo_sift1b_train

This also takes several hours. If you do not care about the quality of the product quantizer and just want to try the demo, change top_n in demo_sift1b_train.cpp to a small number such as 10000 and compile it again. After the training finishes, the trained file codewords.txt is created in build/bin. Next, encode the input vectors into PQ codes.

$ ./demo_sift1b_encode

This would take one or two hours. It produces codes.bin, the set of PQ codes for the base vectors of the sift1b data. With M=4 each code is 4 bytes, so the 10^9 codes occupy 4 GB. Then, build a PQTable using codes.bin.

$ ./demo_sift1b_build_table

A directory named pqtable will be created, containing the codewords and the table itself.
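
Roughly speaking, the table is a hash table keyed by the PQ codes themselves: database vectors with identical codes share a bucket of IDs, and a query is answered by probing the buckets of candidate codes close to the query. The sketch below illustrates only this idea; the type names, the packed-key layout, and the build function are assumptions made for this sketch and do not reflect the library's actual classes or on-disk format.

// Illustrative sketch of a PQ-code hash table (not PQTable's actual classes
// or file format). Each database vector is represented by an M-byte PQ code;
// vectors with identical codes are grouped into one bucket keyed by the
// packed code.
#include <array>
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr int M = 4;  // 4-byte codes, as in the sift1b demo

using Code = std::array<uint8_t, M>;

// Pack an M-byte code into a single 32-bit key.
uint32_t pack(const Code &c) {
    uint32_t key = 0;
    for (int m = 0; m < M; ++m) key = (key << 8) | c[m];
    return key;
}

// key -> IDs of all database vectors that share that PQ code.
using Table = std::unordered_map<uint32_t, std::vector<uint64_t>>;

Table build_table(const std::vector<Code> &codes) {
    Table table;
    for (uint64_t id = 0; id < codes.size(); ++id) {
        table[pack(codes[id])].push_back(id);
    }
    return table;
}

At query time, candidate codes are generated in increasing distance from the query (the multi-sequence step described in the papers) and their buckets are probed until enough neighbors are collected. Using the files in the pqtable directory, you can run the search: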

$ ./demo_sift1b_search

You will have a result like this:

top_k: 1
0.0579275 [msec/query] 

This is the runtime per query for the sift1b data. You will also get score.txt, which contains the retrieved IDs. You can check the recall rate using the evaluation script:

$ cd ../..
$ python scripts/eval.py build/bin/score.txt data/gnd/idx_1000M.ivecs

This script evaluates the search result against the ground-truth annotation. The result will be:

Recall@1: 0.002

Note that you can obtain top-k results by passing an argument to the search program, e.g.,

$ ./demo_sift1b_search 100   # This creates top-100 results on score.txt
$ cd ../..
$ python scripts/eval.py build/bin/score.txt data/gnd/idx_1000M.ivecs

Then you will see Recall@1, 2, 5, ..., 100.
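
For reference, Recall@k here is the usual metric for this benchmark: the fraction of queries whose true nearest neighbor (taken from the ground-truth file) appears among the top-k retrieved IDs. The snippet below is a minimal, stand-alone illustration of that definition; it is not the implementation used by scripts/eval.py.

// Minimal illustration of Recall@k: the fraction of queries whose true
// nearest neighbor appears among the top-k retrieved IDs. This is only a
// conceptual stand-in, not the logic of scripts/eval.py.
#include <cstddef>
#include <iostream>
#include <vector>

double recall_at_k(const std::vector<std::vector<int>> &retrieved,   // per query: ranked IDs
                   const std::vector<int> &true_neighbor,            // per query: true NN ID
                   std::size_t k) {
    std::size_t hits = 0;
    for (std::size_t q = 0; q < retrieved.size(); ++q) {
        for (std::size_t i = 0; i < retrieved[q].size() && i < k; ++i) {
            if (retrieved[q][i] == true_neighbor[q]) { ++hits; break; }
        }
    }
    return static_cast<double>(hits) / static_cast<double>(retrieved.size());
}

int main() {
    // Two toy queries: the first finds its true neighbor (3) at rank 2,
    // the second never finds its true neighbor (1).
    std::vector<std::vector<int>> retrieved = {{7, 3, 9}, {4, 8, 2}};
    std::vector<int> true_neighbor = {3, 1};
    std::cout << "Recall@2: " << recall_at_k(retrieved, true_neighbor, 2) << std::endl;  // 0.5
    return 0;
}

With this definition, a larger k can only increase the recall, which is why the script reports a non-decreasing sequence Recall@1, 2, 5, ..., 100.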
