
yahoo / Lopq

License: apache-2.0
Training of Locally Optimized Product Quantization (LOPQ) models for approximate nearest neighbor search of high dimensional data in Python and Spark.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Lopq

graphgrove
A framework for building (and incrementally growing) graph-based data structures used in hierarchical or DAG-structured clustering and nearest neighbor search
Stars: ✭ 29 (-94.53%)
Mutual labels:  clustering, nearest-neighbor-search
Scanns
A scalable nearest neighbor search library in Apache Spark
Stars: ✭ 190 (-64.15%)
Mutual labels:  nearest-neighbor-search, spark
Smile
Statistical Machine Intelligence & Learning Engine
Stars: ✭ 5,412 (+921.13%)
Mutual labels:  nearest-neighbor-search, clustering
Elasticluster
Create clusters of VMs on the cloud and configure them with Ansible.
Stars: ✭ 298 (-43.77%)
Mutual labels:  spark, clustering
Hazelcast
Open-source distributed computation and storage platform
Stars: ✭ 4,662 (+779.62%)
Mutual labels:  clustering
God Of Bigdata
Focused on big data study and interview preparation; the road to big data mastery starts here. Flink/Spark/Hadoop/Hbase/Hive...
Stars: ✭ 6,008 (+1033.58%)
Mutual labels:  spark
Yanagishima
Web UI for Trino, Presto, Hive, Elasticsearch, SparkSQL
Stars: ✭ 424 (-20%)
Mutual labels:  spark
Dji Firmware Tools
Tools for handling firmware of DJI products, with a focus on quadcopters.
Stars: ✭ 424 (-20%)
Mutual labels:  spark
Cdap
An open source framework for building data analytic applications.
Stars: ✭ 509 (-3.96%)
Mutual labels:  spark
Tensorflow Book
Accompanying source code for Machine Learning with TensorFlow. Refer to the book for step-by-step explanations.
Stars: ✭ 4,448 (+739.25%)
Mutual labels:  clustering
Lidar for ad references
A list of references on lidar point cloud processing for autonomous driving
Stars: ✭ 456 (-13.96%)
Mutual labels:  clustering
Bigdataie
A curated collection of big data blogs, written-test questions, tutorials, projects, and interview experiences.
Stars: ✭ 445 (-16.04%)
Mutual labels:  spark
Spark
Cross-platform real-time collaboration client optimized for business and organizations.
Stars: ✭ 471 (-11.13%)
Mutual labels:  spark
High Performance Spark Examples
Examples for High Performance Spark
Stars: ✭ 436 (-17.74%)
Mutual labels:  spark
Dtaidistance
Time series distances: Dynamic Time Warping (DTW)
Stars: ✭ 499 (-5.85%)
Mutual labels:  clustering
Shaman
Small, lightweight, api-driven dns server.
Stars: ✭ 426 (-19.62%)
Mutual labels:  clustering
Bdp Dataplatform
A data platform for the big data ecosystem: a big data solution covering data platforms, microservices, machine learning, e-commerce, automated operations, DevOps, container deployment platforms, and data platform ingestion, storage, computation, development, and application building.
Stars: ✭ 456 (-13.96%)
Mutual labels:  spark
Pointblank
Data validation and organization of metadata for data frames and database tables
Stars: ✭ 480 (-9.43%)
Mutual labels:  spark
React Native Map Clustering
React Native map clustering both for Android and iOS.
Stars: ✭ 450 (-15.09%)
Mutual labels:  clustering
Data Science Ipython Notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Stars: ✭ 22,048 (+4060%)
Mutual labels:  spark


Locally Optimized Product Quantization

This is Python training and testing code for Locally Optimized Product Quantization (LOPQ) models, as well as Spark scripts to scale training to hundreds of millions of vectors. The resulting model can be used in Python with the code provided here, or deployed via a Protobuf format to, e.g., search backends for high-performance approximate nearest neighbor search.

Overview

Locally Optimized Product Quantization (LOPQ) [1] is a hierarchical quantization algorithm that produces codes of configurable length for data points. These codes are efficient representations of the original vector and can be used in a variety of ways depending on the application, including as locality-preserving hashes, as compressed vectors from which approximations of the original vectors can be reconstructed, and as representations from which to compute an approximation of the Euclidean distance between points.

Conceptually, the LOPQ quantization process can be broken into four phases. The training process fits the parameters of each phase to the data in the same order.

  1. The raw data vector is PCA'd to D dimensions (possibly the original dimensionality). This allows subsequent quantization to more efficiently represent the variation present in the data.
  2. The PCA'd data is then product quantized [2] by two k-means quantizers. This means that each vector is split into two subvectors, each of dimension D / 2, and each of the two subspaces is quantized independently with a vocabulary of size V. Since the two quantizations occur independently, the dimensions of the vectors are permuted such that the total variance in each of the two subspaces is approximately equal, which allows the two vocabularies to be equally important in terms of capturing the total variance of the data. This results in a pair of cluster ids that we refer to as "coarse codes" (see the sketch after this list).
  3. The residuals of the data after coarse quantization are computed. The residuals are then locally projected independently for each coarse cluster. This projection is another application of PCA and dimension permutation on the residuals, and it is "local" in the sense that there is a different projection for each cluster in each of the two coarse vocabularies. These local rotations make the next and final step, another application of product quantization, very efficient in capturing the variance of the residuals.
  4. The locally projected data is then product quantized a final time by M subquantizers, resulting in M "fine codes". Usually the vocabulary for each of these subquantizers will be a power of 2 for effective storage in a search index. With vocabularies of size 256, the fine codes for each indexed vector will require M bytes to store in the index.
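
To make phase 2 concrete, here is a toy sketch of coarse product quantization with two independent k-means vocabularies. It uses scikit-learn for illustration only; the variable names are hypothetical and this is not the lopq module's internal implementation.

import numpy as np
from sklearn.cluster import KMeans

# Toy data: N vectors of (even) dimension D, quantized with a
# vocabulary of size V in each of the two subspaces.
N, D, V = 1000, 8, 4
rng = np.random.RandomState(0)
data = rng.randn(N, D)

# Split every vector in half and fit an independent k-means
# vocabulary on each subspace.
left, right = data[:, :D // 2], data[:, D // 2:]
km_left = KMeans(n_clusters=V, n_init=10, random_state=0).fit(left)
km_right = KMeans(n_clusters=V, n_init=10, random_state=0).fit(right)

# The pair of cluster ids assigned to a vector is its "coarse code".
x = data[0]
coarse_code = (
    int(km_left.predict(x[None, :D // 2])[0]),
    int(km_right.predict(x[None, D // 2:])[0]),
)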

The final LOPQ code for a vector is a (coarse codes, fine codes) pair, e.g. ((3, 2), (14, 164, 83, 49, 185, 29, 196, 250)).

Nearest Neighbor Search

A nearest neighbor index can be built from these LOPQ codes by indexing each document into its corresponding coarse code bucket. That is, each pair of coarse codes (which we refer to as a "cell") will index a bucket of the vectors quantizing to that cell.
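
The lopq module provides a LOPQSearcher class for this purpose; purely as a structural illustration, a cell-to-bucket index can be as simple as a dictionary. The names below are hypothetical.

from collections import defaultdict

# Inverted index: each cell, i.e. pair of coarse codes, maps to a
# bucket of (item_id, fine_codes) entries.
index = defaultdict(list)

def add_item(item_id, lopq_code):
    # lopq_code is a (coarse_codes, fine_codes) pair, e.g.
    # ((3, 2), (14, 164, 83, 49, 185, 29, 196, 250)).
    coarse_codes, fine_codes = lopq_code
    index[coarse_codes].append((item_id, fine_codes))

add_item(42, ((3, 2), (14, 164, 83, 49, 185, 29, 196, 250)))
bucket = index[(3, 2)]  # all items quantizing to cell (3, 2)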

At query time, an incoming query vector undergoes substantially the same process. First, the query is split into coarse subvectors and the distance to each coarse centroid is computed. These distances can be used to efficiently compute a priority-ordered sequence of cells [3] such that cells later in the sequence are less likely to have near neighbors of the query than earlier cells. The items in cell buckets are retrieved in this order until some desired quota has been met.

After this retrieval phase, the fine codes are used to rank by approximate Euclidean distance. The query is projected into each local space and the distance to each indexed item is estimated as the sum of the squared distances of the query subvectors to the corresponding subquantizer centroids indexed by the fine codes.
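
In code, this estimate is a sum of table lookups, one per subquantizer. The sketch below is illustrative only and assumes a hypothetical layout for the local subquantizer centroids:

import numpy as np

def approx_squared_distance(query_subvectors, fine_codes, centroids):
    # query_subvectors: list of M arrays, the locally projected query
    #   split into M subvectors.
    # fine_codes: the M centroid ids stored in the index for the item.
    # centroids: list of M arrays of shape (vocab_size, subdim), the
    #   centroids of each fine subquantizer (hypothetical layout).
    return sum(
        float(np.sum((q - centroids[m][c]) ** 2))
        for m, (q, c) in enumerate(zip(query_subvectors, fine_codes))
    )

In practice, the squared distance from each query subvector to every centroid of its subquantizer is precomputed once per query, so ranking each retrieved item costs only M table lookups and additions.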

NN search with LOPQ is highly scalable and has excellent properties in terms of both index storage requirements and query-time latencies when implemented well.

References

More information and performance benchmarks can be found at http://image.ntua.gr/iva/research/lopq/.

  1. Y. Kalantidis, Y. Avrithis. Locally Optimized Product Quantization for Approximate Nearest Neighbor Search. CVPR 2014.
  2. H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. PAMI, 33(1), 2011.
  3. A. Babenko and V. Lempitsky. The inverted multi-index. CVPR 2012.

Python

Full LOPQ training and evaluation is implemented in the lopq python module. Please refer to the README in python/ for more detail.
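
For orientation, end-to-end usage looks roughly like the following sketch, which assumes the LOPQModel and LOPQSearcher classes described in that README; consult it for the authoritative API and parameters.

import numpy as np
from lopq import LOPQModel, LOPQSearcher

# Toy data: rows are the vectors to index (sizes are illustrative).
data = np.random.randn(10000, 128)
x = data[0]

# Define a model with V coarse clusters per subspace and M fine
# subquantizers, and fit it to the data.
model = LOPQModel(V=8, M=4)
model.fit(data)

# Compute the (coarse codes, fine codes) pair for a vector.
code = model.predict(x)

# Index the data and retrieve ranked approximate nearest neighbors.
searcher = LOPQSearcher(model)
searcher.add_data(data)
nns = searcher.search(x, quota=100)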

Spark

The training algorithm is also implemented on Spark using pyspark to scale parameter fitting to large datasets. Please refer to the README in spark/ for documentation and usage information.

Running Tests

Tests can be run during development by running:

cd python/
bash test.sh

To run tests in a virtual environment, this project uses tox. Tox can be installed with pip install tox and run from the python/ directory:

cd python/
tox

License

Code licensed under the Apache License, Version 2.0. See the LICENSE file for terms.
