mskimm / spark-annoy

License: Apache-2.0
Building Annoy Index on Apache Spark

Programming Languages

scala
python
java
shell

Projects that are alternatives of or similar to spark-annoy

Numpy Ml
Machine learning, in numpy
Stars: ✭ 11,100 (+15105.48%)
Mutual labels:  knn
pgvector
Open-source vector similarity search for Postgres
Stars: ✭ 482 (+560.27%)
Mutual labels:  approximate-nearest-neighbor-search
Portrait FCN and 3D Reconstruction
This project is to convert PortraitFCN+ (by Xiaoyong Shen) from Matlab to Tensorflow, then refine the outputs from it (converted to a trimap) using KNN and ResNet, supervised by Richard Berwick.
Stars: ✭ 61 (-16.44%)
Mutual labels:  knn
lshensemble
LSH index for approximate set containment search
Stars: ✭ 48 (-34.25%)
Mutual labels:  approximate-nearest-neighbor-search
elasticsearch-approximate-nearest-neighbor
Plugin to integrate approximate nearest neighbor(ANN) search with Elasticsearch
Stars: ✭ 53 (-27.4%)
Mutual labels:  approximate-nearest-neighbor-search
Annoy
Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk
Stars: ✭ 9,262 (+12587.67%)
Mutual labels:  approximate-nearest-neighbor-search
keras-knn
Code for the blog post Nearest Neighbors with Keras and CoreML
Stars: ✭ 25 (-65.75%)
Mutual labels:  knn
paccmann kinase binding residues
Comparison of active site and full kinase sequences for drug-target affinity prediction and molecular generation. Full paper: https://pubs.acs.org/doi/10.1021/acs.jcim.1c00889
Stars: ✭ 29 (-60.27%)
Mutual labels:  knn
adventures-with-ann
All the code for a series of Medium articles on Approximate Nearest Neighbors
Stars: ✭ 40 (-45.21%)
Mutual labels:  approximate-nearest-neighbor-search
scikit-hubness
A Python package for hubness analysis and high-dimensional data mining
Stars: ✭ 41 (-43.84%)
Mutual labels:  approximate-nearest-neighbor-search
product-quantization
🙃Implementation of vector quantization algorithms, codes for Norm-Explicit Quantization: Improving Vector Quantization for Maximum Inner Product Search.
Stars: ✭ 40 (-45.21%)
Mutual labels:  approximate-nearest-neighbor-search
NearestNeighborDescent.jl
Efficient approximate k-nearest neighbors graph construction and search in Julia
Stars: ✭ 34 (-53.42%)
Mutual labels:  approximate-nearest-neighbor-search
drowsiness-detection
Identifies driver drowsiness from real-time camera images using image processing techniques. Drowsy-driving detection system. OpenCV
Stars: ✭ 31 (-57.53%)
Mutual labels:  knn
gongt
NGT Go client library
Stars: ✭ 29 (-60.27%)
Mutual labels:  approximate-nearest-neighbor-search
Trajectory-Analysis-and-Classification-in-Python-Pandas-and-Scikit-Learn
Formed trajectories of sets of points. Experimented on finding similarities between trajectories based on DTW (Dynamic Time Warping) and LCSS (Longest Common SubSequence) algorithms. Modeled trajectories as strings based on a Grid representation. Benchmarked KNN, Random Forest, Logistic Regression classification algorithms to classify efficiently t…
Stars: ✭ 41 (-43.84%)
Mutual labels:  knn
Machine Learning
⚡Machine Learning in Action (Python3): kNN, decision trees, Bayes, logistic regression, SVM, linear regression, tree regression
Stars: ✭ 5,601 (+7572.6%)
Mutual labels:  knn
Milvus
An open-source vector database for embedding similarity search and AI applications.
Stars: ✭ 9,015 (+12249.32%)
Mutual labels:  approximate-nearest-neighbor-search
instant-distance
Fast approximate nearest neighbor searching in Rust, based on HNSW index
Stars: ✭ 140 (+91.78%)
Mutual labels:  approximate-nearest-neighbor-search
Recommender-Systems
Implementing Content based and Collaborative filtering(with KNN, Matrix Factorization and Neural Networks) in Python
Stars: ✭ 46 (-36.99%)
Mutual labels:  knn
pqlite
⚡ A fast embedded library for approximate nearest neighbor search
Stars: ✭ 141 (+93.15%)
Mutual labels:  approximate-nearest-neighbor-search

spark-annoy (WIP)

Build an Annoy index on Apache Spark, then query neighbors using Annoy.

Note

I built an index of 117M 64-dimensional vectors on 100 nodes in 5 minutes. The settings were:

// version: 0.1.4
// spark.executor.instances = 100
// spark.executor.memory = 8g
// spark.driver.memory = 8g
val fraction = 0.00086 // for about 100k samples
val numTrees = 2
val numPartitions = 100
val annoyModel = new Annoy().setFraction(fraction).setNumTrees(numTrees).fit(dataset)
annoyModel.saveAsAnnoyBinary("/hdfs/path/to/index", numPartitions)

The resulting index is about 33 GB.
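
The sampling fraction in the settings follows from the desired sample size. The sketch below reproduces that arithmetic and also computes the space taken by the raw float vectors alone, which is a lower bound on the index size. The object and helper names (`IndexSizing`, `fractionFor`, `rawPayloadBytes`) are hypothetical and not part of this library, and "117M" is assumed to mean exactly 117,000,000 rows.

```scala
object IndexSizing {
  /** Fraction of rows to sample so that roughly `targetSamples` rows are drawn. */
  def fractionFor(totalRows: Long, targetSamples: Long): Double =
    targetSamples.toDouble / totalRows.toDouble

  /** Bytes used by the raw vectors alone (tree nodes add more on top of this). */
  def rawPayloadBytes(totalRows: Long, dim: Int): Long =
    totalRows * dim * 4L // 4 bytes per 32-bit float

  def main(args: Array[String]): Unit = {
    val rows = 117000000L // assumption: "117M" taken as exactly 117,000,000
    println(s"fraction for ~100k samples: ${fractionFor(rows, 100000L)}")
    println(s"expected samples at 0.00086: ${math.round(rows * 0.00086)}")
    println(s"raw vector payload in GB: ${rawPayloadBytes(rows, 64) / 1e9}")
  }
}
```

The raw payload comes to just under 30 GB, so a reported index size of about 33 GB leaves a few gigabytes for per-node metadata and internal tree nodes.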

Distributed Builds

import spark.implicits._

val data = spark.read.textFile("data/annoy/sample-glove-25-angular.txt")
  .map { str =>
    val Array(id, features) = str.split("\t")
    (id.toInt, features.split(",").map(_.toFloat))
  }
  .toDF("id", "features")

val ann = new Annoy()
  .setNumTrees(2)

val annModel = ann.fit(data)

annModel.saveAsAnnoyBinary("/path/to/dump/annoy-binary")
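
The example above assumes every input line has the form `id<TAB>f1,f2,...`; a line without exactly one tab makes the `val Array(id, features)` pattern throw a `MatchError`. As a minimal sketch (no Spark needed, and `GloveLineParser` / `parseLine` are hypothetical names, not part of this library), a more forgiving parser could return an `Option` so malformed lines are skipped:

```scala
import scala.util.Try

object GloveLineParser {
  /** Parses "id<TAB>f1,f2,..." into (id, features); None if the line is malformed. */
  def parseLine(line: String): Option[(Int, Array[Float])] =
    line.split("\t") match {
      case Array(id, features) =>
        // Try guards against non-numeric ids or feature values.
        Try((id.trim.toInt, features.split(",").map(_.trim.toFloat))).toOption
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    val lines = Seq("1\t0.5,-0.25,1.0", "broken line", "2\t0.1,0.2,0.3")
    val rows = lines.flatMap(l => parseLine(l)) // drops the malformed line
    rows.foreach { case (id, v) => println(s"$id -> ${v.mkString("[", ", ", "]")}") }
  }
}
```

In a Spark job the same function could be used with `flatMap` in place of the `map` step, at the cost of silently dropping bad rows.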

Dependency

Since version 0.1.2, releases are published to Maven.

libraryDependencies += "com.github.mskimm" %% "ann4s" % "0.1.5"
  • 0.1.5 is built with Apache Spark 2.3.0
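
For context, a minimal `build.sbt` sketch around that dependency might look like the following. The Scala version and the `provided` Spark modules are assumptions (Spark 2.3.0 is published for Scala 2.11); adjust them to match your cluster.

```scala
// build.sbt (sketch; everything except ann4s 0.1.5 and Spark 2.3.0 is an assumption)
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // Spark is usually supplied by the cluster at runtime, hence "provided"
  "org.apache.spark" %% "spark-sql"   % "2.3.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.3.0" % "provided",
  "com.github.mskimm" %% "ann4s"      % "0.1.5"
)
```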

How does it work?

  1. Build a parent tree from sampled data on the Spark master.
  2. Group all data by the parent-tree leaf node each item falls into, on the Spark nodes.
  3. Build a subtree from each group on its Spark node.
  4. Aggregate all subtree nodes into the parent tree on the Spark master.
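
The four steps can be illustrated with a small in-memory sketch. This is not the library's implementation: it runs sequentially on plain Scala collections instead of Spark, builds a single tree rather than a forest, and splits with a simple midpoint hyperplane between two random points rather than Annoy's actual split heuristic. All names (`TwoLevelTreeSketch`, `build`, `route`, `graft`, `buildIndex`) are hypothetical.

```scala
import scala.util.Random

object TwoLevelTreeSketch {
  type Vec = Array[Float]
  final case class Item(id: Int, v: Vec)

  sealed trait Node
  final case class Leaf(leafId: Int, items: Seq[Item]) extends Node
  final case class Split(normal: Vec, offset: Float, left: Node, right: Node) extends Node

  private def dot(a: Vec, b: Vec): Float = a.indices.map(i => a(i) * b(i)).sum

  /** True if `v` lies on the right side of the hyperplane. */
  private def side(normal: Vec, offset: Float, v: Vec): Boolean =
    dot(normal, v) + offset >= 0f

  /** Recursively splits items with a hyperplane midway between two random items. */
  def build(items: Seq[Item], leafSize: Int, rnd: Random, nextLeafId: () => Int): Node = {
    require(leafSize >= 1, "leafSize must be at least 1")
    if (items.size <= leafSize) Leaf(nextLeafId(), items)
    else {
      val i = rnd.nextInt(items.size)
      val j = (i + 1 + rnd.nextInt(items.size - 1)) % items.size // always differs from i
      val (a, b) = (items(i).v, items(j).v)
      val normal = a.indices.map(k => a(k) - b(k)).toArray
      val offset = -a.indices.map(k => normal(k) * (a(k) + b(k)) / 2f).sum
      val (right, left) = items.partition(it => side(normal, offset, it.v))
      // Degenerate split (e.g. duplicate vectors): stop instead of recursing forever.
      if (left.isEmpty || right.isEmpty) Leaf(nextLeafId(), items)
      else Split(normal, offset,
        build(left, leafSize, rnd, nextLeafId),
        build(right, leafSize, rnd, nextLeafId))
    }
  }

  /** Walks from the root to the leaf that `v` falls into and returns its id. */
  def route(node: Node, v: Vec): Int = node match {
    case Leaf(id, _)       => id
    case Split(n, o, l, r) => route(if (side(n, o, v)) r else l, v)
  }

  /** Replaces each parent leaf with the subtree built from the items routed to it. */
  def graft(parent: Node, subtrees: Map[Int, Node]): Node = parent match {
    case Leaf(id, _)       => subtrees.getOrElse(id, Leaf(id, Seq.empty))
    case Split(n, o, l, r) => Split(n, o, graft(l, subtrees), graft(r, subtrees))
  }

  /** Collects all leaves of a tree, left to right. */
  def leaves(node: Node): Seq[Leaf] = node match {
    case l: Leaf           => Seq(l)
    case Split(_, _, l, r) => leaves(l) ++ leaves(r)
  }

  def buildIndex(data: Seq[Item], fraction: Double, leafSize: Int, seed: Long): Node = {
    val rnd = new Random(seed)
    var counter = -1
    val nextId = () => { counter += 1; counter }
    // 1. Parent tree from a sample (done on the master in the real job).
    val sample = data.filter(_ => rnd.nextDouble() < fraction)
    val parent = build(sample, leafSize, rnd, nextId)
    // 2. Group all data by the parent leaf each item falls into.
    val groups = data.groupBy(it => route(parent, it.v))
    // 3. One subtree per group (done in parallel on the nodes in the real job).
    val subtrees = groups.map { case (leafId, items) =>
      leafId -> build(items, leafSize, rnd, nextId)
    }
    // 4. Attach the subtrees to the parent tree.
    graft(parent, subtrees)
  }

  def main(args: Array[String]): Unit = {
    val rnd = new Random(1)
    val data = (0 until 1000).map(i => Item(i, Array.fill(8)(rnd.nextFloat())))
    val index = buildIndex(data, fraction = 0.05, leafSize = 16, seed = 42L)
    println(s"leaves: ${leaves(index).size}, items: ${leaves(index).map(_.items.size).sum}")
  }
}
```

The point of the two-level design is that only the small sample has to fit on one machine; the full data set is touched only in steps 2 and 3, which partition naturally by parent leaf.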

Use Case

Index ALS User/Item Factors

  • src/test/scala/ann4s/spark/example/ALSBasedUserItemIndexing.scala
...
val training: DataFrame = ??? // placeholder: supply your ratings DataFrame here
val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")

val model = als.fit(training)

val ann = new Annoy()
  .setNumTrees(2)
  .setFraction(0.1)
  .setIdCol("id")
  .setFeaturesCol("features")

val userAnnModel = ann.fit(model.userFactors)
userAnnModel.writeAnnoyBinary("exp/als/user_factors.ann")

val itemAnnModel = ann.fit(model.itemFactors)
itemAnnModel.writeAnnoyBinary("exp/als/item_factors.ann")
...

Comment

I started this project to study Scala. I found that Annoy is a fairly good library for nearest-neighbor search and that a distributed version of it can be implemented on Apache Spark. Recently, various bindings and implementations have been actively developed. In particular, the purpose and usability of this project overlap with projects such as annoy4s and annoy-java, which also run on the JVM.

To keep contributing something useful, this project now focuses on distributed index builds on Apache Spark. The goal is to support building indexes over 1 billion or more items and writing Annoy-compatible binaries.
