
ekzhu / lshensemble

Licence: MIT license
LSH index for approximate set containment search



LSH Ensemble


Documentation

Please cite this paper if you use this library in your work:

Erkang Zhu, Fatemeh Nargesian, Ken Q. Pu, Renée J. Miller: LSH Ensemble: Internet-Scale Domain Search. PVLDB 9(12): 1185-1196 (2016)

Presentation slides @ VLDB 2016, New Delhi.

Datasets

We used two datasets for evaluation. Both are publicly available and can be downloaded directly from the original publishers.

By using these datasets you agree to use them for academic research purposes only; we shall not be held responsible for any inaccuracy or error that may exist in the datasets, nor shall we be responsible for any consequence of their usage.

Quick Start Guide

Install this library by running:

go get github.com/ekzhu/lshensemble

Import the library in your code:

import (
	"github.com/ekzhu/lshensemble"
)

First you need to obtain the domains, and each domain should have a string ID. For simplicity we represent a domain as map[string]bool, whose keys are distinct values. Assuming you have obtained the domains from some dataset, you can generate the MinHash signatures from the domains.

var domains []map[string]bool
var keys []string // each key corresponds to the domain at the same index

// ... 
// obtaining domains and keys
// ...

// initializing the domain records to hold the MinHash signatures
domainRecords := make([]*lshensemble.DomainRecord, len(domains))

// set the minhash seed
seed := 42

// set the number of hash functions
numHash := 256

// create the domain records with the signatures
for i := range domains {
	mh := lshensemble.NewMinhash(seed, numHash)
	for v := range domains[i] {
		mh.Push([]byte(v))
	}
	domainRecords[i] = &lshensemble.DomainRecord{
		Key:       keys[i],
		Size:      len(domains[i]),
		Signature: mh.Signature(),
	}
}
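The domains themselves can come from any dataset; for example, one way to build the map[string]bool representation is to deduplicate a column of raw values. A minimal, self-contained sketch (the column data here is hypothetical):

```go
package main

import "fmt"

// toDomain deduplicates a column of raw string values into a set,
// the map[string]bool representation used above.
func toDomain(values []string) map[string]bool {
	domain := make(map[string]bool, len(values))
	for _, v := range values {
		domain[v] = true
	}
	return domain
}

func main() {
	column := []string{"Toronto", "Ottawa", "Toronto", "Vancouver"}
	domain := toDomain(column)
	fmt.Println(len(domain)) // 3 distinct values
}
```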

Before you can index the domains, you need to sort them in increasing order by their sizes. The BySize wrapper allows the domains to be sorted using the built-in sort package.

sort.Sort(lshensemble.BySize(domainRecords))

Now you can use BootstrapLshEnsembleOptimal or BootstrapLshEnsembleEquiDepth (or BootstrapLshEnsemblePlusOptimal/BootstrapLshEnsemblePlusEquiDepth for better accuracy at a higher memory cost*) to create an LSH Ensemble index. BootstrapLshEnsembleOptimal uses dynamic programming to create partitions that are optimal in the sense that the total number of false positives generated from all the partitions is minimized. This method can be a bit slower due to the dynamic programming overhead; however, it creates optimized partitions for any data distribution. BootstrapLshEnsembleEquiDepth uses simple equi-depth partitioning -- the same number of domains in every partition. This method is described in the original paper as suitable for power-law distributed domain sizes, which are common in real-world data. You need to specify the number of partitions to use and some other parameters. The LSH parameter K (number of hash functions per band) is tuned dynamically at query time, but its maximum value must be specified here.

* See the explanation section below for the reason behind the "Plus" versions.

// Set the number of partitions
numPart := 8

// Set the maximum value for the MinHash LSH parameter K 
// (number of hash functions per band).
maxK := 4

// Create index using equi-depth partitioning
// You can also use BootstrapLshEnsemblePlusEquiDepth for better accuracy
index_eqd, err := lshensemble.BootstrapLshEnsembleEquiDepth(numPart, numHash, maxK, 
    len(domainRecords), lshensemble.Recs2Chan(domainRecords))
if err != nil {
	panic(err)
}

// Create index using optimal partitioning
// You can also use BootstrapLshEnsemblePlusOptimal for better accuracy
index_opt, err := lshensemble.BootstrapLshEnsembleOptimal(numPart, numHash, maxK,
    func () <-chan *lshensemble.DomainRecord { 
        return lshensemble.Recs2Chan(domainRecords)
    })
if err != nil {
	panic(err)
}

For better memory efficiency when the number of domains is large, it is wiser to use Go channels and goroutines to pipeline the generation of the signatures, and then use disk-based sorting to sort the domain records. This is why BootstrapLshEnsembleEquiDepth accepts a channel of *DomainRecord as input. For a small number of domains, you can simply use Recs2Chan to convert the sorted slice of *DomainRecord into a chan *DomainRecord. To help serialize the domain records to disk, you can use SerializeSignature to serialize the signatures; you need to come up with your own serialization scheme for the keys and sizes.
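The pipelining pattern can be sketched without the library itself; in the self-contained sketch below, a local record type stands in for *lshensemble.DomainRecord, and the signature computation (shown earlier) is elided:

```go
package main

import (
	"fmt"
	"sync"
)

// record stands in for *lshensemble.DomainRecord in this sketch.
type record struct {
	Key  string
	Size int
}

// genRecords streams records through a channel so that signatures can be
// computed as domains are read, instead of materializing everything first.
func genRecords(domains map[string][]string) <-chan *record {
	out := make(chan *record)
	var wg sync.WaitGroup
	for key, values := range domains {
		wg.Add(1)
		go func(key string, values []string) {
			defer wg.Done()
			set := make(map[string]bool)
			for _, v := range values {
				set[v] = true
			}
			// a real pipeline would also compute the MinHash signature here
			out <- &record{Key: key, Size: len(set)}
		}(key, values)
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	domains := map[string][]string{
		"cities":  {"Toronto", "Ottawa", "Toronto"},
		"colours": {"red", "green"},
	}
	n := 0
	for rec := range genRecords(domains) {
		fmt.Println(rec.Key, rec.Size)
		n++
	}
	fmt.Println("records:", n)
}
```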

Lastly, you can use the Query function, which returns a Go channel of the keys of the candidate domains. The candidates may contain false positives -- domains that do not meet the containment threshold -- so you can optionally include a post-processing step that removes them using the original domain values.

// pick a domain to use as the query
querySig := domainRecords[0].Signature
querySize := domainRecords[0].Size

// set the containment threshold
threshold := 0.5

// get the keys of the candidate domains (may contain false positives)
// through a channel with option to cancel early.
done := make(chan struct{})
defer close(done) // Important!!
results := index_eqd.Query(querySig, querySize, threshold, done) // index_opt works the same way

for key := range results {
	// ...
	// You may want to include a post-processing step here to remove 
	// false positive domains using the actual domain values.
	// ...
	// You can break here to stop processing the remaining results.
}
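Such a post-processing step just computes the exact containment of the query in each candidate and drops the candidates below the threshold. A minimal sketch, assuming you kept the original domain sets around (the helper name is ours, not part of the library):

```go
package main

import "fmt"

// exactContainment returns |query ∩ candidate| / |query|,
// the containment of the query set in the candidate set.
func exactContainment(query, candidate map[string]bool) float64 {
	if len(query) == 0 {
		return 0
	}
	overlap := 0
	for v := range query {
		if candidate[v] {
			overlap++
		}
	}
	return float64(overlap) / float64(len(query))
}

func main() {
	query := map[string]bool{"a": true, "b": true}
	candidate := map[string]bool{"a": true, "c": true}
	fmt.Println(exactContainment(query, candidate)) // 0.5
}
```

A candidate key would be kept only if exactContainment(query, candidate) >= threshold.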

Run Canadian Open Data Benchmark

First, download the Canadian Open Data domains, then extract the domain files into a directory called _cod_domains by running the following command.

tar xzf canadian_open_data_tabular_domains_only.tar.gz

Use Go's test tool to start the benchmark:

go test -bench=Benchmark_CanadianOpenData -timeout=24h

The benchmark proceeds in the following order:

  1. Read the domain files into memory
  2. Run Linear Scan to get the ground truth
  3. Run LSH Ensemble to get the query results
  4. Run the accuracy analysis to generate a report on precisions and recalls
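The accuracy analysis in step 4 boils down to comparing each LSH Ensemble result set against the linear-scan ground truth. A minimal sketch of the per-query precision and recall computation (the function name is ours):

```go
package main

import "fmt"

// precisionRecall compares a query's result set against its ground truth set.
func precisionRecall(results, truth map[string]bool) (precision, recall float64) {
	tp := 0 // true positives: results that are also in the ground truth
	for key := range results {
		if truth[key] {
			tp++
		}
	}
	if len(results) > 0 {
		precision = float64(tp) / float64(len(results))
	}
	if len(truth) > 0 {
		recall = float64(tp) / float64(len(truth))
	}
	return
}

func main() {
	results := map[string]bool{"d1": true, "d2": true, "d3": true, "d4": true}
	truth := map[string]bool{"d1": true, "d2": true}
	p, r := precisionRecall(results, truth)
	fmt.Println(p, r) // 0.5 1
}
```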

Explanation for the Parameter MaxK and Bootstrap Options

MinHash LSH has two parameters K and L (in the paper I used r and b respectively). L is the number of "bands" and K is the number of hash functions per band. The details about the two parameters can be found in Chapter 3 of the textbook, Mining of Massive Datasets.

In LSH Ensemble, we want to allow the K and L of the LSH index in every partition to vary at query time, in order to optimize them for any given query (see Section 5.5 of the paper). We can achieve this by using multiple MinHash LSH indexes, one for each value of K. This allows us to vary the parameters K and L in the following space:

K * L <= number of hash functions (let this be H)
1 <= K <= H

However, this means that for every possible value of K from 1 to H we need to create a MinHash LSH index -- very expensive. So it is not wise to allow K to vary from 1 to H; that is why we have a MaxK parameter, which bounds K and saves memory. The new parameter space is:

K * L <= H
1 <= K <= MaxK

Note that no such bound is needed for L, because we can simply choose how many "bands" to use at query time.

Now, if we use an LSH Forest, we can vary K from 1 to MaxK at query time with just one LSH index. You can read the paper to understand how this is done (hint: prefix tree). This comes at a price -- the parameter space is more restricted:

MaxK * L <= H
1 <= K <= MaxK

Essentially, we have less freedom in varying L, since 1 <= L <= H / MaxK based on the above constraints.
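The difference between the two parameter spaces can be made concrete by counting the (K, L) settings each allows. A small sketch for H = 256 and MaxK = 4 (the counting helpers are ours, purely illustrative):

```go
package main

import "fmt"

// countVanilla counts (K, L) settings with K*L <= h and 1 <= K <= maxK,
// the space available with one MinHash LSH index per value of K.
func countVanilla(h, maxK int) int {
	n := 0
	for k := 1; k <= maxK; k++ {
		n += h / k // L can range over 1 .. h/k
	}
	return n
}

// countForest counts settings with maxK*L <= h and 1 <= K <= maxK,
// the more restricted space of a single LSH Forest index.
func countForest(h, maxK int) int {
	return maxK * (h / maxK) // L can range over 1 .. h/maxK for every K
}

func main() {
	h, maxK := 256, 4
	fmt.Println(countVanilla(h, maxK)) // 533 settings (256+128+85+64)
	fmt.Println(countForest(h, maxK))  // 256 settings
}
```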

In this library, we provide both implementations of LSH Ensemble (LSH Forest and "vanilla" MinHash LSH). Specifically,

  • BootstrapLshEnsembleEquiDepth and BootstrapLshEnsembleOptimal build the index using the LSH Forest implementation, which uses less memory but has a more restricted parameter space for optimization.
  • BootstrapLshEnsemblePlusEquiDepth and BootstrapLshEnsemblePlusOptimal build the index using the "vanilla" MinHash LSH implementation (one LSH index for every K), which uses more memory (bounded by MaxK) but places no restriction on L.

We found that the optimal K for most queries is less than 4, so in practice you can just set MaxK to 4.
