All Projects → dynatrace-research → set-sketch-paper

dynatrace-research / set-sketch-paper

Licence: other
SetSketch: Filling the Gap between MinHash and HyperLogLog

Programming Languages

C++
36643 projects - #6 most used programming language
python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to set-sketch-paper

bagminhash
BagMinHash - Minwise Hashing Algorithm for Weighted Sets
Stars: ✭ 24 (+4.35%)
Mutual labels:  minhash, locality-sensitive-hashing, jaccard-similarity, jaccard-similarity-estimation, minwise-hashing, minwise-hashing-algorithm
HyperMinHash-java
Union, intersection, and set cardinality in loglog space
Stars: ✭ 48 (+108.7%)
Mutual labels:  minhash, hyperloglog, cardinality-estimation, hyperloglog-sketches
Datasketch
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble
Stars: ✭ 1,635 (+7008.7%)
Mutual labels:  minhash, locality-sensitive-hashing, jaccard-similarity, hyperloglog
hyperloglog-sketch-estimation-paper
Paper about the estimation of cardinalities from HyperLogLog sketches
Stars: ✭ 48 (+108.7%)
Mutual labels:  hyperloglog, sketch-data-structures, cardinality-estimation, hyperloglog-sketches
stringdistance
A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more..
Stars: ✭ 60 (+160.87%)
Mutual labels:  cosine-similarity, jaccard-similarity, jaccard
ntCard
Estimating k-mer coverage histogram of genomics data
Stars: ✭ 69 (+200%)
Mutual labels:  hyperloglog, cardinality-estimation
Text-Similarity
A text similarity computation using minhashing and Jaccard distance on reuters dataset
Stars: ✭ 15 (-34.78%)
Mutual labels:  jaccard-similarity, minhash-lsh-algorithm
mkmh
Generate kmers/minimizers/hashes/MinHash signatures, including with multiple kmer sizes.
Stars: ✭ 21 (-8.7%)
Mutual labels:  minhash, locality-sensitive-hashing
strutil
Golang metrics for calculating string similarity and other string utility functions
Stars: ✭ 114 (+395.65%)
Mutual labels:  jaccard-similarity, jaccard
tika-similarity
Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features.
Stars: ✭ 92 (+300%)
Mutual labels:  cosine-similarity, jaccard-similarity
HyperLogLog
Fast HyperLogLog for Python.
Stars: ✭ 86 (+273.91%)
Mutual labels:  hyperloglog, cardinality-estimation
sketch-search-everywhere
A Sketch plugin for searching layer and selecting it.
Stars: ✭ 53 (+130.43%)
Mutual labels:  sketch
catch-me-if-you-can
plagiarism detector
Stars: ✭ 16 (-30.43%)
Mutual labels:  minhash
uav core
The main integrator of MRS UAV packages in ROS, part of the "mrs_uav_system".
Stars: ✭ 28 (+21.74%)
Mutual labels:  estimation
Chromata
A color plugin for Sketch
Stars: ✭ 15 (-34.78%)
Mutual labels:  sketch
storybook-addons-abstract
Storybook addon for linking Abstract layer and collection shares to stories. Example:
Stars: ✭ 63 (+173.91%)
Mutual labels:  sketch
instance-locator
Sketch plugin to locate instances of a symbol
Stars: ✭ 34 (+47.83%)
Mutual labels:  sketch
finastra-design-kit
The Finastra Design Kit includes customised Design UI kit, components UI kit, icons, fonts, persona template, etc.
Stars: ✭ 23 (+0%)
Mutual labels:  sketch
boring-avatars
Boring avatars is a tiny JavaScript React library that generates custom, SVG-based avatars from any username and color palette.
Stars: ✭ 3,582 (+15473.91%)
Mutual labels:  sketch
spark-stringmetric
Spark functions to run popular phonetic and string matching algorithms
Stars: ✭ 51 (+121.74%)
Mutual labels:  jaccard-similarity

SetSketch: Filling the Gap between MinHash and HyperLogLog

This repository contains the source code to reproduce all the results and figures presented in the paper "SetSketch: Filling the Gap between MinHash and HyperLogLog" which was accepted at VLDB 2021. An extended paper version that includes mathematical proofs and additional results is available on arXiv.

Abstract

MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting different elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to continuously fill the gap between both use cases. Its commutative and idempotent insert operation and its mergeable state make it suitable for distributed environments. Fast, robust, and easy-to-implement estimators for cardinality and joint quantities, as well as the ability to use SetSketch for similarity search, enable versatile applications. The presented joint estimator can also be applied to other data structures such as MinHash, HyperLogLog, or HyperMinHash, where it even performs better than the corresponding state-of-the-art estimators in many cases.

Steps to reproduce the results and figures on Windows 10

  1. Install Windows Subsystem for Linux (WSL) with Ubuntu 20.04 LTS

  2. Install required packages:

    sudo apt update && sudo apt install gradle g++ libboost-dev python3-matplotlib python3-scipy texlive texlive-latex-extra texlive-fonts-extra
    
  3. Clone repository including submodules:

    git clone --recursive https://github.com/dynatrace-research/set-sketch-paper.git
    
  4. Switch to set-sketch-paper directory:

    cd set-sketch-paper
    
  5. Run simulations (takes several hours):

    gradle runCardinalityTest runJointTest runPerformanceTest
    
  6. Generate all figures:

    gradle pdfFigures
    
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].