dnbaker / Sketch
Labels
Projects that are alternatives of or similar to Sketch
sketch
sketch is a generic, header-only library providing implementations of a variety of sketch data structures for scalable and streaming applications.
All have been accelerated with SIMD parallelism where possible, most are composable, and many are threadsafe unless -DNOT_THREADSAFE
is passed as a compilation flag.
Contents
- HyperLogLog Implementation [hll.h]
-
hll_t
/hllbase_t<HashStruct>
- Estimates the cardinality of a set using log(log(cardinality)) bits.
- Threadsafe unless
-DNOT_THREADSAFE
is passed. - Currently,
hll
is the only structure for which python bindings are available, but we intend to extend this in the future.
-
- HyperBitBit [hbb.h]
- Better per-bit accuracy than HyperLogLogs, but, at least currently, limited to 128 bits/16 bytes in sketch size.
- Bloom Filter [bf.h]
-
bf_t
/bfbase_t<HashStruct>
- Naive bloom filter
- Currently not threadsafe.
-
- Count-Min and Count Sketches
- ccm.h (
ccmbase_t<UpdatePolicy=Increment>/ccm_t
(usepccm_t
for Approximate Counting orcs_t
for a count sketch). - The Count sketch is threadsafe if
-DNOT_THREADSAFE
is not passed or if an atomic container is used. Count-Min sketches are currently not threadsafe due to the use of minimal updates. - Count-min sketches can support concept drift if
realccm_t
from mult.h is used.
- ccm.h (
- MinHash sketches
- mh.h (
RangeMinHash
is the currently verified implementation.) We recommend you build the sketch and then convert to a linear container (e.g., astd::vector
) usingto_container<ContainerType>()
or.finalize()
for faster comparisons.- BottomKHasher is an alternate that uses more space to reduce runtime, which finalizes() into the same structure.
- CountingRangeMinHash performs the same operations as RangeMinHash, but provides multiplicities, which facilitates
histogram_similarity
, a generalization of Jaccard with multiplicities. - Both CountingRangeMinHash and RangeMinHash can be finalized into containers for fast comparisons with
.finalize()
. - A draft HyperMinHash implementation is available as well, but it has not been thoroughly vetted.
- Range MinHash implementationsare not threadsafe.
- HyperMinHash implementation is threa
- mh.h (
- B-Bit MinHash
- bbmh.h
- One-permutation (partition) bbit minhash
- Threadsafe, bit-packed and fully SIMD-accelerated
- Power of two partitions are supported in BBitMinHasher, which is finalized into a FinalBBitMinHash sketch. This is faster than the alternative.
- We also support arbitrary divisions using fastmod64 with DivBBitMinHasher and its corresponding final sketch, FinalDivBBitMinHash.
- One-permutation counting bbit minhash
- Not threadsafe.
- ModHash sketches
- mod.h
- Estimates both containment and jaccard index, but takes a (potentially) unbounded space.
- This returns a FinalRMinHash sketch, reusing the infrastructure for minhash sketches, but which calculates Jaccard index and containment knowing that it was generated via mod, not min.
- HeavyKeeper
- hk.h
- Reference: https://www.usenix.org/conference/atc18/presentation/gong
- A seemingly unilateral improvement over count-min sketches.
- One drawback is the inability to delete items, which makes it unsuitable for sliding windows.
- It shares this characteristic with the Count-Min sketch with conservative update and the Count-Min Mean sketch.
- ntcard
- mult.h
- Threadsafe
- Reference: https://www.ncbi.nlm.nih.gov/pubmed/28453674
- Not SIMD-accelerated, but also general, supporting any arbitrary coverage level
- PCSA
- pc.h
- The HLL is more performant and better-optimized, but this is included for completeness.
- Not threadsafe.
- Reference: https://dl.acm.org/doi/10.1016/0022-0000%2885%2990041-8 Journal of Computer and System Sciences. September 1985 https://doi.org/10.1016/0022-0000(85)90041-8
- SetSketch
- See setsketch.h for continuous and discretized versions of the SetSketch.
- This also includes parameter-setting code.
Test case
To build and run the hll test case:
make test && ./test
Example
To use as a header-only library:
using namespace sketch;
hll::hll_t hll(14); // Use 2**14 bytes for this structure
// Add hashed values for each element to the structure.
for(uint64_t i(0); i < 10000000ull; ++i) hll.addh(i);
fprintf(stderr, "Elements estimated: %lf. Error bounds: %lf.\n", hll.report(), hll.est_err());
The other structures work with a similar interface. See the type constructors for more information or view 10xdash for examples on using the same interface for a variety of data structures.
Simply #include sketch/<header_name>
, or, for one include #include <sketch/sketch.h>
,
which allows you to write sketch::bf_t
and sketch::hll_t
without the subnamespaces.
We use inline namespaces for individual types of sketches, e.g., sketch::minhash
or sketch::hll
can be used for clarity, or sketch::hll_t
can be used, omitting the hll
namespace.
OSX Installation
Clang on OSX may fail to compile in AVX512 mode. We recommend using homebrew's gcc:
homebrew upgrade gcc || homebrew install gcc
and either setting the environmental variables for CXX and CC to the most recent g++/gcc or providing them as Makefile arguments.
At the time of writing, this is g++-10
and gcc-10
.
Multithreading
By default, updates to the hyperloglog structure to occur using atomic operations, though threading should be handled by the calling code. Otherwise, the flag -DNOT_THREADSAFE
should be passed. The cost of this is relatively minor, but in single-threaded situations, this would be preferred.
Python bindings
Python bindings are available via pybind11. Simply cd python && python setup.py install
.