All Projects → dnbaker → Sketch

dnbaker / Sketch

Licence: mit
C++ Implementations of sketch data structures with SIMD Parallelism, including Python bindings

Projects that are alternatives of or similar to Sketch

Simde
Implementations of SIMD instruction sets for systems which don't natively support them.
Stars: ✭ 1,012 (+954.17%)
Mutual labels:  simd
Dictionary
High-performance dictionary coding
Stars: ✭ 77 (-19.79%)
Mutual labels:  simd
Maskedvbyte
Fast decoder for VByte-compressed integers
Stars: ✭ 91 (-5.21%)
Mutual labels:  simd
Unisimd Assembler
SIMD macro assembler unified for ARM, MIPS, PPC and x86
Stars: ✭ 63 (-34.37%)
Mutual labels:  simd
Md5 Simd
Accelerate aggregated MD5 hashing performance up to 8x for AVX512 and 4x for AVX2. Useful for server applications that need to compute many MD5 sums in parallel.
Stars: ✭ 71 (-26.04%)
Mutual labels:  simd
Ozz Animation
Open source c++ skeletal animation library and toolset
Stars: ✭ 1,250 (+1202.08%)
Mutual labels:  simd
App comments spider
爬取百度贴吧、TapTap、appstore、微博官方博主上的游戏评论(基于redis_scrapy),过滤器采用了bloomfilter。
Stars: ✭ 38 (-60.42%)
Mutual labels:  bloom-filter
Algorithms Study Group
Study group for algorithms in Ruby, hosted at App Academy
Stars: ✭ 94 (-2.08%)
Mutual labels:  bloom-filter
Wide
A crate to help you go wide. By which I mean use SIMD stuff.
Stars: ✭ 72 (-25%)
Mutual labels:  simd
Faster
SIMD for humans
Stars: ✭ 1,304 (+1258.33%)
Mutual labels:  simd
Pocket Tensor
Run Keras models from a C++ application on embedded devices
Stars: ✭ 65 (-32.29%)
Mutual labels:  simd
Umesimd
UME::SIMD A library for explicit simd vectorization.
Stars: ✭ 66 (-31.25%)
Mutual labels:  simd
Simd
C++ image processing and machine learning library with using of SIMD: SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, AVX-512, VMX(Altivec) and VSX(Power7), NEON for ARM.
Stars: ✭ 1,263 (+1215.63%)
Mutual labels:  simd
Str
A SIMD optimized fixed-length string class along with an adaptive hash table for fast searching
Stars: ✭ 60 (-37.5%)
Mutual labels:  simd
Amplifier.net
Amplifier allows .NET developers to easily run complex applications with intensive mathematical computation on Intel CPU/GPU, NVIDIA, AMD without writing any additional C kernel code. Write your function in .NET and Amplifier will take care of running it on your favorite hardware.
Stars: ✭ 92 (-4.17%)
Mutual labels:  simd
Mongoose
Minimalistic Vulkan engine for fast propotyping.
Stars: ✭ 41 (-57.29%)
Mutual labels:  simd
Bloom
C++ Bloom Filter Library
Stars: ✭ 77 (-19.79%)
Mutual labels:  bloom-filter
Boomfilters
Probabilistic data structures for processing continuous, unbounded streams.
Stars: ✭ 1,333 (+1288.54%)
Mutual labels:  bloom-filter
Bloomex
🌺 A pure Elixir implementation of Scalable Bloom Filters
Stars: ✭ 93 (-3.12%)
Mutual labels:  bloom-filter
Despacer
C library to remove white space from strings as fast as possible
Stars: ✭ 90 (-6.25%)
Mutual labels:  simd

sketch Build Status

sketch is a generic, header-only library providing implementations of a variety of sketch data structures for scalable and streaming applications. All have been accelerated with SIMD parallelism where possible, most are composable, and many are threadsafe unless -DNOT_THREADSAFE is passed as a compilation flag.

Contents

  1. HyperLogLog Implementation [hll.h]
    1. hll_t/hllbase_t<HashStruct>
    2. Estimates the cardinality of a set using log(log(cardinality)) bits.
    3. Threadsafe unless -DNOT_THREADSAFE is passed.
    4. Currently, hll is the only structure for which python bindings are available, but we intend to extend this in the future.
  2. HyperBitBit [hbb.h]
    1. Better per-bit accuracy than HyperLogLogs, but, at least currently, limited to 128 bits/16 bytes in sketch size.
  3. Bloom Filter [bf.h]
    1. bf_t/bfbase_t<HashStruct>
    2. Naive bloom filter
    3. Currently not threadsafe.
  4. Count-Min and Count Sketches
    1. ccm.h (ccmbase_t<UpdatePolicy=Increment>/ccm_t (use pccm_t for Approximate Counting or cs_t for a count sketch).
    2. The Count sketch is threadsafe if -DNOT_THREADSAFE is not passed or if an atomic container is used. Count-Min sketches are currently not threadsafe due to the use of minimal updates.
    3. Count-min sketches can support concept drift if realccm_t from mult.h is used.
  5. MinHash sketches
    1. mh.h (RangeMinHash is the currently verified implementation.) We recommend you build the sketch and then convert to a linear container (e.g., a std::vector) using to_container<ContainerType>() or .finalize() for faster comparisons.
      1. BottomKHasher is an alternate that uses more space to reduce runtime, which finalizes() into the same structure.
    2. CountingRangeMinHash performs the same operations as RangeMinHash, but provides multiplicities, which facilitates histogram_similarity, a generalization of Jaccard with multiplicities.
    3. Both CountingRangeMinHash and RangeMinHash can be finalized into containers for fast comparisons with .finalize().
    4. A draft HyperMinHash implementation is available as well, but it has not been thoroughly vetted.
    5. Range MinHash implementationsare not threadsafe.
    6. HyperMinHash implementation is threa
  6. B-Bit MinHash
    1. bbmh.h
    2. One-permutation (partition) bbit minhash
      1. Threadsafe, bit-packed and fully SIMD-accelerated
      2. Power of two partitions are supported in BBitMinHasher, which is finalized into a FinalBBitMinHash sketch. This is faster than the alternative.
      3. We also support arbitrary divisions using fastmod64 with DivBBitMinHasher and its corresponding final sketch, FinalDivBBitMinHash.
    3. One-permutation counting bbit minhash
      1. Not threadsafe.
  7. ModHash sketches
    1. mod.h
    2. Estimates both containment and jaccard index, but takes a (potentially) unbounded space.
    3. This returns a FinalRMinHash sketch, reusing the infrastructure for minhash sketches, but which calculates Jaccard index and containment knowing that it was generated via mod, not min.
  8. HeavyKeeper
    1. hk.h
    2. Reference: https://www.usenix.org/conference/atc18/presentation/gong
    3. A seemingly unilateral improvement over count-min sketches.
      1. One drawback is the inability to delete items, which makes it unsuitable for sliding windows.
      2. It shares this characteristic with the Count-Min sketch with conservative update and the Count-Min Mean sketch.
  9. ntcard
    1. mult.h
    2. Threadsafe
    3. Reference: https://www.ncbi.nlm.nih.gov/pubmed/28453674
    4. Not SIMD-accelerated, but also general, supporting any arbitrary coverage level
  10. PCSA
    1. pc.h
    2. The HLL is more performant and better-optimized, but this is included for completeness.
    3. Not threadsafe.
    4. Reference: https://dl.acm.org/doi/10.1016/0022-0000%2885%2990041-8 Journal of Computer and System Sciences. September 1985 https://doi.org/10.1016/0022-0000(85)90041-8
  11. SetSketch
    1. See setsketch.h for continuous and discretized versions of the SetSketch.
    2. This also includes parameter-setting code.

Test case

To build and run the hll test case:

make test && ./test

Example

To use as a header-only library:

using namespace sketch;
hll::hll_t hll(14); // Use 2**14 bytes for this structure
// Add hashed values for each element to the structure.
for(uint64_t i(0); i < 10000000ull; ++i) hll.addh(i);
fprintf(stderr, "Elements estimated: %lf. Error bounds: %lf.\n", hll.report(), hll.est_err());

The other structures work with a similar interface. See the type constructors for more information or view 10xdash for examples on using the same interface for a variety of data structures.

Simply #include sketch/<header_name>, or, for one include #include <sketch/sketch.h>, which allows you to write sketch::bf_t and sketch::hll_t without the subnamespaces.

We use inline namespaces for individual types of sketches, e.g., sketch::minhash or sketch::hll can be used for clarity, or sketch::hll_t can be used, omitting the hll namespace.

OSX Installation

Clang on OSX may fail to compile in AVX512 mode. We recommend using homebrew's gcc:

homebrew upgrade gcc || homebrew install gcc

and either setting the environmental variables for CXX and CC to the most recent g++/gcc or providing them as Makefile arguments. At the time of writing, this is g++-10 and gcc-10.

Multithreading

By default, updates to the hyperloglog structure to occur using atomic operations, though threading should be handled by the calling code. Otherwise, the flag -DNOT_THREADSAFE should be passed. The cost of this is relatively minor, but in single-threaded situations, this would be preferred.

Python bindings

Python bindings are available via pybind11. Simply cd python && python setup.py install.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].