WojciechMula / Sse Popcount

Licence: bsd-2-clause
SIMD (SSE) population count --- http://0x80.pl/articles/sse-popcount.html

========================================================================
SIMD popcount
========================================================================

Sample programs for my article http://0x80.pl/articles/sse-popcount.html

.. image:: https://travis-ci.org/WojciechMula/sse-popcount.svg?branch=master
    :target: https://travis-ci.org/WojciechMula/sse-popcount

Paper

Daniel Lemire, Nathan Kurz and I published the article `Faster Population Counts using AVX2 Instructions`__.

__ https://arxiv.org/abs/1611.07612

Introduction

The subdirectory ``original`` contains code from 2008; it is 32-bit and GCC-centric. The root directory contains newer C++11 code, written with intrinsics and tested on 64-bit machines.

There are two programs:

  • ``verify`` --- tests whether all non-lookup implementations count bits properly;
  • ``speed`` --- benchmarks the different implementations of the popcount procedure; run the program without arguments to print the help, which lists all options.
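
A check in the spirit of ``verify`` can be sketched as follows. This is only an illustration, not the code shipped in the repository: the names ``popcount_fn``, ``popcount_reference``, ``popcount_kernighan`` and ``agrees_with_reference`` are hypothetical, and the buffer-based signature is an assumption about how a candidate implementation might be called.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical signature for a popcount over a byte buffer; the real
// programs may use a different interface.
using popcount_fn = std::uint64_t (*)(const std::uint8_t* data, std::size_t size);

// Reference: test every bit individually (slow, but easy to trust).
std::uint64_t popcount_reference(const std::uint8_t* data, std::size_t size) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < size; ++i)
        for (int bit = 0; bit < 8; ++bit)
            total += (data[i] >> bit) & 1u;
    return total;
}

// Example candidate: clear the lowest set bit until the byte is zero.
std::uint64_t popcount_kernighan(const std::uint8_t* data, std::size_t size) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < size; ++i)
        for (unsigned v = data[i]; v != 0; v &= v - 1)
            ++total;
    return total;
}

// Returns true if `candidate` matches the reference on buffers of every
// size from 0 to max_size, filled with deterministic pseudo-random bytes.
bool agrees_with_reference(popcount_fn candidate, std::size_t max_size) {
    std::uint32_t state = 0x12345678u;  // fixed seed keeps runs reproducible
    for (std::size_t size = 0; size <= max_size; ++size) {
        std::vector<std::uint8_t> buf(size);
        for (auto& b : buf) {
            state = state * 1664525u + 1013904223u;  // simple LCG step
            b = static_cast<std::uint8_t>(state >> 24);
        }
        if (candidate(buf.data(), size) != popcount_reference(buf.data(), size))
            return false;
    }
    return true;
}
```

Testing every size from zero upwards matters for SIMD code, where the tail that does not fill a whole vector register is a common source of bugs.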

There are several targets:

  • default --- builtin functions, SSE and popcnt instructions;
  • AVX2 --- all of the above plus AVX2 implementations;
  • AVX512BW --- all of the above plus experimental AVX512BW code;
  • AVX512VBMI --- all of the above plus experimental AVX512VBMI code;
  • AVX512 VPOPCNT --- all of the above plus experimental AVX512 VPOPCNT code (it should compile with a very recent GCC__; the software emulator does not support this extension yet);
  • arm --- builtin and ARM Neon implementations.

Type ``make help`` for details. To run the benchmark for the default target, simply type ``make``.
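
As a rough illustration of how such targets relate to the compiler, the sketch below reports which instruction-set level a translation unit was built for. It assumes GCC or Clang, which predefine macros such as ``__AVX2__`` when the matching ``-m...`` flag is active; the function name ``compiled_target`` is hypothetical, and the project's Makefile remains the authority on which flags each target actually uses.

```cpp
#include <string>

// Hypothetical helper: names the most specific target from the list above
// whose instruction set was enabled at compile time (GCC/Clang macros).
inline std::string compiled_target() {
#if defined(__AVX512VPOPCNTDQ__)
    return "AVX512 VPOPCNT";
#elif defined(__AVX512VBMI__)
    return "AVX512VBMI";
#elif defined(__AVX512BW__)
    return "AVX512BW";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    return "arm";
#else
    return "default";
#endif
}
```

Compiling the same file with, for example, ``-mavx2`` and without it should change the reported name from ``default`` to ``AVX2`` on an x86 toolchain.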

__ https://github.com/gcc-mirror/gcc/commit/e0aa57d6b04908affdf4655a6b4a9f2d4d03483b

Available implementations

+---------------------------------------+-------------------------------------------------------------------+
| procedure                             | description                                                       |
+=======================================+===================================================================+
| lookup-8                              | lookup in std::uint8_t[256] LUT                                   |
+---------------------------------------+-------------------------------------------------------------------+
| lookup-64                             | lookup in std::uint64_t[256] LUT                                  |
+---------------------------------------+-------------------------------------------------------------------+
| bit-parallel                          | naive bit parallel method                                         |
+---------------------------------------+-------------------------------------------------------------------+
| bit-parallel-optimized                | a bit better bit parallel                                         |
+---------------------------------------+-------------------------------------------------------------------+
| bit-parallel-optimized2               | better utilization of 2- and 4-bit subwords                       |
+---------------------------------------+-------------------------------------------------------------------+
| bit-parallel-mul                      | bit-parallel with fewer instructions                              |
+---------------------------------------+-------------------------------------------------------------------+
| bit-parallel32                        | naive bit parallel method (32 bit)                                |
+---------------------------------------+-------------------------------------------------------------------+
| bit-parallel-optimized32              | a bit better bit parallel (32 bit)                                |
+---------------------------------------+-------------------------------------------------------------------+
| harley-seal                           | Harley-Seal popcount (4th iteration)                              |
+---------------------------------------+-------------------------------------------------------------------+
| sse-bit-parallel                      | SSE implementation of bit-parallel-optimized (unrolled)           |
+---------------------------------------+-------------------------------------------------------------------+
| sse-bit-parallel-original             | SSE implementation of bit-parallel-optimized                      |
+---------------------------------------+-------------------------------------------------------------------+
| sse-bit-parallel-better               | SSE implementation of bit-parallel with fewer instructions        |
+---------------------------------------+-------------------------------------------------------------------+
| sse-harley-seal                       | SSE implementation of Harley-Seal                                 |
+---------------------------------------+-------------------------------------------------------------------+
| sse-lookup                            | SSSE3 variant using pshufb instruction (unrolled)                 |
+---------------------------------------+-------------------------------------------------------------------+
| sse-lookup-original                   | SSSE3 variant using pshufb instruction                            |
+---------------------------------------+-------------------------------------------------------------------+
| avx2-lookup                           | AVX2 variant using pshufb instruction (unrolled)                  |
+---------------------------------------+-------------------------------------------------------------------+
| avx2-lookup-original                  | AVX2 variant using pshufb instruction                             |
+---------------------------------------+-------------------------------------------------------------------+
| avx2-harley-seal                      | AVX2 implementation of Harley-Seal                                |
+---------------------------------------+-------------------------------------------------------------------+
| cpu                                   | CPU instruction popcnt (64-bit variant)                           |
+---------------------------------------+-------------------------------------------------------------------+
| sse-cpu                               | load data with SSE, then count bits using popcnt                  |
+---------------------------------------+-------------------------------------------------------------------+
| avx2-cpu                              | load data with AVX2, then count bits using popcnt                 |
+---------------------------------------+-------------------------------------------------------------------+
| avx512-harley-seal                    | AVX512 implementation of Harley-Seal                              |
+---------------------------------------+-------------------------------------------------------------------+
| avx512bw-shuf                         | AVX512BW implementation uses shuffle instruction                  |
+---------------------------------------+-------------------------------------------------------------------+
| avx512vbmi-shuf                       | AVX512VBMI implementation uses shuffle instruction                |
+---------------------------------------+-------------------------------------------------------------------+
| avx512-vpopcnt                        | AVX512 VPOPCNT                                                    |
+---------------------------------------+-------------------------------------------------------------------+
| builtin-popcnt                        | builtin for popcnt                                                |
+---------------------------------------+-------------------------------------------------------------------+
| builtin-popcnt32                      | builtin for popcnt (32-bit variant)                               |
+---------------------------------------+-------------------------------------------------------------------+
| builtin-popcnt-unrolled               | unrolled builtin-popcnt                                           |
+---------------------------------------+-------------------------------------------------------------------+
| builtin-popcnt-unrolled32             | unrolled builtin-popcnt32                                         |
+---------------------------------------+-------------------------------------------------------------------+
| builtin-popcnt-unrolled-errata        | unrolled builtin-popcnt avoiding false-dependency                 |
+---------------------------------------+-------------------------------------------------------------------+
| builtin-popcnt-unrolled-errata-manual | unrolled builtin-popcnt avoiding false-dependency (assembly code) |
+---------------------------------------+-------------------------------------------------------------------+
| builtin-popcnt-movdq                  | builtin-popcnt where data is loaded via SSE registers             |
+---------------------------------------+-------------------------------------------------------------------+
| builtin-popcnt-movdq-unrolled         | builtin-popcnt-movdq unrolled                                     |
+---------------------------------------+-------------------------------------------------------------------+
| builtin-popcnt-movdq-unrolled_manual  | builtin-popcnt-movdq unrolled (assembly code)                     |
+---------------------------------------+-------------------------------------------------------------------+
| neon-vcnt                             | ARM Neon using VCNT                                               |
+---------------------------------------+-------------------------------------------------------------------+
| neon-HS                               | Harley-Seal using Neon VCNT                                       |
+---------------------------------------+-------------------------------------------------------------------+
| aarch64-cnt                           | ARMv8 Neon using CNT                                              |
+---------------------------------------+-------------------------------------------------------------------+
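
To make two of the scalar families in the table concrete, here is a minimal sketch of a byte-table lookup and of a bit-parallel count for a single 64-bit word. The function names are hypothetical and the code is written from the general technique, not copied from the repository; in particular, finishing the bit-parallel sum with a multiplication is one common variant and is only assumed to resemble ``bit-parallel-mul``.

```cpp
#include <cstdint>

// Reference: count bits one at a time.
inline unsigned popcount_naive(std::uint64_t x) {
    unsigned n = 0;
    for (; x != 0; x >>= 1)
        n += static_cast<unsigned>(x & 1);
    return n;
}

// Table with the bit count of every possible byte value.
struct ByteLUT {
    std::uint8_t count[256];
    ByteLUT() {
        count[0] = 0;
        for (int i = 1; i < 256; ++i)
            count[i] = static_cast<std::uint8_t>(count[i >> 1] + (i & 1));
    }
};

// Lookup approach: sum the table entries for the eight bytes of the word.
inline unsigned popcount_lookup8(std::uint64_t x) {
    static const ByteLUT lut;
    unsigned n = 0;
    for (int i = 0; i < 8; ++i, x >>= 8)
        n += lut.count[x & 0xff];
    return n;
}

// Bit-parallel approach: add neighbouring fields in parallel, doubling the
// field width at each step (1 -> 2 -> 4 -> 8 bits), then sum the eight
// byte counters with a single multiplication.
inline unsigned popcount_bit_parallel(std::uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return static_cast<unsigned>((x * 0x0101010101010101ULL) >> 56);
}
```

The SIMD variants listed above apply the same ideas to whole vector registers, for instance by using ``pshufb`` as a 16-entry nibble lookup instead of a 256-entry table in memory.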

Performance results

The subdirectory results__ contains performance results from various computers. If you can, please contribute.

__ results/README.rst

Acknowledgments

  • Kim Walisch (@kimwalisch) wrote the scalar Harley-Seal implementation.
  • Simon Lindholm (@simonlindholm) added unrolled versions of the procedures.
  • Dan Luu (@danluu) agreed to include his procedures (``builtin-*``) in this project. More details are in Dan's article `Hand coded assembly beats intrinsics in speed and simplicity`__.

__ http://danluu.com/assembly-intrinsics/

See also

  • libpopcnt__ --- library by Kim Walisch utilizing methods from our paper.

__ https://github.com/kimwalisch/libpopcnt

.. vim: nowrap
