WojciechMula / Sse Popcount

Licence: bsd-2-clause

SIMD (SSE) population count --- http://0x80.pl/articles/sse-popcount.html

Labels

aarch64 sse avx2 avx512

Projects that are alternatives of or similar to Sse Popcount

Boost.simd

Boost SIMD

Stars: ✭ 238 (+5.31%)

Mutual labels: aarch64, sse, avx2, avx512

Unisimd Assembler

SIMD macro assembler unified for ARM, MIPS, PPC and x86

Stars: ✭ 63 (-72.12%)

Mutual labels: aarch64, sse, avx2, avx512

SIMD Vector Classes for C++

Stars: ✭ 985 (+335.84%)

Mutual labels: sse, avx2, avx512

Simd

C++ image processing and machine learning library with using of SIMD: SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, AVX-512, VMX(Altivec) and VSX(Power7), NEON for ARM.

Stars: ✭ 1,263 (+458.85%)

Mutual labels: sse, avx2, avx512

Base64simd

Base64 coding and decoding with SIMD instructions (SSE/AVX2/AVX512F/AVX512BW/AVX512VBMI/ARM Neon)

Stars: ✭ 115 (-49.12%)

Mutual labels: sse, avx2, avx512

Simde

Implementations of SIMD instruction sets for systems which don't natively support them.

Stars: ✭ 1,012 (+347.79%)

Mutual labels: sse, avx2, avx512

Onednn

oneAPI Deep Neural Network Library (oneDNN)

Stars: ✭ 2,600 (+1050.44%)

Mutual labels: aarch64, avx2, avx512

ternary-logic

Support for ternary logic in SSE, XOP, AVX2 and x86 programs

Stars: ✭ 21 (-90.71%)

Mutual labels: sse, avx2, avx512

Libxsmm

Library for specialized dense and sparse matrix operations, and deep learning primitives.

Stars: ✭ 518 (+129.2%)

Mutual labels: sse, avx2, avx512

Sse4 Strstr

SIMD (SWAR/SSE/SSE4/AVX2/AVX512F/ARM Neon) of Karp-Rabin algorithm's modification

Stars: ✭ 115 (-49.12%)

Mutual labels: sse, avx2, avx512

Nsimd

Agenium Scale vectorization library for CPUs and GPUs

Stars: ✭ 138 (-38.94%)

Mutual labels: aarch64, avx2, avx512

Libsimdpp

Portable header-only C++ low level SIMD library

Stars: ✭ 914 (+304.42%)

Mutual labels: sse, avx2, avx512

simd-byte-lookup

SIMDized check which bytes are in a set

Stars: ✭ 23 (-89.82%)

Mutual labels: sse, avx2, avx512

Quadray Engine

Realtime raytracer using SIMD on ARM, MIPS, PPC and x86

Stars: ✭ 13 (-94.25%)

Mutual labels: sse, avx2, avx512

Toys

Storage for my snippets, toy programs, etc.

Stars: ✭ 187 (-17.26%)

Mutual labels: sse, avx2, avx512

Simdjson

Parsing gigabytes of JSON per second

Stars: ✭ 15,115 (+6588.05%)

Mutual labels: aarch64, avx2

Umesimd

UME::SIMD A library for explicit simd vectorization.

Stars: ✭ 66 (-70.8%)

Mutual labels: avx2, avx512

Hybridizer Basic Samples

Examples of C# code compiled to GPU by hybridizer

Stars: ✭ 186 (-17.7%)

Mutual labels: avx2, avx512

Xsimd

C++ wrappers for SIMD intrinsics and parallelized, optimized mathematical functions (SSE, AVX, NEON, AVX512)

Stars: ✭ 964 (+326.55%)

Mutual labels: sse, avx512

Md5 Simd

Accelerate aggregated MD5 hashing performance up to 8x for AVX512 and 4x for AVX2. Useful for server applications that need to compute many MD5 sums in parallel.

Stars: ✭ 71 (-68.58%)

Mutual labels: avx2, avx512

========================================================================
                             SIMD popcount
========================================================================

Sample programs for my article http://0x80.pl/articles/sse-popcount.html

.. image:: https://travis-ci.org/WojciechMula/sse-popcount.svg?branch=master
    :target: https://travis-ci.org/WojciechMula/sse-popcount

Paper

Daniel Lemire, Nathan Kurz and I published an article `Faster Population Counts using AVX2 Instructions`__.

__ https://arxiv.org/abs/1611.07612

Introduction

The subdirectory ``original`` contains code from 2008; it is 32-bit and GCC-centric. The root directory contains fresh C++11 code, written with intrinsics and tested on 64-bit machines.

There are two programs:

* ``verify`` --- tests whether all non-lookup implementations count bits properly;
* ``speed`` --- benchmarks different implementations of the popcount procedure; run the program without arguments to print the help, which lists all options.
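
A check in the spirit of ``verify`` could look like the sketch below. It is only an illustration, not the repository's code: the names ``popcount_reference``, ``popcount_kernighan`` and ``verify_against_reference`` are hypothetical, and the pseudo-random input and the loop over every prefix length are assumptions about what a reasonable test might do.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical signature shared by all procedures under test.
using popcount_fn = std::uint64_t (*)(const std::uint8_t*, std::size_t);

// Reference: inspect every bit of every byte, one at a time.
std::uint64_t popcount_reference(const std::uint8_t* data, std::size_t n) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (unsigned b = data[i]; b != 0; b >>= 1)
            total += b & 1u;
    return total;
}

// Example candidate: clear the lowest set bit until the byte is zero.
std::uint64_t popcount_kernighan(const std::uint8_t* data, std::size_t n) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (unsigned b = data[i]; b != 0; b &= b - 1)
            ++total;
    return total;
}

// Compare the candidate with the reference on every prefix of a
// deterministic pseudo-random buffer (a simple LCG, chosen for brevity).
bool verify_against_reference(popcount_fn candidate) {
    std::vector<std::uint8_t> buf(1024);
    std::uint32_t state = 12345;
    for (auto& byte : buf) {
        state = state * 1664525u + 1013904223u;
        byte = static_cast<std::uint8_t>(state >> 24);
    }
    for (std::size_t n = 0; n <= buf.size(); ++n)
        if (candidate(buf.data(), n) != popcount_reference(buf.data(), n))
            return false;
    return true;
}
```

Checking every prefix length, rather than only the full buffer, is meant to exercise the tail handling that unrolled and SIMD procedures typically need when the input size is not a multiple of the vector width.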

There are several targets:

* default --- builtin functions, SSE and ``popcnt`` instructions;
* AVX2 --- all above plus AVX2 implementations;
* AVX512BW --- all above plus experimental AVX512BW code;
* AVX512VBMI --- all above plus experimental AVX512VBMI code;
* AVX512 VPOPCNT --- all above plus experimental AVX512 VPOPCNT code (should be compilable with a very recent GCC__; the software emulator doesn't support this extension yet);
* arm --- builtin and ARM Neon implementations.

Type ``make help`` to find out the details. To run the default target benchmark, simply type ``make``.

__ https://github.com/gcc-mirror/gcc/commit/e0aa57d6b04908affdf4655a6b4a9f2d4d03483b

Available implementations

+---------------------------------------+-------------------------------------------------------------------+
| procedure                             | description                                                       |
+=======================================+===================================================================+
| lookup-8                              | lookup in std::uint8_t[256] LUT                                   |
+---------------------------------------+-------------------------------------------------------------------+
| lookup-64                             | lookup in std::uint64_t[256] LUT                                  |
+---------------------------------------+-------------------------------------------------------------------+
| bit-parallel                          | naive bit parallel method                                         |
+---------------------------------------+-------------------------------------------------------------------+
| bit-parallel-optimized                | a bit better bit parallel                                         |
+---------------------------------------+-------------------------------------------------------------------+
| bit-parallel-optimized2               | better utilization of 2- and 4-bit subwords                       |
+---------------------------------------+-------------------------------------------------------------------+
| bit-parallel-mul                      | bit-parallel with fewer instructions                              |
+---------------------------------------+-------------------------------------------------------------------+
| bit-parallel32                        | naive bit parallel method (32 bit)                                |
+---------------------------------------+-------------------------------------------------------------------+
| bit-parallel-optimized32              | a bit better bit parallel (32 bit)                                |
+---------------------------------------+-------------------------------------------------------------------+
| harley-seal                           | Harley-Seal popcount (4th iteration)                              |
+---------------------------------------+-------------------------------------------------------------------+
| sse-bit-parallel                      | SSE implementation of bit-parallel-optimized (unrolled)           |
+---------------------------------------+-------------------------------------------------------------------+
| sse-bit-parallel-original             | SSE implementation of bit-parallel-optimized                      |
+---------------------------------------+-------------------------------------------------------------------+
| sse-bit-parallel-better               | SSE implementation of bit-parallel with fewer instructions        |
+---------------------------------------+-------------------------------------------------------------------+
| sse-harley-seal                       | SSE implementation of Harley-Seal                                 |
+---------------------------------------+-------------------------------------------------------------------+
| sse-lookup                            | SSSE3 variant using pshufb instruction (unrolled)                 |
+---------------------------------------+-------------------------------------------------------------------+
| sse-lookup-original                   | SSSE3 variant using pshufb instruction                            |
+---------------------------------------+-------------------------------------------------------------------+
| avx2-lookup                           | AVX2 variant using pshufb instruction (unrolled)                  |
+---------------------------------------+-------------------------------------------------------------------+
| avx2-lookup-original                  | AVX2 variant using pshufb instruction                             |
+---------------------------------------+-------------------------------------------------------------------+
| avx2-harley-seal                      | AVX2 implementation of Harley-Seal                                |
+---------------------------------------+-------------------------------------------------------------------+
| cpu                                   | CPU instruction popcnt (64-bit variant)                           |
+---------------------------------------+-------------------------------------------------------------------+
| sse-cpu                               | load data with SSE, then count bits using popcnt                  |
+---------------------------------------+-------------------------------------------------------------------+
| avx2-cpu                              | load data with AVX2, then count bits using popcnt                 |
+---------------------------------------+-------------------------------------------------------------------+
| avx512-harley-seal                    | AVX512 implementation of Harley-Seal                              |
+---------------------------------------+-------------------------------------------------------------------+
| avx512bw-shuf                         | AVX512BW implementation uses shuffle instruction                  |
+---------------------------------------+-------------------------------------------------------------------+
| avx512vbmi-shuf                       | AVX512VBMI implementation uses shuffle instruction                |
+---------------------------------------+-------------------------------------------------------------------+
| avx512-vpopcnt                        | AVX512 VPOPCNT                                                    |
+---------------------------------------+-------------------------------------------------------------------+
| builtin-popcnt                        | builtin for popcnt                                                |
+---------------------------------------+-------------------------------------------------------------------+
| builtin-popcnt32                      | builtin for popcnt (32-bit variant)                               |
+---------------------------------------+-------------------------------------------------------------------+
| builtin-popcnt-unrolled               | unrolled builtin-popcnt                                           |
+---------------------------------------+-------------------------------------------------------------------+
| builtin-popcnt-unrolled32             | unrolled builtin-popcnt32                                         |
+---------------------------------------+-------------------------------------------------------------------+
| builtin-popcnt-unrolled-errata        | unrolled builtin-popcnt avoiding false-dependency                 |
+---------------------------------------+-------------------------------------------------------------------+
| builtin-popcnt-unrolled-errata-manual | unrolled builtin-popcnt avoiding false-dependency (assembly code) |
+---------------------------------------+-------------------------------------------------------------------+
| builtin-popcnt-movdq                  | builtin-popcnt where data is loaded via SSE registers             |
+---------------------------------------+-------------------------------------------------------------------+
| builtin-popcnt-movdq-unrolled         | builtin-popcnt-movdq unrolled                                     |
+---------------------------------------+-------------------------------------------------------------------+
| builtin-popcnt-movdq-unrolled_manual  | builtin-popcnt-movdq unrolled (assembly code)                     |
+---------------------------------------+-------------------------------------------------------------------+
| neon-vcnt                             | ARM Neon using VCNT                                               |
+---------------------------------------+-------------------------------------------------------------------+
| neon-HS                               | Harley-Seal using Neon VCNT                                       |
+---------------------------------------+-------------------------------------------------------------------+
| aarch64-cnt                           | ARMv8 Neon using CNT                                              |
+---------------------------------------+-------------------------------------------------------------------+
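
To make two of the scalar entries in the table more concrete, here is a rough sketch of a byte-table lookup (in the style of ``lookup-8``) and of the multiply-based bit-parallel method (in the style of ``bit-parallel-mul``). The names ``Lut8``, ``popcount_lookup8`` and ``popcount_word`` are hypothetical and not taken from the repository, whose procedures may differ in details such as unrolling and tail handling.

```cpp
#include <cstddef>
#include <cstdint>

// Table of bit counts for every byte value, filled once at construction.
// popcount(i) = lowest bit of i + popcount(i / 2).
struct Lut8 {
    std::uint8_t bits[256];
    Lut8() {
        bits[0] = 0;
        for (int i = 1; i < 256; ++i)
            bits[i] = static_cast<std::uint8_t>((i & 1) + bits[i / 2]);
    }
};

// Lookup approach: one table access per input byte.
std::uint64_t popcount_lookup8(const std::uint8_t* data, std::size_t n) {
    static const Lut8 lut;
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += lut.bits[data[i]];
    return total;
}

// Bit-parallel approach on one 64-bit word: sum bits within 2-bit, 4-bit
// and 8-bit fields, then add all eight byte sums with a single multiply.
std::uint64_t popcount_word(std::uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (x * 0x0101010101010101ULL) >> 56;
}
```

The lookup variant trades arithmetic for memory accesses, while the bit-parallel variant needs no table at all; the SSE and AVX2 procedures listed above apply similar ideas to wider registers.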

Performance results

The subdirectory results__ contains performance results from various computers. If you can, please contribute.

__ results/README.rst

Acknowledgments

* Kim Walisch (@kimwalisch) wrote the Harley-Seal scalar implementation.
* Simon Lindholm (@simonlindholm) added unrolled versions of procedures.
* Dan Luu (@danluu) agreed to include his procedures (``builtin-*``) in this project. More details can be found in Dan's article `Hand coded assembly beats intrinsics in speed and simplicity`__.

__ http://danluu.com/assembly-intrinsics/
