positional-popcount

This repository contains experimental functions for computing the novel "positional population count" (pospopcnt) statistics using fast SIMD instructions. Given a stream of k-bit words, we seek to count the number of set bits in positions 0, 1, 2, ..., k-1. This problem is a generalization of the population-count problem where we count the sum total of set bits in a k-bit word.

These functions can be applied to any packed 1-hot 8-, 16-, 32-, or 64-bit primitive, for example in machine learning/deep learning. Using large registers (AVX-512), we can achieve ~50 GB/s (~0.120 CPU cycles / int) throughput (25 billion 16-bit integers / second or 200 billion one-hot vectors / second).

For production applications please use libalgebra.h.

The first set of subroutines are described in the paper:

Efficient Computation of Positional Population Counts Using SIMD Instructions by Marcus D. R. Klarqvist and Wojciech Muła and Daniel Lemire

Speedup

This benchmark shows the speedup of the 3 pospopcnt algorithms used on x86 CPUs compared to the efficient auto-vectorization of pospopcnt_u16_scalar_naive for different array sizes (in number of 2-byte values). See Results for additional information.

Algorithm	128	256	512	1024	2048	4096	8192	65536
pospopcnt_u16_sse_blend_popcnt_unroll8	2.09	3.16	2.35	1.88	1.67	1.56	1.5	1.44
pospopcnt_u16_avx512_blend_popcnt_unroll8	1.78	3.61	3.61	3.59	3.68	3.65	3.67	3.7
pospopcnt_u16_avx512_adder_forest	0.77	0.9	3.24	3.96	4.96	5.87	6.52	7.24
pospopcnt_u16_avx512_harley_seal	0.52	0.74	1.83	2.64	4.06	6.43	9.41	16.28

Compared to a naive unvectorized solution (pospopcnt_u16_scalar_naive_nosimd):

Algorithm	128	256	512	1024	2048	4096	8192	65536
pospopcnt_u16_sse_blend_popcnt_unroll8	8.28	9.84	10.55	11	11.58	11.93	12.13	12.28
pospopcnt_u16_avx512_blend_popcnt_unroll8	7.07	11.25	16.21	21	25.49	27.91	29.73	31.55
pospopcnt_u16_avx512_adder_forest	3.05	2.82	14.53	23.13	34.37	44.91	52.78	61.68
pospopcnt_u16_avx512_harley_seal	2.07	2.3	8.21	15.41	28.17	49.14	76.11	138.71

The host architecture used is a 10 nm Cannon Lake Core i3-8121U with gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5).

Usage

For Linux/Mac: Compile the test suite with: make and run ./bench. For Windows: run cmake . then make. The test suite require c++11 whereas the example and functions require only c99. For more detailed test, see Instrumented tests (Linux specific).

Include pospopcnt.h and pospopcnt.c in your project. Then use the wrapper function for pospopcnt:

pospopcnt_u16(datain, length, target_counters);

See example.c for a complete example. Compile with make example.

Note

This is a collaborative effort between Marcus D. R. Klarqvist (@klarqvist), Wojciech Muła (@WojciechMula), and Daniel Lemire (@lemire). We acknowledge @aqrit for contributing with the algorithms pospopcnt_u16_sse2_sad and pospopcnt_u16_scalar_umul128. M.D.R.K. acknowledge James Bonfield (@jkbonfield) for informal discussions that resulted in this work.

Results

We simulated 1 million FLAG fields using a uniform distrubtion U(min,max) with the arguments {1,65536} for 1,000 repetitions using a single core. Numbers represent the average throughput in MB/s (1 MB = 1024b) or average number of CPU cycles per integer. The host architecture used is a 10 nm Cannon Lake Core i3-8121U, a 14 nm Sky Lake Xeon W-2104, and a 22 nm Haswell Xeon E5-2697 v3.

Throughput in CPU cycles / 16-bit integer (lower is better):

Algorithm	Cannon Lake	Sky Lake	Haswell
pospopcnt_u16_scalar_naive	2.049	3.016	3.778
pospopcnt_u16_scalar_naive_nosimd	17.521	17.847	18.031
pospopcnt_u16_scalar_partition	3.079	3.042	3.358
pospopcnt_u16_scalar_hist1x4	2.844	2.953	3.119
pospopcnt_u16_sse_single	3.853	4.023	4.305
pospopcnt_u16_sse_mula	2.074	1.620	2.133
pospopcnt_u16_sse_mula_unroll4	1.569	1.396	1.709
pospopcnt_u16_sse_mula_unroll8	1.427	1.348	1.500
pospopcnt_u16_sse_mula_unroll16	1.379	1.407	1.534
pospopcnt_u16_sse2_sad	1.002	1.004	1.365
pospopcnt_u16_sse2_csa	0.342	0.356	0.428
pospopcnt_u16_avx2_popcnt	2.378	2.334	3.013
pospopcnt_u16_avx2	2.023	3.025	4.012
pospopcnt_u16_avx2_naive_counter	2.022	3.023	3.905
pospopcnt_u16_avx2_single	2.698	2.937	3.589
pospopcnt_u16_avx2_lemire	2.862	1.919	2.187
pospopcnt_u16_avx2_lemire2	1.708	1.139	1.503
pospopcnt_u16_avx2_mula	1.100	1.035	1.305
pospopcnt_u16_avx2_mula_unroll4	0.833	0.775	0.879
pospopcnt_u16_avx2_mula_unroll8	0.745	0.709	0.768
pospopcnt_u16_avx2_mula_unroll16	0.725	0.741	0.805
pospopcnt_u16_avx2_mula3	0.414	0.416	0.479
pospopcnt_u16_avx2_csa	0.199	0.270	0.308
pospopcnt_u16_avx512	1.501	1.616	-
pospopcnt_u16_avx512_popcnt32_mask	0.910	1.375	-
pospopcnt_u16_avx512_popcnt64_mask	0.879	0.995	-
pospopcnt_u16_avx512_masked_ops	1.835	2.000	-
pospopcnt_u16_avx512_popcnt	1.663	1.741	-
pospopcnt_u16_avx512_mula	0.830	0.803	-
pospopcnt_u16_avx512_mula_unroll4	0.642	0.582	-
pospopcnt_u16_avx512_mula_unroll8	0.552	0.588	-
pospopcnt_u16_avx512_mula2	0.509	0.510	-
pospopcnt_u16_avx512_mula3	0.280	0.353	-
pospopcnt_u16_avx512_csa	0.121	0.264	-

Throughput in MB/s (higher is better):

Algorithm	Cannon Lake	Sky Lake	Haswell
pospopcnt_u16_scalar_naive	2952.55	1816.52	1319.05
pospopcnt_u16_scalar_naive_nosimd	345.284	331.54	263.482
pospopcnt_u16_scalar_partition	1962.29	1794.31	1428.73
pospopcnt_u16_scalar_hist1x4	2126.36	1842.85	1575.02
pospopcnt_u16_sse_single	1569.83	1343.2	1149.01
pospopcnt_u16_sse_mula	2916.44	3238.28	2251.89
pospopcnt_u16_sse_mula_unroll4	3845.46	3754.62	2464.27
pospopcnt_u16_sse_mula_unroll8	4238.55	3884.62	3254.86
pospopcnt_u16_sse_mula_unroll16	4374.65	3682.14	3137.09
pospopcnt_u16_sse2_sad	6035.91	5183.01	3732.58
pospopcnt_u16_sse2_csa	17498.6	12466.3	10422.7
pospopcnt_u16_avx2_popcnt	2539.75	2223.02	1647.11
pospopcnt_u16_avx2	2994.27	1796	1134.65
pospopcnt_u16_avx2_naive_counter	2994.27	1809.63	1112.81
pospopcnt_u16_avx2_single	2230.82	1835.75	1340.37
pospopcnt_u16_avx2_lemire	2114.58	2619.98	2228.21
pospopcnt_u16_avx2_lemire2	3519.09	4618.28	3266.01
pospopcnt_u16_avx2_mula	5480.89	3653.92	2435.95
pospopcnt_u16_avx2_mula_unroll4	7224.81	6055.08	3981.94
pospopcnt_u16_avx2_mula_unroll8	8081.99	7143.63	5211.34
pospopcnt_u16_avx2_mula_unroll16	8329.03	6811.96	5512.57
pospopcnt_u16_avx2_mula3	14671.9	11089.2	7817
pospopcnt_u16_avx2_csa	28899.2	15381.8	14559.9
pospopcnt_u16_avx512	4023.94	3249.32	-
pospopcnt_u16_avx512_popcnt32_mask	4721.16	3776.93	-
pospopcnt_u16_avx512_popcnt64_mask	6860.97	5155	-
pospopcnt_u16_avx512_masked_ops	3299.91	4119.54	-
pospopcnt_u16_avx512_popcnt	3571.81	2989.57	-
pospopcnt_u16_avx512_mula	7252.28	6294.88	-
pospopcnt_u16_avx512_mula_unroll4	9395.81	8365.56	-
pospopcnt_u16_avx512_mula_unroll8	10899.1	8292.82	-
pospopcnt_u16_avx512_mula2	11846.9	9633.07	-
pospopcnt_u16_avx512_mula3	21430.9	11995.9	-
pospopcnt_u16_avx512_csa	48906.4	17339.5	-

Instrumented tests (Linux specific)

If you are running Linux, you can run tests that take advantage of Performance Counters for Linux (PCL). This allows for programmatic discovery and enumeration of all counters and events. Compile this test suite with: make instrumented_benchmark. Running the output executable requires root (sudo) access to the host machine. Pass the -v (verbose) flag to get a detailed report of performance counters (example output from the Cannon Lake machine):

$ make instrumented_benchmark 
$ sudo ./instrumented_benchmark -v
n = 10000000 
pospopcnt_u16_scalar_naive_nosimd       	instructions per cycle 3.82, cycles per 16-bit word:  17.524, instructions per 16-bit word 67.002 
min:   675172 cycles,  2581464 instructions, 	       1 branch mis.,        4 cache ref.,        0 cache mis.
avg: 681460.1 cycles, 2581464.2 instructions, 	     2.2 branch mis.,     41.2 cache ref.,      0.6 cache mis.

pospopcnt_u16_scalar_naive              	instructions per cycle 1.99, cycles per 16-bit word:  2.059, instructions per 16-bit word 4.104 
min:    79312 cycles,   158101 instructions, 	       1 branch mis.,       83 cache ref.,        0 cache mis.
avg:  79454.0 cycles, 158101.0 instructions, 	     2.0 branch mis.,    133.4 cache ref.,      0.0 cache mis.

pospopcnt_u16_scalar_partition          	instructions per cycle 2.60, cycles per 16-bit word:  3.090, instructions per 16-bit word 8.038 
min:   119056 cycles,   309690 instructions, 	       1 branch mis.,        4 cache ref.,        0 cache mis.
avg: 120536.1 cycles, 309690.0 instructions, 	     1.2 branch mis.,     31.8 cache ref.,      0.3 cache mis.

pospopcnt_u16_hist1x4                   	instructions per cycle 2.03, cycles per 16-bit word:  2.851, instructions per 16-bit word 5.788 
min:   109856 cycles,   223008 instructions, 	       2 branch mis.,       10 cache ref.,        0 cache mis.
avg: 111365.9 cycles, 223008.0 instructions, 	     2.4 branch mis.,     61.1 cache ref.,      0.3 cache mis.

pospopcnt_u16_sse_single                	instructions per cycle 2.73, cycles per 16-bit word:  3.616, instructions per 16-bit word 9.874 
min:   139336 cycles,   380410 instructions, 	       3 branch mis.,      129 cache ref.,        0 cache mis.
avg: 139815.0 cycles, 380410.0 instructions, 	     5.8 branch mis.,    206.7 cache ref.,      0.8 cache mis.

pospopcnt_u16_sse_mula                  	instructions per cycle 3.00, cycles per 16-bit word:  2.084, instructions per 16-bit word 6.252 
min:    80282 cycles,   240888 instructions, 	       1 branch mis.,        7 cache ref.,        0 cache mis.
avg:  80492.4 cycles, 240888.0 instructions, 	     2.1 branch mis.,     29.6 cache ref.,      0.4 cache mis.

pospopcnt_u16_sse_mula_unroll4          	instructions per cycle 3.35, cycles per 16-bit word:  1.577, instructions per 16-bit word 5.284 
min:    60757 cycles,   203573 instructions, 	       2 branch mis.,       29 cache ref.,        0 cache mis.
avg:  61098.4 cycles, 203573.0 instructions, 	     2.3 branch mis.,     64.2 cache ref.,      0.2 cache mis.

pospopcnt_u16_sse_mula_unroll8          	instructions per cycle 4.08, cycles per 16-bit word:  1.433, instructions per 16-bit word 5.846 
min:    55194 cycles,   225221 instructions, 	       2 branch mis.,       12 cache ref.,        0 cache mis.
avg:  55304.1 cycles, 225221.0 instructions, 	     2.2 branch mis.,     35.0 cache ref.,      0.4 cache mis.

pospopcnt_u16_sse_mula_unroll16         	instructions per cycle 3.62, cycles per 16-bit word:  1.385, instructions per 16-bit word 5.018 
min:    53377 cycles,   193332 instructions, 	       2 branch mis.,       45 cache ref.,        0 cache mis.
avg:  53743.4 cycles, 193332.0 instructions, 	     2.4 branch mis.,     93.8 cache ref.,      0.5 cache mis.

pospopcnt_u16_sse_sad                   	instructions per cycle 3.08, cycles per 16-bit word:  0.985, instructions per 16-bit word 3.033 
min:    37943 cycles,   116844 instructions, 	       1 branch mis.,       65 cache ref.,        0 cache mis.
avg:  38069.8 cycles, 116844.0 instructions, 	     2.1 branch mis.,    100.1 cache ref.,      0.1 cache mis.

pospopcnt_u16_avx2_popcnt               	instructions per cycle 2.56, cycles per 16-bit word:  2.379, instructions per 16-bit word 6.100 
min:    91645 cycles,   235038 instructions, 	       1 branch mis.,      154 cache ref.,        0 cache mis.
avg:  92870.1 cycles, 235038.0 instructions, 	     4.2 branch mis.,    243.9 cache ref.,      2.5 cache mis.

pospopcnt_u16_avx2                      	instructions per cycle 2.01, cycles per 16-bit word:  2.038, instructions per 16-bit word 4.105 
min:    78506 cycles,   158140 instructions, 	       2 branch mis.,      131 cache ref.,        0 cache mis.
avg:  78592.1 cycles, 158140.0 instructions, 	     2.3 branch mis.,    197.1 cache ref.,      0.8 cache mis.

pospopcnt_u16_avx2_naive_counter        	instructions per cycle 2.02, cycles per 16-bit word:  2.034, instructions per 16-bit word 4.104 
min:    78360 cycles,   158124 instructions, 	       1 branch mis.,      102 cache ref.,        0 cache mis.
avg:  78479.2 cycles, 158124.0 instructions, 	     2.3 branch mis.,    162.3 cache ref.,      0.8 cache mis.

pospopcnt_u16_avx2_single               	instructions per cycle 2.02, cycles per 16-bit word:  2.029, instructions per 16-bit word 4.104 
min:    78191 cycles,   158113 instructions, 	       1 branch mis.,      125 cache ref.,        0 cache mis.
avg:  78324.6 cycles, 158113.0 instructions, 	     2.3 branch mis.,    179.0 cache ref.,      0.6 cache mis.

pospopcnt_u16_avx2_lemire               	instructions per cycle 2.80, cycles per 16-bit word:  2.861, instructions per 16-bit word 8.006 
min:   110245 cycles,   308436 instructions, 	       1 branch mis.,       44 cache ref.,        0 cache mis.
avg: 110818.9 cycles, 308436.0 instructions, 	     1.1 branch mis.,    103.3 cache ref.,      0.4 cache mis.

pospopcnt_u16_avx2_lemire2              	instructions per cycle 3.17, cycles per 16-bit word:  1.697, instructions per 16-bit word 5.382 
min:    65395 cycles,   207368 instructions, 	       1 branch mis.,        8 cache ref.,        0 cache mis.
avg:  66261.5 cycles, 207368.0 instructions, 	     1.1 branch mis.,     42.6 cache ref.,      0.6 cache mis.

pospopcnt_u16_avx2_mula                 	instructions per cycle 2.89, cycles per 16-bit word:  1.104, instructions per 16-bit word 3.190 
min:    42544 cycles,   122897 instructions, 	       2 branch mis.,       10 cache ref.,        0 cache mis.
avg:  42804.7 cycles, 122897.0 instructions, 	     3.1 branch mis.,     40.5 cache ref.,      0.4 cache mis.

pospopcnt_u16_avx2_mula2                	instructions per cycle 2.94, cycles per 16-bit word:  1.454, instructions per 16-bit word 4.269 
min:    56002 cycles,   164472 instructions, 	       3 branch mis.,        8 cache ref.,        0 cache mis.
avg:  56256.0 cycles, 164472.0 instructions, 	     3.3 branch mis.,     43.8 cache ref.,      0.4 cache mis.

pospopcnt_u16_avx2_mula_unroll4         	instructions per cycle 3.19, cycles per 16-bit word:  0.847, instructions per 16-bit word 2.706 
min:    32631 cycles,   104244 instructions, 	       1 branch mis.,       10 cache ref.,        0 cache mis.
avg:  32779.7 cycles, 104244.0 instructions, 	     1.9 branch mis.,     38.2 cache ref.,      0.3 cache mis.

pospopcnt_u16_avx2_mula_unroll8         	instructions per cycle 3.95, cycles per 16-bit word:  0.756, instructions per 16-bit word 2.986 
min:    29123 cycles,   115056 instructions, 	       2 branch mis.,       10 cache ref.,        0 cache mis.
avg:  29201.8 cycles, 115056.0 instructions, 	     2.2 branch mis.,     41.4 cache ref.,      0.5 cache mis.

pospopcnt_u16_avx2_mula_unroll16        	instructions per cycle 3.49, cycles per 16-bit word:  0.738, instructions per 16-bit word 2.574 
min:    28416 cycles,    99177 instructions, 	       3 branch mis.,       31 cache ref.,        0 cache mis.
avg:  28535.7 cycles,  99177.0 instructions, 	     3.6 branch mis.,     84.7 cache ref.,      0.5 cache mis.

pospopcnt_u16_avx512                    	instructions per cycle 2.07, cycles per 16-bit word:  1.511, instructions per 16-bit word 3.133 
min:    58218 cycles,   120696 instructions, 	       1 branch mis.,       30 cache ref.,        0 cache mis.
avg:  58299.8 cycles, 120696.0 instructions, 	     2.1 branch mis.,     80.8 cache ref.,      0.6 cache mis.

pospopcnt_u16_avx512_popcnt32_mask      	instructions per cycle 3.21, cycles per 16-bit word:  0.818, instructions per 16-bit word 2.629 
min:    31534 cycles,   101304 instructions, 	       1 branch mis.,        9 cache ref.,        0 cache mis.
avg:  31807.5 cycles, 101304.0 instructions, 	     2.1 branch mis.,     47.0 cache ref.,      0.3 cache mis.

pospopcnt_u16_avx512_popcnt64_mask      	instructions per cycle 3.10, cycles per 16-bit word:  0.835, instructions per 16-bit word 2.593 
min:    32189 cycles,    99912 instructions, 	       1 branch mis.,       60 cache ref.,        0 cache mis.
avg:  32391.7 cycles,  99912.0 instructions, 	     2.6 branch mis.,     93.8 cache ref.,      0.7 cache mis.

pospopcnt_u16_avx512_popcnt             	instructions per cycle 1.96, cycles per 16-bit word:  1.676, instructions per 16-bit word 3.287 
min:    64590 cycles,   126635 instructions, 	       1 branch mis.,       90 cache ref.,        0 cache mis.
avg:  64755.2 cycles, 126635.0 instructions, 	     2.3 branch mis.,    169.1 cache ref.,      1.3 cache mis.

pospopcnt_u16_avx512_mula               	instructions per cycle 2.41, cycles per 16-bit word:  0.752, instructions per 16-bit word 1.815 
min:    28975 cycles,    69920 instructions, 	       1 branch mis.,       17 cache ref.,        0 cache mis.
avg:  29182.1 cycles,  69920.0 instructions, 	     1.7 branch mis.,     50.0 cache ref.,      0.2 cache mis.

pospopcnt_u16_avx512_mula_unroll4       	instructions per cycle 2.52, cycles per 16-bit word:  0.624, instructions per 16-bit word 1.573 
min:    24054 cycles,    60599 instructions, 	       1 branch mis.,       28 cache ref.,        0 cache mis.
avg:  24189.0 cycles,  60599.0 instructions, 	     1.3 branch mis.,     69.2 cache ref.,      0.4 cache mis.

pospopcnt_u16_avx512_mula_unroll8       	instructions per cycle 3.08, cycles per 16-bit word:  0.556, instructions per 16-bit word 1.713 
min:    21410 cycles,    66003 instructions, 	       1 branch mis.,       16 cache ref.,        0 cache mis.
avg:  21472.3 cycles,  66003.0 instructions, 	     1.9 branch mis.,     47.8 cache ref.,      0.5 cache mis.

pospopcnt_u16_avx2_mula3                	instructions per cycle 3.02, cycles per 16-bit word:  0.436, instructions per 16-bit word 1.318 
min:    16804 cycles,    50796 instructions, 	       1 branch mis.,       49 cache ref.,        0 cache mis.
avg:  17025.3 cycles,  50796.0 instructions, 	     3.0 branch mis.,    135.1 cache ref.,      1.1 cache mis.

pospopcnt_u16_avx512_mula3              	instructions per cycle 2.20, cycles per 16-bit word:  0.291, instructions per 16-bit word 0.640 
min:    11194 cycles,    24644 instructions, 	       0 branch mis.,       76 cache ref.,        0 cache mis.
avg:  11295.2 cycles,  24644.0 instructions, 	     0.2 branch mis.,    147.0 cache ref.,      1.5 cache mis.

pospopcnt_u16_avx2_csa                  	instructions per cycle 2.99, cycles per 16-bit word:  0.256, instructions per 16-bit word 0.765 
min:     9854 cycles,    29492 instructions, 	       1 branch mis.,      193 cache ref.,        0 cache mis.
avg:  10114.0 cycles,  29492.0 instructions, 	     1.3 branch mis.,    288.8 cache ref.,      1.9 cache mis.

pospopcnt_u16_avx512_csa                	instructions per cycle 2.09, cycles per 16-bit word:  0.168, instructions per 16-bit word 0.350 
min:     6463 cycles,    13496 instructions, 	       0 branch mis.,      342 cache ref.,        0 cache mis.
avg:   6656.0 cycles,  13496.0 instructions, 	     1.0 branch mis.,    434.4 cache ref.,      2.3 cache mis.

pospopcnt_u16_avx512_mula2              	instructions per cycle 1.99, cycles per 16-bit word:  0.517, instructions per 16-bit word 1.029 
min:    19917 cycles,    39640 instructions, 	       1 branch mis.,       94 cache ref.,        0 cache mis.
avg:  19988.3 cycles,  39640.0 instructions, 	     1.4 branch mis.,    166.3 cache ref.,      0.2 cache mis.

Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

mklarqvist / positional-popcount

Programming Languages