All Projects → lemire → Streamvbyte

lemire / Streamvbyte

Licence: apache-2.0
Fast integer compression in C using the StreamVByte codec

Projects that are alternatives of or similar to Streamvbyte

Simdcompressionandintersection
A C++ library to compress and intersect sorted lists of integers using SIMD instructions
Stars: ✭ 289 (+48.21%)
Mutual labels:  simd, compression
Turbo-Transpose
Transpose: SIMD Integer+Floating Point Compression Filter
Stars: ✭ 50 (-74.36%)
Mutual labels:  compression, simd
ndzip
A High-Throughput Parallel Lossless Compressor for Scientific Data
Stars: ✭ 19 (-90.26%)
Mutual labels:  compression, simd
Simdcomp
A simple C library for compressing lists of integers using binary packing
Stars: ✭ 331 (+69.74%)
Mutual labels:  simd, compression
Mango
mango fun framework
Stars: ✭ 343 (+75.9%)
Mutual labels:  simd, compression
Compressed Vec
SIMD Floating point and integer compressed vector library
Stars: ✭ 25 (-87.18%)
Mutual labels:  simd, compression
Turbopfor Integer Compression
Fastest Integer Compression
Stars: ✭ 520 (+166.67%)
Mutual labels:  simd, compression
Maskedvbyte
Fast decoder for VByte-compressed integers
Stars: ✭ 91 (-53.33%)
Mutual labels:  simd, compression
Jarchivelib
A simple archiving and compression library for Java
Stars: ✭ 162 (-16.92%)
Mutual labels:  compression
Stegcloak
Hide secrets with invisible characters in plain text securely using passwords 🧙🏻‍♂️⭐
Stars: ✭ 2,379 (+1120%)
Mutual labels:  compression
Compactgui
Visual Interface for the Windows 10 Compact Function
Stars: ✭ 2,103 (+978.46%)
Mutual labels:  compression
Hrank
Pytorch implementation of our CVPR 2020 (Oral) -- HRank: Filter Pruning using High-Rank Feature Map
Stars: ✭ 164 (-15.9%)
Mutual labels:  compression
Ugm
Ubpa Graphics Mathematics
Stars: ✭ 178 (-8.72%)
Mutual labels:  simd
Computelibrary
The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
Stars: ✭ 2,123 (+988.72%)
Mutual labels:  simd
Simdjson
Parsing gigabytes of JSON per second
Stars: ✭ 15,115 (+7651.28%)
Mutual labels:  simd
Pngtastic
A pure Java PNG image optimization and manipulation library
Stars: ✭ 159 (-18.46%)
Mutual labels:  compression
Bit7z
A C++ static library offering a clean and simple interface to the 7-zip DLLs.
Stars: ✭ 159 (-18.46%)
Mutual labels:  compression
Datacompression
Swift libcompression wrapper as an extension for the Data type (GZIP, ZLIB, LZFSE, LZMA, LZ4, deflate, RFC-1950, RFC-1951, RFC-1952)
Stars: ✭ 191 (-2.05%)
Mutual labels:  compression
Decomposed
CATransform3D manipulation made easy.
Stars: ✭ 184 (-5.64%)
Mutual labels:  simd
Compress
Optimized Go Compression Packages
Stars: ✭ 2,478 (+1170.77%)
Mutual labels:  compression

streamvbyte

Build Status

StreamVByte is a new integer compression technique that applies SIMD instructions (vectorization) to Google's Group Varint approach. The net result is faster than other byte-oriented compression techniques.

The approach is patent-free, the code is available under the Apache License.

It includes fast differential coding.

It assumes a recent Intel processor (e.g., haswell or better) or an ARM processor with NEON instructions (which is almost all of them).

The code should build using most standard-compliant C99 compilers. The provided makefile expects a Linux-like system.

Users

This library is used by

Usage

Usage with Makefile:

  make
  ./unit

Usage with CMake:

The cmake build system also offers a libstreamvbyte_static.a in addition to libstreamvbyte.so.

-DCMAKE_INSTALL_PREFIX:PATH=/path/to/install is optional. Defaults to /usr/local{include,lib}

By default, the project builds with -march=native (except on MSVC), use -DSTREAMVBYTE_DISABLE_NATIVE=ON to disable.

mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release \
         -DCMAKE_INSTALL_PREFIX:PATH=/path/to/install \
make install

# run the tests like:
ctest -V

See example.c for an example.

Short code sample:

// suppose that datain is an array of uint32_t integers
size_t compsize = streamvbyte_encode(datain, N, compressedbuffer); // encoding
// here the result is stored in compressedbuffer using compsize bytes
streamvbyte_decode(compressedbuffer, recovdata, N); // decoding (fast)

If the values are sorted, then it might be preferable to use differential coding:

// suppose that datain is an array of uint32_t integers
size_t compsize = streamvbyte_delta_encode(datain, N, compressedbuffer,0); // encoding
// here the result is stored in compressedbuffer using compsize bytes
streamvbyte_delta_decode(compressedbuffer, recovdata, N,0); // decoding (fast)

You have to know how many integers were coded when you decompress. You can store this information along with the compressed stream.

Signed integers

We do not directly support signed integers, but you can use fast functions to convert signed integers to unsigned integers.


#include "streamvbyte_zigzag.h"

zigzag_encode(mysignedints, myunsignedints, number); // mysignedints => myunsignedints

zigzag_decode(myunsignedints, mysignedints, number); // myunsignedints => mysignedints

Installation

You can install the library (as a dynamic library) on your machine if you have root access:

  sudo make install

To uninstall, simply type:

  sudo make uninstall

It is recommended that you try make dyntest before proceeding.

Benchmarking

You can try to benchmark the speed in this manner:

  make perf
  ./perf

Make sure to run make test before, as a sanity test.

Technical posts

Alternative encoding

By default, Stream VByte uses 1, 2, 3 or 4 bytes per integer. In the case where you expect many of your integers to be zero, you might try the streamvbyte_encode_0124 and streamvbyte_decode_0124 which use 0, 1, 2, or 4 bytes per integer.

Stream VByte in other languages

Format Specification

We specify the format as follows.

We do not store how many integers (count) are compressed in the compressed data per se. If you want to store the data stream (e.g., to disk), you need to add this information. It is intentionally left out because, in applications, it is often the case that there are better ways to store this count.

There are two streams:

  • The data starts with an array of "control bytes". There are (count + 3) / 4 of them.
  • Following the array of control bytes, there are data bytes.

We can interpret the control bytes as a sequence of 2-bit words. The first 2-bit word is made of the least significant 2 bits in the first byte, and so forth. There are four 2-bit words written in each byte.

Starting from the first 2-bit word, we have corresponding sequence in the data bytes, written in sequence from the beginning:

  • When the 2-bit word is 00, there is a single data byte.
  • When the 2-bit words is 01, there are two data bytes.
  • When the 2-bit words is 10, there are three data bytes.
  • When the 2-bit words is 11, there are four data bytes.

The data bytes are stored using a little-endian encoding.

Consider the following example:

control bytes: [0x40 0x55 ... ]
data bytes: [0x00 0x64 0xc8 0x2c 0x01 0x90  0x01 0xf4 0x01 0x58 0x02 0xbc 0x02 ...]

The first control byte is 0x40 or the four 2-bit words : 00 00 00 01. The second control byte is 0x55 or the four 2-bit words : 01 01 01 01. Thus the first three values are given by the first three bytes: 0x00, 0x64, 0xc8 (or 0, 100, 200 in base 10). The five next values are stored using two bytes each: 0x2c 0x01, 0x90 0x01, 0xf4 0x01, 0x58 0x02, 0xbc 0x02. As little endian integers, these are to be interpreted as 300, 400, 500, 600, 700.

Thus, to recap, the sequence of integers (0,100,200,300,400,500,600,700) gets encoded as the 15 bytes 0x40 0x55 0x00 0x64 0xc8 0x2c 0x01 0x90 0x01 0xf4 0x01 0x58 0x02 0xbc 0x02.

If the countis not divisible by four, then we include a final partial group where we use zero 2-bit corresponding to no data byte.

Reference

See also

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].