
siara-cc / Unishox2

License: Apache-2.0
Compression for Unicode short strings

Unishox: A hybrid encoder for Short Unicode Strings

General-purpose compression utilities such as zip and gzip do not compress short strings well and often expand them. They also use a lot of memory, which makes them unusable in constrained environments such as Arduino. The Unishox algorithm was therefore developed for compressing (and decompressing) short strings individually.

This is a C/C++ library. See here for the CPython version and here for the JavaScript version, which is interoperable with this library.

The contenders for Unishox are Smaz, Shoco, Unicode.org's SCSU and BOCU (implementations here and here) and AIMCS (Implementation here).

Note: Unishox provides the best compression for short text and is not to be compared with general-purpose compression algorithms such as lz4, snappy, lzma, brotli and zstd.

Applications

  • Faster transfer of text over low-speed networks such as LoRa or BLE
  • Compression for low memory devices such as Arduino and ESP8266
  • Compression of Chat application text exchange including Emojis
  • Storing compressed text in database
  • Bandwidth and storage cost reduction for Cloud

Unishox3 Alpha

The next version, Unishox3, includes multi-level static dictionaries residing in RAM or Flash memory and provides much better compression than Unishox2. A preview is available in the Unishox3_Alpha folder, along with a Makefile. To compile it, use the following steps:

cd Unishox3_Alpha
make
../usx3 "The quick brown fox jumped over the lazy dog"

This is only a preview; the specification and dictionaries are expected to change before Unishox3 is released. However, this folder will be retained, so anyone who used it to compress strings can still use it to decompress them.

Unishox2 will still be supported for cases where space for storing static dictionaries is an issue.

How it works

Unishox is a hybrid encoder that combines entropy, dictionary and delta coding. It assigns fixed prefix-free codes to each letter in the character set (entropy coding), encodes repeating letter sets separately (dictionary coding), and uses delta coding for Unicode characters.

The model used for arriving at the prefix-free code is shown below:

[Figure: model used for arriving at the prefix-free codes]

The complete specification can be found in this article: A hybrid encoder for compressing Short Unicode Strings. This can also be found at figshare here with DOI 10.6084/m9.figshare.17056334.v2.

Compiling

To compile, just use make, or use gcc as follows:

gcc -std=c99 -o test_unishox2 test_unishox2.c unishox2.c

Unit tests (automated)

For testing the compiled program, use:

./test_unishox2 -t

This invokes the run_unit_tests() function in test_unishox2.c, which tests all the features of Unishox2, including edge cases, using 159 strings covering several languages, emojis and binary data.

Further, the CI pipeline at .github/workflows/c-cpp.yml runs these tests for all presets and also tests file compression for the different types of files in sample_texts folder. This happens whenever a commit is made to the repository.

API

int unishox2_compress_simple(const char *in, int len, char *out);
int unishox2_decompress_simple(const char *in, int len, char *out);

Usage

To see Unishox in action, simply try to compress a string:

./test_unishox2 "Hello World"

To compress and decompress a file, use:

./test_unishox2 -c <input_file> <compressed_file>
./test_unishox2 -d <compressed_file> <decompressed_file>

Note: Unishox works well for text content up to a few kilobytes. It does not give good ratios when compressing large files or binary files.

Character Set

Unishox supports the entire Unicode character set. As of now it supports UTF-8 as input and output encoding.

Achieving better overall compression

Unishox is designed for short texts, which other methods do not compress well. Better overall compression can therefore be achieved with logic like the following; the magic bit(s) at the beginning of the compressed bytes can be used to tell Unishox output apart from that of other methods:

if (size < 1024)
    output = compress_with_unishox(input);
else
    output = compress_with_any_other(input);

The threshold size of 1024 is arbitrary. If speed is not a concern, it is also possible to compress with both methods and keep the smaller result.

Interoperability with the JS Library

Strings compressed with this library can be decompressed with the JS Library and vice versa. However, please see this section in the documentation for usage.

Projects that use Unishox

Credits

Versions

The present byte-code version is 2 and it replaces Unishox 1. Unishox 1 is still available as unishox1.c, but it will have to be compiled manually if it is needed.

The next version will be Unishox3, which will include multi-level static dictionaries residing in RAM or Flash memory that greatly improve compression ratios compared to Unishox2. However, Unishox2 will still be supported for cases where space for storing static dictionaries is an issue.

Issues

In case of any issues, please email the author (Arundale Ramanathan) at [email protected] or create a GitHub issue.
