Piezoid / pugz

License: MIT
Truly parallel gzip decompression


Projects that are alternatives of or similar to pugz

pyrus-cramjam
Thin Python wrapper to de/compression algorithms in Rust - lightweight & no dependencies
Stars: ✭ 40 (-58.33%)
Mutual labels:  gzip, decompression
EasyCompressor
⚡ A compression library that implements many compression algorithms such as LZ4, Zstd, LZMA, Snappy, Brotli, GZip, and Deflate. It helps you to improve performance by reducing Memory Usage and Network Traffic for caching.
Stars: ✭ 167 (+73.96%)
Mutual labels:  gzip, decompression
Util
A collection of useful utility functions
Stars: ✭ 201 (+109.38%)
Mutual labels:  parallel, decompression
Pgbackrest
Reliable PostgreSQL Backup & Restore
Stars: ✭ 766 (+697.92%)
Mutual labels:  gzip, parallel
Libarchivejs
Archive library for browsers
Stars: ✭ 145 (+51.04%)
Mutual labels:  gzip, decompression
power-gzip
POWER9 gzip engine documentation and code samples
Stars: ✭ 16 (-83.33%)
Mutual labels:  gzip, decompression
Compress
Optimized Go Compression Packages
Stars: ✭ 2,478 (+2481.25%)
Mutual labels:  gzip, decompression
Aspnetcore Request Decompression
HTTP request decompression middleware for ASP.NET Core
Stars: ✭ 51 (-46.87%)
Mutual labels:  gzip, decompression
Tinf
Tiny inflate library (inflate, gzip, zlib)
Stars: ✭ 57 (-40.62%)
Mutual labels:  gzip, decompression
lambdafs
Efficient (de)compression package for AWS Lambda
Stars: ✭ 24 (-75%)
Mutual labels:  gzip, decompression
bled
Base Library for Easy Decompression
Stars: ✭ 21 (-78.12%)
Mutual labels:  gzip, decompression
k-means
Code accompanying my blog post on k-means in Python, C++ and CUDA
Stars: ✭ 56 (-41.67%)
Mutual labels:  parallel
shortcut-comparison
Performance comparison of parallel Rust and C++
Stars: ✭ 74 (-22.92%)
Mutual labels:  parallel
executive
🕴Elegant command execution for Node.
Stars: ✭ 37 (-61.46%)
Mutual labels:  parallel
asynckit
Minimal async jobs utility library, with streams support
Stars: ✭ 21 (-78.12%)
Mutual labels:  parallel
distributed
Library to provide Erlang style distributed computations. This library is inspired by Cloud Haskell.
Stars: ✭ 49 (-48.96%)
Mutual labels:  parallel
future.batchtools
🚀 R package future.batchtools: A Future API for Parallel and Distributed Processing using batchtools
Stars: ✭ 77 (-19.79%)
Mutual labels:  parallel
pblat
parallelized blat with multi-threads support
Stars: ✭ 34 (-64.58%)
Mutual labels:  parallel
run-parallel-limit
Run an array of functions in parallel, but limit the number of tasks executing at the same time
Stars: ✭ 70 (-27.08%)
Mutual labels:  parallel
multipart-download
Speed up download of a single file with multiple HTTP GET connections running in parallel
Stars: ✭ 29 (-69.79%)
Mutual labels:  parallel

pugz

Parallel decompression of gzipped text files.

Decompresses text files with a truly parallel algorithm in two passes (see the paper cited below for details).

Getting Started

A Linux system on a recent x86_64 CPU is required.

Installing

Type:

make

For maximal performance, disable assertions with:

make asserts=0

Usage

./gunzip -t 8 file.gz

Counting lines is considerably faster, because no thread synchronization is needed:

./gunzip -l -t 8 file.gz

Test

We provide a small example:

cd example
bash test.sh

Decompression speed benchmark

Speed is measured in compressed input bytes processed per second.

Threads  gunzip   pugz, full decompression  pugz, only counting lines
1        55 MB/s  147 MB/s                  145 MB/s
3        -        205 MB/s                  291 MB/s
6        -        228 MB/s                  515 MB/s
12       -        248 MB/s                  769 MB/s
24       -        251 MB/s                  1052 MB/s

Specs: dual Xeon X5675 (2× 6C/12T), 32 GB RAM (NUMA), SSD

On a more recent desktop computer (i7-4770 4C/8T, 16 GB RAM, SSD):

Threads  gunzip   pugz, full decompression  pugz, only counting lines
1        63 MB/s  172 MB/s                  170 MB/s
8        -        376 MB/s                  565 MB/s

Tested on a 2.7 GB compressed file, with similar results on a 24 GB file.

Script: https://github.com/Piezoid/pugz/blob/master/example/bigger_benchmark.sh

  • Note that the synchronization required for writing to standard output in order (the "pugz, full decompression" column) significantly diminishes the speedup. It is not needed if your application can process chunks out of order, and better I/O handling could also improve this in the future.

  • Contrary to gzip, we do not perform the CRC32 calculation; it would inflict roughly a 33% slowdown.

Algorithm overview

Contrary to the pigz program, which performs only single-threaded decompression (see https://github.com/madler/pigz/blob/master/pigz.c#L232), pugz decompresses truly in parallel. In a nutshell: the compressed file is split into consecutive sections, processed one after the other. Each section is in turn split into chunks (one chunk per thread) that are decompressed in parallel. A first pass decompresses the chunks and keeps track of back-references (see e.g. our paper for a definition of that term), but cannot yet resolve them. A quick sequential pass then resolves the contexts of all chunks. A final parallel pass translates all unresolved back-references and outputs the file.
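
The control flow might look like the following minimal sketch of the three-pass scheme. All names here (Chunk, first_pass, resolve_context, second_pass) are hypothetical illustrations, not pugz's actual internals:

// Hedged sketch of the three-pass scheme described above; pugz's real code
// is organized differently.
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct Chunk {
    // Decompressed text for this chunk, still containing placeholders for
    // back-references that point into the previous chunk's unknown window.
    std::vector<unsigned char> text;
};

void first_pass(Chunk& c) { /* inflate, recording unresolved back-references */ }
void resolve_context(const Chunk& prev, Chunk& c) { /* propagate the 32 KiB window */ }
void second_pass(Chunk& c) { /* translate placeholders and emit the text */ }

void decompress_section(std::vector<Chunk>& chunks) {
    // Pass 1 (parallel): decompress every chunk independently.
    std::vector<std::thread> workers;
    for (Chunk& c : chunks) workers.emplace_back(first_pass, std::ref(c));
    for (std::thread& t : workers) t.join();
    workers.clear();

    // Pass 2 (sequential, quick): resolve each chunk's context in order.
    for (std::size_t i = 1; i < chunks.size(); ++i)
        resolve_context(chunks[i - 1], chunks[i]);

    // Pass 3 (parallel): translate remaining back-references and output.
    for (Chunk& c : chunks) workers.emplace_back(second_pass, std::ref(c));
    for (std::thread& t : workers) t.join();
}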

Roadmap/TODOs

This is a proof-of-concept prototype, so expect some rough edges.

If pugz chokes on some of your large files that you are willing to share, please file an issue!

  • Pugz is not yet a production-ready gzip decompressor: it may still crash on some files, or produce undefined behavior when compiled with make asserts=0, because blocked/multipart files are not currently supported (support is planned).

  • Only text files with ASCII characters in the range ['\t', '~'] are supported. For example, .tar.gz files are binary and thus won't work. Binary files are unsupported for two reasons: 1) we optimized the guessing of block positions for ASCII files (resulting in fewer false positives when scanning the bitstream for a deflate block), and 2) we optimized the code to encode unresolved back-references in 8 bits alongside the decompressed text (see the encoding sketch after this list). Both are merely optimizations, so we think binary decompression is eventually conceivable.

  • This codebase is currently only a standalone decompression program, but we would like to turn it into a library with some sort of API (e.g. parallel_gzread(), see issue #6 for discussion and the sketch after this list) in order to facilitate integration into your favorite software. Right now, the code is a mix between the libdeflate code base (C with gotos) and prototyped C++. It is mostly organized as a header library; however, since the source is quite large, we don't think this is the best way to distribute it. A middle ground would be a PIMPL approach with a virtual ABI and some utility wrappers.

  • Proper error handling is nonexistent (the code relies on assertions). Propagating errors between threads is hard, but it must be done eventually.

  • An index file could be generated and reused for faster random access in multi-pass scenarios.
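
To illustrate the 8-bit back-reference encoding mentioned above: because valid text bytes stay within ['\t', '~'], the high bit of an output byte is free to mark placeholders. The scheme below is an illustrative assumption, not pugz's actual format:

#include <cstdint>
#include <vector>

// Valid text bytes lie in 0x09..0x7E, so a byte with the high bit set can
// never be literal text; it is free to flag an unresolved back-reference
// in the decompressed buffer. (Hypothetical encoding, for illustration.)
constexpr std::uint8_t kUnresolvedFlag = 0x80;

inline bool is_unresolved(std::uint8_t b) { return (b & kUnresolvedFlag) != 0; }

// Append a literal text byte to the chunk's output.
inline void emit_literal(std::vector<std::uint8_t>& out, char c) {
    out.push_back(static_cast<std::uint8_t>(c));
}

// Append a placeholder: the low 7 bits index side metadata describing where
// in the previous chunk's 32 KiB window the real data lives.
inline void emit_unresolved(std::vector<std::uint8_t>& out, std::uint8_t slot) {
    out.push_back(static_cast<std::uint8_t>(kUnresolvedFlag | (slot & 0x7F)));
}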
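
For the library direction, a zlib-style entry point could take the following shape. This is a sketch only: apart from parallel_gzread(), which is named in the issue #6 discussion, every identifier here is hypothetical:

#include <cstddef>

// Opaque handle hiding the mixed C/C++ internals behind a stable ABI
// (the PIMPL approach mentioned above). Nothing below exists in pugz yet.
struct pugz_reader;

pugz_reader* pugz_open(const char* path, int n_threads);

// Fill `buf` with up to `len` decompressed bytes. Chunks are produced by
// several threads internally but delivered to the caller in order.
// Returns the number of bytes read, 0 at end of file, or <0 on error.
long parallel_gzread(pugz_reader* reader, void* buf, std::size_t len);

void pugz_close(pugz_reader* reader);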

License

This project is licensed under the MIT License.

Citation

@inproceedings{pugz,
  author    = {Kerbiriou, Mael and Chikhi, Rayan},
  title     = {Parallel Decompression of Gzip-Compressed Files and Random Access to DNA Sequences},
  booktitle = {2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)},
  year      = {2019},
  month     = {05},
  pages     = {209--217},
  doi       = {10.1109/IPDPSW.2019.00042}
}

Acknowledgements

ebiggers for writing libdeflate
