coolbutuseless / lz4lite

Licence: other
Very Fast compression/decompression of in-memory numeric vectors with LZ4

lz4lite

Lifecycle: experimental

lz4lite provides access to the extremely fast compression in lz4 for performing in-memory compression.

As of v0.2.0, lz4lite can now serialize and compress any R object understood by base::serialize().

If the input is known to be an atomic, numeric vector, and you do not care about any attributes or names on this vector, then lz4_compress()/lz4_uncompress() can be used. These are bespoke serialization routines for atomic numeric vectors that run faster since they avoid R’s internals.

For a more general solution to fast serialization of R objects, see the fst or qs packages.

The lz4 code currently bundled with this package is v1.9.3.

What’s in the box

  • For arbitrary R objects
    • lz4_serialize()/lz4_unserialize() serialize and compress any R object.
  • For atomic vectors with numeric values
    • lz4_compress()/lz4_uncompress()
      • compress the data within a vector of raw, integer, real, complex or logical values
      • faster than lz4_serialize()/lz4_unserialize(), but discard all attributes (e.g. names, dims)
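
One way to decide between the two families is to check whether the object is a plain atomic vector with no attributes to lose. The helper below is a minimal sketch in base R only; `can_use_fast_path()` is a hypothetical name for illustration and is not part of lz4lite.

```r
# Hypothetical helper (not part of lz4lite): TRUE when `x` is an atomic
# vector of one of the listed types and carries no attributes, so nothing
# would be lost by a routine that stores only the raw data.
can_use_fast_path <- function(x) {
  is.atomic(x) &&
    typeof(x) %in% c("raw", "integer", "double", "complex", "logical") &&
    is.null(attributes(x))
}

plain_vec <- c(1.5, 2.5, 3.5)     # plain double vector
named_vec <- c(a = 1, b = 2)      # has a names attribute
int_mat   <- matrix(1:4, nrow = 2) # has a dim attribute

can_use_fast_path(plain_vec)  # TRUE
can_use_fast_path(named_vec)  # FALSE
can_use_fast_path(int_mat)    # FALSE
can_use_fast_path(letters)    # FALSE: character vectors are not numeric data
```

When the check fails, the general serialize-based routines are the safer choice, since they keep names, dims and other attributes.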

Installation

You can install from GitHub with:

# install.packages('remotes')
remotes::install_github('coolbutuseless/lz4lite')

Basic usage of lz4lite

library(lz4lite)

dat <- mtcars

buf <- lz4_serialize(dat)
length(buf) # Number of bytes
#> [1] 1862
# compression ratio
length(buf)/length(serialize(dat, NULL))
#> [1] 0.489099
head(lz4_unserialize(buf))
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Compressing 5 million integers

library(lz4lite)

max_hc <- 12

set.seed(1)
N                <- 5e6
input_ints       <- sample(1:3, N, prob = (1:3)^3, replace = TRUE)
serialize_base   <- serialize(input_ints, NULL, xdr = FALSE)
serialize_lo     <- lz4_serialize(input_ints, acceleration = 1)
serialize_hi_3   <- lz4hc_serialize(input_ints, level =  3)
serialize_hi_9   <- lz4hc_serialize(input_ints, level =  9)
serialize_hi_12  <- lz4hc_serialize(input_ints, level = max_hc)
compress_lo      <- lz4_compress(input_ints, acceleration = 1)
compress_hi_3    <- lz4hc_compress(input_ints, level = 3)
compress_hi_9    <- lz4hc_compress(input_ints, level = 9)
compress_hi_12   <- lz4hc_compress(input_ints, level = max_hc)

res <- bench::mark(
  serialize(input_ints, NULL, xdr = FALSE),
  lz4_serialize(input_ints, acceleration = 1),
  lz4hc_serialize(input_ints, level =  3),
  lz4hc_serialize(input_ints, level =  9),
  lz4hc_serialize(input_ints, level = max_hc),
  lz4_compress (input_ints, acceleration = 1),
  lz4hc_compress (input_ints, level =  3),
  lz4hc_compress (input_ints, level =  9),
  lz4hc_compress (input_ints, level = max_hc),
  check = FALSE
)
expression                                   median    itr/sec  MB/s    compression_ratio
serialize(input_ints, NULL, xdr = FALSE)     18.99ms   50       1004.5  1.000
lz4_serialize(input_ints, acceleration = 1)  30.58ms   32       623.7   0.222
lz4hc_serialize(input_ints, level = 3)       215.84ms  5        88.4    0.155
lz4hc_serialize(input_ints, level = 9)       3.28s     0        5.8     0.088
lz4hc_serialize(input_ints, level = max_hc)  36.09s    0        0.5     0.063
lz4_compress(input_ints, acceleration = 1)   24.16ms   41       789.4   0.222
lz4hc_compress(input_ints, level = 3)        208.71ms  5        91.4    0.155
lz4hc_compress(input_ints, level = 9)        3.28s     0        5.8     0.088
lz4hc_compress(input_ints, level = max_hc)   36.36s    0        0.5     0.063
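
As a sanity check on how to read these columns, the arithmetic below reproduces the MB/s figure for the base serialize() row. It rests on two assumptions that the README does not state: that MB/s is the uncompressed size of the input divided by the median time, and that one MB is 2^20 bytes. The variable names are illustrative only.

```r
# Assumption: throughput = uncompressed input size / median time,
# with 1 MB taken as 2^20 bytes. Neither is stated in the README.
n_ints     <- 5e6    # N in the setup code
bytes_each <- 4      # an R integer occupies 4 bytes
size_mb    <- n_ints * bytes_each / 2^20   # about 19.07 MB of raw data

median_sec <- 18.99 / 1000                 # median for base serialize(), in seconds
throughput <- size_mb / median_sec         # close to the reported 1004.5 MB/s

# compression_ratio is compressed size / uncompressed size, so a ratio of
# 0.222 corresponds to roughly this many MB after compression:
approx_compressed_mb <- 0.222 * size_mb

round(throughput, 1)
round(approx_compressed_mb, 2)
```

Under the same assumptions, lower compression ratios mean smaller output, and the lz4hc levels trade a large increase in compression time for that smaller output.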

Uncompressing 5 million integers

Uncompression speed varies with the compressed size: in this run, the buffer compressed at level 12 uncompressed in roughly half the time of the others.

res <- bench::mark(
  lz4_uncompress(compress_lo),
  lz4_uncompress(compress_hi_3),
  lz4_uncompress(compress_hi_9),
  lz4_uncompress(compress_hi_12)
)
expression                      median   itr/sec  MB/s
lz4_uncompress(compress_lo)     12.26ms  79       1555.4
lz4_uncompress(compress_hi_3)   12.37ms  70       1542.4
lz4_uncompress(compress_hi_9)   12.97ms  94       1470.4
lz4_uncompress(compress_hi_12)  6.03ms   121      3161.8

Unserializing 5 million integers

Unserialization speed varies slightly depending upon the compressed size.

res <- bench::mark(
  unserialize(serialize_base),
  lz4_unserialize(serialize_lo),
  lz4_unserialize(serialize_hi_3),
  lz4_unserialize(serialize_hi_9),
  lz4_unserialize(serialize_hi_12)
)
expression                        median   itr/sec  MB/s
unserialize(serialize_base)       6.64ms   120      2871.9
lz4_unserialize(serialize_lo)     29.8ms   38       640.0
lz4_unserialize(serialize_hi_3)   29.38ms  39       649.3
lz4_unserialize(serialize_hi_9)   24.97ms  48       763.8
lz4_unserialize(serialize_hi_12)  23.87ms  49       799.0

Technical bits

Framing of the compressed data

  • lz4lite does not use the standard LZ4 frame to store data.
  • The compressed representation is the compressed data prefixed with a custom 8-byte header consisting of
    • 3 bytes: the characters ‘LZ4’
    • 1 byte: 0x00 if the data was produced with lz4_serialize(); otherwise a byte representing the SEXP type of the encoded object
    • 4 bytes: the length of the original uncompressed data, in bytes
  • This data representation
    • is not compatible with the standard LZ4 frame format.
    • is likely to evolve, so for now do not plan on compressing something with one version of lz4lite and uncompressing it with another.
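
The header layout above can be sketched in base R as follows. This is an illustration only, not the package's own code: `make_header()` and `parse_header()` are hypothetical names, and the little-endian byte order of the 4-byte length is an assumption, since the README does not specify it.

```r
# Hypothetical helpers (not part of lz4lite) that build and read an
# 8-byte header with the layout described above.
# Assumption: the 4-byte length is little-endian; the README does not say.
make_header <- function(type_byte, n_bytes) {
  c(
    charToRaw("LZ4"),                      # 3 magic bytes
    as.raw(type_byte),                     # 0x00 for lz4_serialize() output
    writeBin(as.integer(n_bytes), raw(),   # uncompressed length in bytes
             size = 4, endian = "little")
  )
}

parse_header <- function(buf) {
  stopifnot(is.raw(buf), length(buf) >= 8)
  if (!identical(rawToChar(buf[1:3]), "LZ4")) {
    stop("bad magic bytes: not a buffer with this header layout")
  }
  list(
    type_byte = as.integer(buf[4]),
    n_bytes   = readBin(buf[5:8], what = "integer", size = 4, endian = "little")
  )
}

# Header for 5 million 4-byte integers produced by a serialize-style call
hdr  <- make_header(type_byte = 0, n_bytes = 5e6 * 4)
info <- parse_header(hdr)
info$n_bytes
```

Checking the magic bytes before reading the length gives a cheap way to reject buffers that were not written with this layout.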

Related Software

  • lz4 and zstd - both by Yann Collet
  • fst for serialization of data.frames using lz4 and zstd
  • qs for fast serialization of arbitrary R objects with lz4 and zstd

Acknowledgements

  • Yann Collet for releasing, maintaining and advancing lz4 and zstd
  • R Core for developing and maintaining such a wonderful language.
  • CRAN maintainers, for patiently shepherding packages onto CRAN and maintaining the repository