traversc / Qs

Licence: gpl-3.0
Quick serialization of R objects

Using qs

Quick serialization of R objects

qs provides an interface for quickly saving and reading objects to and from disk. The goal of this package is to provide a lightning-fast and complete replacement for the saveRDS and readRDS functions in R.

Inspired by the fst package, qs uses a similar block-compression design using either the lz4 or zstd compression libraries. It differs in that it applies a more general approach for attributes and object references.

saveRDS and readRDS are the standard for serialization of R data, but these functions are not optimized for speed. On the other hand, fst is extremely fast, but only works on data frames and certain column types.

qs is both extremely fast and general: like saveRDS, it can serialize any R object, and it is just as fast as, and sometimes faster than, fst.

Usage

library(qs)
df1 <- data.frame(x=rnorm(5e6), y=sample(5e6), z=sample(letters, 5e6, replace=TRUE))
qsave(df1, "myfile.qs")
df2 <- qread("myfile.qs")

Installation

# CRAN version
install.packages("qs")

# CRAN version compile from source (recommended)
remotes::install_cran("qs", type="source", configure.args="--with-simd=AVX2")

# For earlier versions of R <= 3.4
remotes::install_github("traversc/[email protected]")

Features

The table below compares the features of different serialization approaches in R.

qs fst saveRDS
Not Slow
Numeric Vectors
Integer Vectors
Logical Vectors
Character Vectors
Character Encoding (vector-wide only)
Complex Vectors
Data.Frames
On disk row access
Random column access
Attributes Some
Lists / Nested Lists
Multi-threaded

qs also includes a number of advanced features:

  • For character vectors, qs also has the option of using the new ALTREP system (R version 3.5+) to quickly read in string data.
  • For numerical data (numeric, integer, logical and complex vectors) qs implements byte shuffling filters (adopted from the Blosc meta-compression library). These filters utilize extended CPU instruction sets (either SSE2 or AVX2).
  • qs also efficiently serializes S4 objects, environments, and other complex objects.

These features can further improve performance, by orders of magnitude for certain types of data. See the sections below for more details.

Summary Benchmarks

The following benchmarks compare qs, fst and base R's saveRDS/readRDS for serializing and de-serializing a medium-sized data.frame with 5 million rows (approximately 115 MB in memory):

data.frame(a=rnorm(5e6),
           b=rpois(5e6, 100),
           c=sample(starnames$IAU, 5e6, TRUE),
           d=sample(state.name, 5e6, TRUE),
           stringsAsFactors = FALSE)

qs is highly parameterized and can be tuned by the user to extract as much speed and compression as possible, if desired. For simplicity, qs comes with four presets that trade off speed against compression ratio: “fast”, “balanced”, “high” and “archive”.
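
As a rough illustration of how the presets trade file size for speed, the sketch below saves the same data.frame with each preset and reports the resulting file sizes. The data.frame here is a small stand-in made up for the example, not the benchmark data, and the numbers will vary by machine and data.

```r
library(qs)

# Small made-up data set for illustration (not the benchmark data).
df <- data.frame(x = rnorm(1e5),
                 y = sample(1e5),
                 z = sample(letters, 1e5, replace = TRUE))

presets <- c("fast", "balanced", "high", "archive")

# Save the same object once per preset and record the file size in bytes.
sizes <- sapply(presets, function(p) {
  f <- tempfile(fileext = ".qs")
  on.exit(unlink(f))        # clean up the temporary file afterwards
  qsave(df, f, preset = p)
  file.info(f)$size
})

print(sizes)
```

Wrapping the qsave call in system.time() gives a matching view of the speed side of the trade-off.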

The plots below summarize the performance of saveRDS, qs and fst with various parameters:

[Plot: Serializing]

[Plot: De-serializing]

(Benchmarks are based on qs ver. 0.21.2, fst ver. 0.9.0 and R 3.6.1.)

Benchmarking write and read speed is a bit tricky and depends highly on a number of factors, such as the operating system, the hardware, the distribution of the data, or even the state of the R instance. Read speed is further subject to various hardware and software memory caches.

Generally speaking, qs and fst are considerably faster than saveRDS, whether single-threaded or multi-threaded compression is used. qs also achieves a superior compression ratio through various optimizations (e.g. see the “Byte Shuffle” section below).

ALTREP character vectors

The ALTREP system (new as of R 3.5.0) allows package developers to represent R objects using their own custom memory layout. This allows a potentially large speedup in processing certain types of data.

In qs, ALTREP character vectors are implemented via the stringfish package and can be used by setting use_alt_rep=TRUE in the qread function. The benchmark below shows the time it takes to qread several million random strings (nchar = 80) with and without ALTREP.

The large speedup demonstrates why one would want to consider the system, but there are caveats. Downstream processing functions must be ALTREP-aware. See the stringfish package for more details.
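
A minimal sketch of toggling the option is shown below. The test strings are generated only for this example, and the timings it prints depend heavily on the machine and the data, so it is not a substitute for the benchmark.

```r
library(qs)

# Made-up test data: 10,000 random strings of 80 characters each.
strings <- vapply(seq_len(1e4), function(i) {
  paste(sample(letters, 80, replace = TRUE), collapse = "")
}, character(1))

f <- tempfile(fileext = ".qs")
qsave(strings, f)

# Read the same file with and without ALTREP character vectors.
t_plain  <- system.time(plain  <- qread(f))
t_altrep <- system.time(altrep <- qread(f, use_alt_rep = TRUE))

print(rbind(plain = t_plain, altrep = t_altrep)[, "elapsed"])

# Both reads should yield the same values as the original vector.
same_values <- all(plain == strings) && all(altrep == strings)
unlink(f)
```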

Byte Shuffle

Byte shuffling (adopted from the Blosc meta-compression library) is a way of re-organizing data to be more amenable to compression. An integer in R occupies four bytes and can range from -(2^31 - 1) to 2^31 - 1. However, most real data doesn’t use anywhere near the range of possible integer values. For example, if the data represented percentages from 0% to 100%, three of the four bytes in every value would always be zero.

Byte shuffling rearranges the data such that all of the first bytes are blocked together, all of the second bytes are blocked together, and so on. This procedure often makes it very easy for compression algorithms to find repeated patterns and can improve the compression ratio by orders of magnitude. In the example below, shuffle compression achieves a compression ratio of over 1000x. See ?qsave for more details.

# With byte shuffling
x <- 1:1e8
qsave(x, "mydat.qs", preset="custom", shuffle_control=15, algorithm="zstd")
cat( "Compression Ratio: ", as.numeric(object.size(x)) / file.info("mydat.qs")$size, "\n" )
# Compression Ratio:  1389.164

# Without byte shuffling
x <- 1:1e8
qsave(x, "mydat.qs", preset="custom", shuffle_control=0, algorithm="zstd")
cat( "Compression Ratio: ", as.numeric(object.size(x)) / file.info("mydat.qs")$size, "\n" )
# Compression Ratio:  1.479294 
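
The rearrangement itself can be sketched in base R. This is only an illustration of the idea using a matrix transpose and gzip via memCompress; it is not the qs or Blosc implementation, and the percentage data is made up for the example.

```r
set.seed(1)

# Made-up data: percentages (0..100) stored as 4-byte integers.
x <- sample(0:100, 1e5, replace = TRUE)
raw_bytes <- writeBin(x, raw(), size = 4, endian = "little")

# View the bytes as a 4 x n matrix: one column per integer, one row per byte position.
m <- matrix(raw_bytes, nrow = 4)

# Shuffle: transposing and flattening groups all first bytes together,
# then all second bytes, and so on.
shuffled <- as.vector(t(m))

# Unshuffle to confirm the transformation is lossless.
restored <- as.vector(t(matrix(shuffled, ncol = 4)))
stopifnot(identical(restored, raw_bytes))

# Compress both layouts with the same general-purpose compressor.
plain_size    <- length(memCompress(raw_bytes, type = "gzip"))
shuffled_size <- length(memCompress(shuffled, type = "gzip"))
cat("gzip, unshuffled:", plain_size, "bytes\n")
cat("gzip, shuffled:  ", shuffled_size, "bytes\n")
```

After shuffling, the three always-zero byte positions form one long run of zeros, which a compressor can encode in very little space.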

Serializing to memory

You can use qs to directly serialize objects to memory.

Example:

library(qs)
x <- qserialize(c(1,2,3))
qdeserialize(x)
# [1] 1 2 3

Serializing objects to ASCII

The qs package includes two sets of utility functions for converting binary data to ASCII:

  • base85_encode and base85_decode
  • base91_encode and base91_decode

These functions are similar to the base64 encoding functions found in various packages, but produce shorter output: base64 expands binary data by about 33%, base85 by 25%, and base91 by roughly 23% or less.

Example:

enc <- base91_encode(qserialize(datasets::mtcars, preset = "custom", compress_level = 22))
dec <- qdeserialize(base91_decode(enc))

(Note: base91 strings can contain double-quote characters (") and must be single-quoted when written as string literals.)

See the help files for additional details and history behind these algorithms.

Using qs within Rcpp

qs functions can be called directly within C++ code via Rcpp.

Example C++ script:

// [[Rcpp::depends(qs)]]
#include <Rcpp.h>
#include <qs.h>
using namespace Rcpp;

// [[Rcpp::export]]
void test() {
  qs::qsave(IntegerVector::create(1,2,3), "/tmp/myfile.qs", "high", "zstd", 1, 15, true, 1);
}

R side:

library(qs)
library(Rcpp)
sourceCpp("test.cpp")
# save file using Rcpp interface
test()
# read in file created through Rcpp interface
qread("/tmp/myfile.qs")
# [1] 1 2 3

The C++ functions do not have default parameters; all parameters must be specified.

Future developments

  • Additional compression algorithms
  • Improved ALTREP serialization
  • Re-write of multithreading code
  • Mac M1 optimizations (NEON) and checking

Future versions will be backwards compatible with the current version.