fstpackage / Fst

Licence: agpl-3.0

Lightning Fast Serialization of Data Frames for R

Overview

The fst package for R provides a fast, easy and flexible way to serialize data frames. With access speeds of multiple GB/s, fst is specifically designed to unlock the potential of the high-speed solid state disks found in most modern computers. Data frames stored in the fst format have full random access, both by column and by row.

The table below compares the read and write performance of the fst package to various alternatives.

Method	Format	Time (ms)	Size (MB)	Speed (MB/s)	N
readRDS	bin	1577	1000	633	112
saveRDS	bin	2042	1000	489	112
fread	csv	2925	1038	410	232
fwrite	csv	2790	1038	358	241
read_feather	bin	3950	813	253	112
write_feather	bin	1820	813	549	112
read_fst	bin	457	303	2184	282
write_fst	bin	314	303	3180	291

These benchmarks were performed on a laptop (i7 4710HQ @2.5 GHz) with a reasonably fast SSD (M.2 Samsung SM951) using the dataset defined below. Parameter Speed was calculated by dividing the in-memory size of the data frame by the measured time. These results are also visualized in the following graph:

As can be seen from the figure, the measured speeds for the fst package are very high and even exceed the maximum drive speed of the SSD used. The package accomplishes this through an effective combination of multi-threading and compression. The on-disk file sizes of fst files are also much smaller than those of the other formats tested. This is an added benefit of fst’s use of type-specific compressors on each stored column.
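The Speed column in the table above is described as the in-memory size of the data frame divided by the measured time. A minimal sketch of that calculation in base R follows. The helper name `throughput_mb_s` is hypothetical, and the choice of 1 MB = 1e6 bytes is an assumption, since the benchmark text does not state which definition was used.

```r
# Hypothetical helper: throughput in MB/s from a size in bytes and a time in ms.
# Assumption: 1 MB = 1e6 bytes (the original benchmark does not say).
throughput_mb_s <- function(size_bytes, elapsed_ms) {
  (size_bytes / 1e6) / (elapsed_ms / 1000)
}

# Small stand-in data frame, much smaller than the benchmark dataset
df_small <- data.frame(
  x = runif(1e5),
  y = sample(1L:100L, 1e5, replace = TRUE)
)

# In-memory size as reported by R
size_bytes <- as.numeric(object.size(df_small))

# Time a single write with base R's saveRDS (uncompressed)
path <- tempfile(fileext = ".rds")
elapsed_ms <- system.time(saveRDS(df_small, path, compress = FALSE))[["elapsed"]] * 1000

# A very fast write can report 0 ms at the timer's resolution, giving Inf
speed <- throughput_mb_s(size_bytes, elapsed_ms)
print(speed)
```

A single timing on a small data frame is noisy; a real benchmark would repeat each measurement many times, as the N column in the table suggests.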

In addition to methods for data frame serialization, fst also provides methods for multi-threaded in-memory compression with the popular LZ4 and ZSTD compressors and an extremely fast multi-threaded hasher.
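As an illustration of the in-memory methods mentioned above, the following sketch compresses a raw vector with `compress_fst`, restores it with `decompress_fst`, and hashes it with `hash_fst`. The input data and the compression level of 50 are arbitrary choices made for this example.

```r
library(fst)

# Any raw vector will do; here a serialized built-in dataset is used
raw_in <- serialize(mtcars, NULL)

# Compress in memory with ZSTD at a mid-range setting (0 to 100)
compressed <- compress_fst(raw_in, compressor = "ZSTD", compression = 50)

# Decompress and check the round trip
restored <- decompress_fst(compressed)
round_trip_ok <- identical(raw_in, restored)
print(round_trip_ok)

# Hash of the raw vector
hash_value <- hash_fst(raw_in)
print(hash_value)
```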

Multi-threading

The fst package relies heavily on multi-threading to boost the read and write speeds of data frames. To maximize throughput, fst compresses and decompresses data in the background and tries to keep the disk busy writing and reading data at the same time.
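The number of threads can be inspected and changed with `threads_fst`. The sketch below assumes the package was built with OpenMP support. Without it, the thread count may stay at 1 regardless of the requested value.

```r
library(fst)

# Query the number of threads fst currently uses
n_before <- threads_fst()
print(n_before)

# Limit fst to two threads, for example on a shared machine
threads_fst(2)

# ... reads and writes here would use at most two threads ...

# Restore the earlier setting
threads_fst(n_before)
```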

Installation

The easiest way to install the package is from CRAN:

install.packages("fst")

You can also use the development version from GitHub:

# install.packages("devtools")
devtools::install_github("fstPackage/fst", ref = "develop")

Basic usage

Using fst is simple. Data can be stored and retrieved using methods write_fst and read_fst:

library(fst)

# Generate a random data frame with 10 million rows and various column types
nr_of_rows <- 1e7

df <- data.frame(
  Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),
  Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
  Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
  Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
)

# Store the data frame to disk
write_fst(df, "dataset.fst")

# Retrieve the data frame again
df <- read_fst("dataset.fst")

Note: the dataset defined in this example code was also used to obtain the benchmark results shown in the introduction.

Random access

The fst file format provides full random access to stored datasets. You can retrieve a selection of columns and rows with:

df_subset <- read_fst("dataset.fst", columns = c("Logical", "Factor"), from = 2000, to = 5000)

This reads rows 2000 to 5000 from columns Logical and Factor without actually touching any other data in the stored file. That means that a subset can be read from file without reading the complete file first. This is different from, say, readRDS or read_feather where you have to read the complete file or column before you can make a subset.
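The following self-contained sketch shows the same kind of partial read on a small temporary file and compares the result with the equivalent in-memory subset. The data frame `df_demo` and its column names are made up for this example.

```r
library(fst)

set.seed(1)
n <- 10000

# Small illustrative data frame (hypothetical columns)
df_demo <- data.frame(
  id = seq_len(n),
  value = rnorm(n),
  group = factor(sample(letters[1:3], n, replace = TRUE))
)

path <- tempfile(fileext = ".fst")
write_fst(df_demo, path)

# Inspect column names and types without loading the data
meta <- metadata_fst(path)
print(meta)

# Read only two columns and rows 2000 to 5000
part <- read_fst(path, columns = c("id", "group"), from = 2000, to = 5000)

# The same subset taken from the in-memory data frame
expected <- df_demo[2000:5000, c("id", "group")]
rownames(expected) <- NULL

same <- all.equal(part, expected)
print(same)
```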

Compression

For compression, the excellent and speedy LZ4 and ZSTD compression algorithms are used. These compressors, in combination with type-specific bit filters, enable fst to achieve high compression speeds at reasonable compression factors. The compression factor can be tuned from 0 (minimum) to 100 (maximum):

write_fst(df, "dataset.fst", compress = 100)  # use maximum compression
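To see how the compression setting affects file size, a small sketch can write the same data frame at several settings and compare the results. The data frame here is a reduced stand-in for the benchmark dataset, and the three settings are arbitrary sample points.

```r
library(fst)

set.seed(42)
n <- 1e5

# Reduced stand-in for the benchmark dataset
df_demo <- data.frame(
  Integer = sample(1L:100L, n, replace = TRUE),
  Real = sample(sample(1:10000, 20) / 100, n, replace = TRUE)
)

# Write the same data at three compression settings and record the file sizes
settings <- c(0, 50, 100)
sizes <- sapply(settings, function(level) {
  path <- tempfile(fileext = ".fst")
  write_fst(df_demo, path, compress = level)
  file.size(path)
})
names(sizes) <- settings

# File sizes in bytes, by compression setting
print(sizes)
```

Write and read times can be measured in the same loop with `system.time` to see the trade-off between speed and size on a given machine.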

Compression reduces the size of the fst file that holds your data. But because the compression and decompression are done on background threads, it can increase the total read and write speed as well. The graph below shows how the use of multiple threads enhances the read and write speed of our sample dataset.

The csv format used by the fread and fwrite methods of package data.table is actually a human-readable text format and not a binary format. Normally, binary formats would be much faster than the csv format, because csv takes more space on disk, is row based, uncompressed and needs to be parsed into a computer-native format to have any meaning. So any serializer that’s working on csv has an enormous disadvantage as compared to binary formats. Yet, the results show that data.table is on par with binary formats and when more threads are used, it can even be faster. Because of this impressive performance, it was included in the graph for comparison.

Bindings in other languages

Julia: FstFileFormat.jl, a naive Julia binding using RCall.jl

Note to users: From CRAN release v0.8.0, the fst format is stable and backwards compatible. That means that all fst files generated with package v0.8.0 or later can be read by future versions of the package.
