
jorgecarleitao / parquet2

Licence: other
Fastest and safest Rust implementation of parquet. `unsafe` free. Integration-tested against pyarrow

Programming Languages

rust
11053 projects
python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to parquet2

IMCtermite
Enables extraction of measurement data from binary files with extension 'raw' used by proprietary software imcFAMOS/imcSTUDIO and facilitates its storage in open source file formats
Stars: ✭ 20 (-87.26%)
Mutual labels:  parquet
multissh
A multiprocessing library written in Python, utilising Paramiko.
Stars: ✭ 34 (-78.34%)
Mutual labels:  parallelism
simplecov-parallel
Parallelism support for SimpleCov, currently only for CircleCI 1.0
Stars: ✭ 31 (-80.25%)
Mutual labels:  parallelism
odbc2parquet
A command line tool to query an ODBC data source and write the result into a parquet file.
Stars: ✭ 95 (-39.49%)
Mutual labels:  parquet
Parquet.jl
Julia implementation of Parquet columnar file format reader
Stars: ✭ 93 (-40.76%)
Mutual labels:  parquet
SafeObject
Handles iOS crash exceptions, guarding against array out-of-bounds accesses and nil values in dictionaries
Stars: ✭ 84 (-46.5%)
Mutual labels:  safe
db-safedelete
Attempts a force delete; if that fails, falls back to a soft delete
Stars: ✭ 16 (-89.81%)
Mutual labels:  safe
gemini
Sci-Fi galaxy simulation with heavy procedural generation focus
Stars: ✭ 25 (-84.08%)
Mutual labels:  parallelism
DaFlow
An Apache Spark-based data flow (ETL) framework that supports multiple types of read and write destinations and multiple categories of transformation rules.
Stars: ✭ 24 (-84.71%)
Mutual labels:  parquet
detox
distributed tox (tox plugin to run testenvs in parallel)
Stars: ✭ 48 (-69.43%)
Mutual labels:  parallelism
hadoop-etl-udfs
The Hadoop ETL UDFs are the main way to load data from Hadoop into EXASOL
Stars: ✭ 17 (-89.17%)
Mutual labels:  parquet
wasp
WASP is a framework to build complex real-time big data applications. It relies on a kind of Kappa/Lambda architecture, mainly leveraging Kafka and Spark. If you need to ingest huge amounts of heterogeneous data and analyze them through complex pipelines, this is the framework for you.
Stars: ✭ 19 (-87.9%)
Mutual labels:  parquet
safe-typeorm
TypeORM helper library enhancing safety at the compilation level
Stars: ✭ 160 (+1.91%)
Mutual labels:  safe
java-multithread
Code written for the Multithreading with Java course on the RinaldoDev YouTube channel.
Stars: ✭ 24 (-84.71%)
Mutual labels:  parallelism
parquet-usql
A custom extractor designed to read parquet for Azure Data Lake Analytics
Stars: ✭ 13 (-91.72%)
Mutual labels:  parquet
databricks-notebooks
Collection of Databricks and Jupyter Notebooks
Stars: ✭ 19 (-87.9%)
Mutual labels:  parquet
NPB-CPP
NAS Parallel Benchmark Kernels in C/C++. The parallel versions are in FastFlow, TBB, and OpenMP.
Stars: ✭ 18 (-88.54%)
Mutual labels:  parallelism
golang-101
🍺 In-depth internals, my personal notes, example codes and projects. Includes - Thousands of codes, OOP, Concurrency, Parallelism, Goroutines, Mutexes & Wait Groups, Testing in Go, Go tool chain, Backend web development, Some projects including Log file parser using bufio.Scanner, Spam Masker, Retro led clock, Console animations, Dictionary pro…
Stars: ✭ 61 (-61.15%)
Mutual labels:  parallelism
gologger
A concurrent, fast, queue/service-worker-based filesystem logging system, perfect for servers with concurrent connections
Stars: ✭ 16 (-89.81%)
Mutual labels:  safe
Spark
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning or SQL workloads requiring fast iterative access to datasets. This project provides sample programs for Spark in the Scala language.
Stars: ✭ 55 (-64.97%)
Mutual labels:  parquet

Parquet2

This is a re-write of the official parquet crate with performance, parallelism and safety in mind.

Check out the guide for details on how to use this crate to read parquet.

The main differentiators in comparison with parquet are:

  • it uses #![forbid(unsafe_code)] (see the snippet just below)
  • it delegates parallelism downstream
  • it decouples reading (IO-intensive) from computing (CPU-intensive)
  • it is faster (10-20x when reading to the arrow format)
  • it supports async read and write
  • it is integration-tested against pyarrow and (py)spark 3
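
For reference, #![forbid(unsafe_code)] is a standard crate-level attribute: placed at the top of lib.rs, it makes the compiler reject any unsafe block anywhere in the crate.

// lib.rs (crate root)
#![forbid(unsafe_code)]
// any `unsafe { ... }` block in this crate is now a compile-time error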

The overall idea is to offer consumers the ability to read compressed parquet pages and a toolkit to decompress them into their favourite in-memory format.

This allows this crate's iterators to perform minimal CPU work, thereby maximizing throughput. It is up to the consumers to decide whether they want to take advantage of this through parallelism at the expense of memory usage (e.g. decompress and deserialize pages in threads) or not.

This crate cannot be used directly to read parquet (except metadata). To read data from parquet, check out arrow2.
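
As a minimal sketch of that metadata-only path, assuming the read module exposes a read_metadata entry point (the exact signature is not spelled out in this README) and reusing the metadata.row_groups field from the parallelism example further below; the file name is made up:

use std::fs::File;

use parquet2::read::read_metadata;

fn main() {
    // hypothetical file name
    let mut file = File::open("example.parquet").expect("file exists");

    // IO + thrift parsing only; no page is decompressed or decoded here
    let metadata = read_metadata(&mut file).expect("valid parquet footer");

    // `row_groups` is the same field used by the parallelism example below
    println!("row groups: {}", metadata.row_groups.len());
}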

Functionality implemented

  • Read dictionary pages
  • Read and write V1 pages
  • Read and write V2 pages
  • Compression and decompression (all)

Functionality not (yet) implemented

The parquet format has multiple encoding strategies for the different physical types. This crate currently decodes almost all of them and supports encoding to a subset of them.

Supported decoding

Delta encodings are still experimental, as I have been unable to generate large pages encoded with them from Spark, which hinders robust integration tests.

Encoding

Organization

  • read: read metadata and pages
  • write: write metadata and pages
  • encoding: encoders and decoders of the different parquet encodings
  • page: page declarations
  • metadata: parquet files metadata (e.g. FileMetaData)
  • schema: types metadata declaration (e.g. ConvertedType)
  • types.rs: physical type declaration (i.e. how things are represented in memory).
  • statistics: deserialized representation of parquet statistics
  • compression: compressors and decompressors (e.g. Gzip)
  • error: errors declaration

Run integration tests

There are integration tests against parquet files generated by pyarrow. To run them, execute

python3 -m venv venv
venv/bin/pip install pip --upgrade
venv/bin/pip install pyarrow==7
venv/bin/python tests/write_pyarrow.py
cargo test

The Python steps are only needed once (and again whenever tests/write_pyarrow.py changes).

How to implement page readers

The in-memory format used to consume parquet pages strongly influences how the pages should be deserialized. As such, this crate does not commit to a particular in-memory format. Consumers are responsible for converting pages to their target in-memory format.

This git repository contains a serialization to a simple in-memory format in integration, which is used to validate integration with other implementations.

There is also an implementation for the arrow format in the arrow2 crate.
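
As an illustration of what a consumer-side "page reader" can look like, here is a hypothetical skeleton; the trait, the type names, and the Vec<i32> target are invented for this sketch, and only the shape (decoded page bytes in, an in-memory format out) follows from the text above. PLAIN-encoded INT32 values are little-endian 4-byte integers, which is what the example decodes.

// Hypothetical consumer-side skeleton; none of these names come from parquet2.
trait PageDeserializer {
    /// The in-memory format chosen by the consumer.
    type Output;

    /// Convert the decoded values of one page into `Output`.
    fn deserialize(&self, decoded_values: &[u8]) -> Self::Output;
}

/// Example: materialize a page of PLAIN-encoded 32-bit integers into a Vec<i32>.
struct VecI32Deserializer;

impl PageDeserializer for VecI32Deserializer {
    type Output = Vec<i32>;

    fn deserialize(&self, decoded_values: &[u8]) -> Vec<i32> {
        decoded_values
            .chunks_exact(4)
            .map(|b| i32::from_le_bytes([b[0], b[1], b[2], b[3]]))
            .collect()
    }
}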

Higher Parallelism

Typically, converting a page into an in-memory format is expensive, so consider how to distribute that work across threads. E.g.:

let mut handles = vec![];
for column in columns {
    let column_meta = metadata.row_groups[row_group].column(column);
    // read this column's compressed pages into memory on the main (IO) thread
    // (arguments shown schematically)
    let compressed_pages = get_page_iterator(column_meta, &mut file, file)?.collect::<Result<Vec<_>, _>>()?;
    // each compressed page owns its buffer; cloning it is expensive(!), so we move it into
    // the thread so that the memory is released at the end of the processing
    handles.push(thread::spawn(move || {
        page_iter_to_array(compressed_pages.into_iter())
    }));
}
// wait for the CPU-bound work running on the worker threads
let columns_from_all_groups: Vec<_> = handles
    .into_iter()
    .map(|handle| handle.join().unwrap())
    .collect();

This will read the file as quickly as possible on the main thread and send the CPU-intensive work to other threads, thereby maximizing read throughput (at the cost of storing multiple compressed pages in memory; buffering is also an option here).

Decoding flow

Generally, a parquet file is read as follows:

  1. Read the metadata
  2. Seek to a row group and column
  3. Iterate over the (compressed) pages within that (group, column)

This is IO-intensive and requires parsing thrift and seeking within the file.

Once a compressed page is loaded into memory, it can be decompressed, decoded and deserialized into a specific in-memory format. All of these operations are CPU-intensive and are thus left to consumers to perform, as they may want to send this work to threads.

read -> compressed page -> decompressed page -> decoded bytes -> deserialized
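
The split between the two halves of this pipeline can be sketched with nothing but the standard library; the page/array types and the decompress_decode_deserialize helper below are placeholders, not parquet2 APIs. The main thread performs the IO-bound reads and a worker thread performs the CPU-bound work:

use std::sync::mpsc;
use std::thread;

// placeholder types standing in for compressed pages and the target in-memory format
type CompressedPage = Vec<u8>;
type Array = Vec<i64>;

// placeholder for the CPU-bound stages: decompress -> decode -> deserialize
fn decompress_decode_deserialize(page: CompressedPage) -> Array {
    page.into_iter().map(|b| b as i64).collect()
}

fn main() {
    let (sender, receiver) = mpsc::channel::<CompressedPage>();

    // CPU-bound half: runs on a worker thread, one page at a time
    let worker = thread::spawn(move || {
        receiver
            .iter()
            .map(decompress_decode_deserialize)
            .collect::<Vec<Array>>()
    });

    // IO-bound half: the main thread "reads" compressed pages and forwards them
    for page in [vec![1u8, 2, 3], vec![4, 5, 6]] {
        sender.send(page).unwrap();
    }
    drop(sender); // close the channel so the worker's iterator ends

    let arrays = worker.join().unwrap();
    println!("deserialized {} arrays", arrays.len());
}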

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
