
sunchao / Parquet Rs

License: apache-2.0
Apache Parquet implementation in Rust

Programming Languages

rust

Projects that are alternatives to or similar to Parquet Rs

hadoop-etl-udfs
The Hadoop ETL UDFs are the main way to load data from Hadoop into EXASOL
Stars: ✭ 17 (-88.19%)
Mutual labels:  hadoop, parquet
Eel Sdk
Big Data Toolkit for the JVM
Stars: ✭ 140 (-2.78%)
Mutual labels:  hadoop, parquet
Bigdata Playground
A complete example of a big data application using: Kubernetes (kops/AWS), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL
Stars: ✭ 177 (+22.92%)
Mutual labels:  hadoop, parquet
DaFlow
Apache Spark-based data flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.
Stars: ✭ 24 (-83.33%)
Mutual labels:  hadoop, parquet
Devops Python Tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Stars: ✭ 406 (+181.94%)
Mutual labels:  hadoop, parquet
Gaffer
A large-scale entity and relation database supporting aggregation of properties
Stars: ✭ 1,642 (+1040.28%)
Mutual labels:  hadoop, parquet
wasp
WASP is a framework for building complex real-time big data applications. It relies on a kind of Kappa/Lambda architecture, mainly leveraging Kafka and Spark. If you need to ingest huge amounts of heterogeneous data and analyze them through complex pipelines, this is the framework for you.
Stars: ✭ 19 (-86.81%)
Mutual labels:  hadoop, parquet
Iceberg
Iceberg is a table format for large, slow-moving tabular data
Stars: ✭ 393 (+172.92%)
Mutual labels:  hadoop, parquet
Parquet4s
Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.
Stars: ✭ 125 (-13.19%)
Mutual labels:  hadoop, parquet
Parquet Go
Go package to read and write Parquet files. Parquet is a file format that stores nested data structures in a flat columnar layout. It can be used in the Hadoop ecosystem and with tools such as Presto and AWS Athena.
Stars: ✭ 114 (-20.83%)
Mutual labels:  hadoop, parquet
Drill
Apache Drill is a distributed MPP query layer for self-describing data
Stars: ✭ 1,619 (+1024.31%)
Mutual labels:  hadoop, parquet
Airflow Pipeline
An Airflow Docker image preconfigured to work well with Spark and Hadoop/EMR
Stars: ✭ 128 (-11.11%)
Mutual labels:  hadoop
Ibis
A pandas-like deferred expression system, with first-class SQL support
Stars: ✭ 1,630 (+1031.94%)
Mutual labels:  hadoop
Datax
DataX is an open-source universal ETL tool that supports Cassandra, ClickHouse, DBF, Hive, InfluxDB, Kudu, MySQL, Oracle, Presto (Trino), PostgreSQL, and SQL Server
Stars: ✭ 116 (-19.44%)
Mutual labels:  hadoop
Amazon S3 Find And Forget
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Stars: ✭ 115 (-20.14%)
Mutual labels:  parquet
Hbaseclient
HBase client data management software
Stars: ✭ 135 (-6.25%)
Mutual labels:  hadoop
Spydra
Ephemeral Hadoop clusters using Google Cloud Platform
Stars: ✭ 128 (-11.11%)
Mutual labels:  hadoop
Asakusafw
Asakusa Framework
Stars: ✭ 114 (-20.83%)
Mutual labels:  hadoop
Griffon Vm
Griffon Data Science Virtual Machine
Stars: ✭ 128 (-11.11%)
Mutual labels:  hadoop
Tensorflowonyarn
Support TensorFlow on YARN
Stars: ✭ 114 (-20.83%)
Mutual labels:  hadoop

parquet-rs


An Apache Parquet implementation in Rust.

NOTE: this project has been merged into Apache Arrow, and development will continue there. To report an issue or submit a pull request, please file a JIRA in the Arrow project.

Usage

Add this to your Cargo.toml:

[dependencies]
parquet = "0.4"

and this to your crate root:

extern crate parquet;

Example usage of reading data:

use std::fs::File;
use std::path::Path;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() {
  // Open the file and wrap it in a serialized file reader.
  let file = File::open(&Path::new("/path/to/file")).unwrap();
  let reader = SerializedFileReader::new(file).unwrap();

  // Iterate over all rows; `None` means no schema projection is applied.
  let mut iter = reader.get_row_iter(None).unwrap();
  while let Some(record) = iter.next() {
    println!("{}", record);
  }
}

See the crate documentation for the available API.
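Rows can also be read with a column projection by passing a schema subset to get_row_iter. A minimal sketch, assuming the file contains an INT32 column named id (the schema string and path are illustrative):

use std::fs::File;
use std::path::Path;
use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::schema::parser::parse_message_type;

fn main() {
  // Hypothetical projection: read only the `id` column.
  let projection = parse_message_type("message schema { REQUIRED INT32 id; }").unwrap();

  let file = File::open(&Path::new("/path/to/file")).unwrap();
  let reader = SerializedFileReader::new(file).unwrap();
  let mut iter = reader.get_row_iter(Some(projection)).unwrap();
  while let Some(record) = iter.next() {
    println!("{}", record);
  }
}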

Supported Parquet Version

  • Parquet-format 2.4.0

To update the Parquet format to a newer version, first check whether a matching parquet-format crate version is available, then update the parquet-format crate version in Cargo.toml, as sketched below.
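For illustration, a hypothetical version bump in this crate's Cargo.toml (the version number is made up):

[dependencies]
parquet-format = "2.5"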

Features

  • [X] All encodings supported
  • [X] All compression codecs supported
  • [X] Read support
    • [X] Primitive column value readers
    • [X] Row record reader
    • [ ] Arrow record reader
  • [X] Statistics support
  • [X] Write support (see the writer sketch after this list)
    • [X] Primitive column value writers
    • [ ] Row record writer
    • [ ] Arrow record writer
  • [ ] Predicate pushdown
  • [ ] Parquet format 2.5 support
  • [ ] HDFS support
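As noted in the write-support item above, primitive column values can already be written through the low-level column writer API. A minimal sketch, assuming the file writer API as documented for this crate; the schema string, output path, and values are illustrative:

use std::fs::File;
use std::rc::Rc;
use parquet::column::writer::ColumnWriter;
use parquet::file::properties::WriterProperties;
use parquet::file::writer::{FileWriter, SerializedFileWriter};
use parquet::schema::parser::parse_message_type;

fn main() {
  // Hypothetical one-column schema; adapt the message type to your data.
  let schema = Rc::new(parse_message_type("message schema { REQUIRED INT32 id; }").unwrap());
  let props = Rc::new(WriterProperties::builder().build());
  let file = File::create("/path/to/output").unwrap();
  let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();

  // Write a single row group, one batch of values per column.
  let mut row_group_writer = writer.next_row_group().unwrap();
  while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
    if let ColumnWriter::Int32ColumnWriter(ref mut typed) = col_writer {
      typed.write_batch(&[1, 2, 3], None, None).unwrap();
    }
    row_group_writer.close_column(col_writer).unwrap();
  }
  writer.close_row_group(row_group_writer).unwrap();
  writer.close().unwrap();
}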

Requirements

  • Rust nightly

See Working with nightly Rust for instructions on installing the nightly toolchain and setting it as the default.
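For example, with rustup (assuming you manage toolchains through rustup):

rustup toolchain install nightly
rustup default nightly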

Build

Run cargo build to build in debug mode, or cargo build --release to build in release mode. Some features take advantage of SSE4.2 instructions, which can be enabled by adding RUSTFLAGS="-C target-feature=+sse4.2" before the cargo build command.
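For example:

RUSTFLAGS="-C target-feature=+sse4.2" cargo build --release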

Test

Run cargo test for unit tests.

Binaries

The following binaries are provided (use cargo install to install them):

  • parquet-schema for printing Parquet file schema and metadata. Usage: parquet-schema <file-path> [verbose], where file-path is the path to a Parquet file and the optional verbose flag prints full metadata in addition to the schema (when omitted, only the schema is printed).

  • parquet-read for reading records from a Parquet file. Usage: parquet-read <file-path> [num-records], where file-path is the path to a Parquet file and num-records is the number of records to read (when omitted, all records are printed).
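For example (the file path and record count are illustrative):

parquet-schema /path/to/file.parquet verbose
parquet-read /path/to/file.parquet 10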

If you see a "Library not loaded" error, please make sure LD_LIBRARY_PATH is set properly:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(rustc --print sysroot)/lib

Benchmarks

Run cargo bench for benchmarks.

Docs

To build the documentation, run cargo doc --no-deps. To build it and open it in the browser, run cargo doc --no-deps --open.

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0.
