
hannesmuehleisen / miniparquet

Licence: other
Library to read a subset of Parquet files

Programming Languages

C++
36643 projects - #6 most used programming language
Thrift
134 projects
c
50402 projects - #5 most used programming language
python
139335 projects - #7 most used programming language
r
7636 projects
Makefile
30231 projects
shell
77523 projects

Projects that are alternatives of or similar to miniparquet

FileConvert
Converts between file formats such as CSV and Parquet
Stars: ✭ 14 (-63.16%)
Mutual labels:  parquet-files, parquet-cpp
Vscode Data Preview
Data Preview 🈸 extension for importing 📤 viewing 🔎 slicing 🔪 dicing 🎲 charting 📊 & exporting 📥 large JSON array/config, YAML, Apache Arrow, Avro, Parquet & Excel data files
Stars: ✭ 245 (+544.74%)
Mutual labels:  parquet
Kglab
Graph-Based Data Science: an abstraction layer in Python for building knowledge graphs, integrated with popular graph libraries – atop Pandas, RDFlib, pySHACL, RAPIDS, NetworkX, iGraph, PyVis, pslpython, pyarrow, etc.
Stars: ✭ 98 (+157.89%)
Mutual labels:  parquet
Parquet Rs
Apache Parquet implementation in Rust
Stars: ✭ 144 (+278.95%)
Mutual labels:  parquet
Parquet Go
Go package to read and write parquet files. parquet is a file format to store nested data structures in a flat columnar data format. It can be used in the Hadoop ecosystem and with tools such as Presto and AWS Athena.
Stars: ✭ 114 (+200%)
Mutual labels:  parquet
Sqlite Parquet Vtable
A SQLite vtable extension to read Parquet files
Stars: ✭ 167 (+339.47%)
Mutual labels:  parquet
Parquet Mr
Apache Parquet
Stars: ✭ 1,278 (+3263.16%)
Mutual labels:  parquet
jpopup
Simple lightweight (<2kB) javascript popup modal plugin
Stars: ✭ 27 (-28.95%)
Mutual labels:  dependency-free
Awkward 0.x
Manipulate arrays of complex data structures as easily as Numpy.
Stars: ✭ 216 (+468.42%)
Mutual labels:  parquet
Kartothek
A consistent table management library in python
Stars: ✭ 144 (+278.95%)
Mutual labels:  parquet
Eel Sdk
Big Data Toolkit for the JVM
Stars: ✭ 140 (+268.42%)
Mutual labels:  parquet
Amazon S3 Find And Forget
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Stars: ✭ 115 (+202.63%)
Mutual labels:  parquet
Bigdata Playground
A complete example of a big data application using : Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLib, Apache Flink, Scala, Python, Apache Kafka, Apache Hbase, Apache Parquet, Apache Avro, Apache Storm, Twitter Api, MongoDB, NodeJS, Angular, GraphQL
Stars: ✭ 177 (+365.79%)
Mutual labels:  parquet
Parquet Index
Spark SQL index for Parquet tables
Stars: ✭ 109 (+186.84%)
Mutual labels:  parquet
openmrs-fhir-analytics
A collection of tools for extracting FHIR resources and analytics services on top of that data.
Stars: ✭ 55 (+44.74%)
Mutual labels:  parquet
Schemer
Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.
Stars: ✭ 97 (+155.26%)
Mutual labels:  parquet
Gaffer
A large-scale entity and relation database supporting aggregation of properties
Stars: ✭ 1,642 (+4221.05%)
Mutual labels:  parquet
Parquetviewer
Simple windows desktop application for viewing & querying Apache Parquet files
Stars: ✭ 145 (+281.58%)
Mutual labels:  parquet
velox
The minimal PHP micro-framework.
Stars: ✭ 55 (+44.74%)
Mutual labels:  dependency-free
denoliver
A simple, dependency free static file server for Deno with possibly the worst name ever.
Stars: ✭ 94 (+147.37%)
Mutual labels:  dependency-free

miniparquet


miniparquet is a reader for a common subset of Parquet files. miniparquet only supports rectangular-shaped data structures (no nested tables) and only the Snappy compression scheme. miniparquet has no (zero, none, 0) external dependencies and is very lightweight. It compiles in seconds to a binary size of under 1 MB.

Installation

miniparquet comes as a C++ library, a Python package, and an R package. Install the R package like so:

devtools::install_github("hannesmuehleisen/miniparquet")

The C++ library can be built by typing make.

The Python package is installed using python setup.py install.

Usage

Use the R package like so: df <- miniparquet::parquet_read("example.parquet")

Folders of similar-structured Parquet files (e.g. produced by Spark) can be read like this:

df <- data.table::rbindlist(lapply(Sys.glob("some-folder/part-*.parquet"), miniparquet::parquet_read))

If you find a file that should be supported but isn't, please open an issue here with a link to the file.

Use the Python package like so: miniparquet.read('example.parquet'). You can convert the result to a Pandas dataframe like so: pandas.DataFrame.from_dict(miniparquet.read('example.parquet'))
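The folder-reading pattern shown for R above can be sketched for the Python package as well. This is a minimal sketch, not part of miniparquet: the helper name read_parquet_folder and its read_fn parameter are hypothetical, and it assumes that miniparquet.read returns a dict mapping column names to sequences of values (which the from_dict conversion above suggests). It also assumes every file has the same columns.

```python
import glob


def read_parquet_folder(pattern, read_fn=None):
    """Read every file matching `pattern` and concatenate columns by name.

    Hypothetical helper. `read_fn` is assumed to return a dict of
    column name -> sequence; it defaults to miniparquet.read.
    """
    if read_fn is None:
        # Imported lazily so the helper can be exercised without the package.
        import miniparquet
        read_fn = miniparquet.read

    combined = {}
    # Sort the paths so rows are appended in a stable, predictable order.
    for path in sorted(glob.glob(pattern)):
        table = read_fn(path)
        for name, values in table.items():
            combined.setdefault(name, []).extend(values)
    return combined
```

The result can be passed to pandas.DataFrame.from_dict in the same way as the output of a single miniparquet.read call, for example pandas.DataFrame.from_dict(read_parquet_folder("some-folder/part-*.parquet")).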

Performance

miniparquet is quite fast: on my laptop (I7-4578U) it can read compressed Parquet files at over 200 MB/s using only a single thread. Previously, there was a comparison with the arrow package here, but it appeared that the results were caused by a bug, which has since been fixed.
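To get a rough throughput figure on your own machine, a single read can be timed as in the sketch below. The helper name read_throughput_mb_s is hypothetical, and the sketch assumes that throughput means on-disk (compressed) file size divided by wall-clock read time, with 1 MB taken as 10^6 bytes. This is not necessarily how the figure above was measured.

```python
import os
import time


def read_throughput_mb_s(path, read_fn):
    """Return on-disk megabytes per second for one call to read_fn(path).

    Hypothetical helper. 1 MB is taken as 10**6 bytes (an assumption).
    """
    size_mb = os.path.getsize(path) / 1e6
    start = time.perf_counter()
    read_fn(path)
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very small files or coarse timers.
    return size_mb / max(elapsed, 1e-9)
```

With the Python package installed, this could be called as read_throughput_mb_s("example.parquet", miniparquet.read). A single run is noisy and is affected by the operating system's file cache, so repeating the measurement a few times gives a more stable picture.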
