coady / graphique

Licence: other
GraphQL service for arrow tables and parquet data sets.

Programming Languages

  • Python
  • Makefile

Projects that are alternatives of or similar to graphique

Awkward 0.x
Manipulate arrays of complex data structures as easily as NumPy.
Stars: ✭ 216 (+671.43%)
Mutual labels:  arrow, parquet
Kartothek
A consistent table management library in python
Stars: ✭ 144 (+414.29%)
Mutual labels:  arrow, parquet
Roapi
Create full-fledged APIs for static datasets without writing a single line of code.
Stars: ✭ 253 (+803.57%)
Mutual labels:  arrow, parquet
Vscode Data Preview
Data Preview 🈸 extension for importing 📤 viewing 🔎 slicing 🔪 dicing 🎲 charting 📊 & exporting 📥 large JSON array/config, YAML, Apache Arrow, Avro, Parquet & Excel data files
Stars: ✭ 245 (+775%)
Mutual labels:  arrow, parquet
DaFlow
Apache Spark-based data flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.
Stars: ✭ 24 (-14.29%)
Mutual labels:  parquet
odbc2parquet
A command line tool to query an ODBC data source and write the result into a parquet file.
Stars: ✭ 95 (+239.29%)
Mutual labels:  parquet
databricks-notebooks
Collection of Databricks and Jupyter Notebooks
Stars: ✭ 19 (-32.14%)
Mutual labels:  parquet
terraform-aws-kinesis-firehose
This code creates a Kinesis Firehose in AWS to send CloudWatch log data to S3.
Stars: ✭ 25 (-10.71%)
Mutual labels:  parquet
vinum
Vinum is a SQL processor for Python, designed for data analysis workflows and in-memory analytics.
Stars: ✭ 57 (+103.57%)
Mutual labels:  arrow
polars
Fast multi-threaded DataFrame library in Rust | Python | Node.js
Stars: ✭ 6,368 (+22642.86%)
Mutual labels:  arrow
arrow-datafusion
Apache Arrow DataFusion SQL Query Engine
Stars: ✭ 2,360 (+8328.57%)
Mutual labels:  arrow
hadoop-etl-udfs
The Hadoop ETL UDFs are the main way to load data from Hadoop into EXASOL
Stars: ✭ 17 (-39.29%)
Mutual labels:  parquet
arrow-site
Mirror of Apache Arrow site
Stars: ✭ 16 (-42.86%)
Mutual labels:  arrow
IMCtermite
Enables extraction of measurement data from binary files with extension 'raw', used by the proprietary software imcFAMOS/imcSTUDIO, and facilitates its storage in open source file formats
Stars: ✭ 20 (-28.57%)
Mutual labels:  parquet
parquet2
Fastest and safest Rust implementation of parquet. `unsafe` free. Integration-tested against pyarrow
Stars: ✭ 157 (+460.71%)
Mutual labels:  parquet
columnify
Converts record-oriented data to columnar format.
Stars: ✭ 28 (+0%)
Mutual labels:  parquet
hood
The plugin to manage benchmarks on your CI
Stars: ✭ 17 (-39.29%)
Mutual labels:  arrow
parquet-usql
A custom extractor designed to read parquet for Azure Data Lake Analytics
Stars: ✭ 13 (-53.57%)
Mutual labels:  parquet
wasp
WASP is a framework to build complex real-time big data applications. It relies on a kind of Kappa/Lambda architecture, mainly leveraging Kafka and Spark. If you need to ingest huge amounts of heterogeneous data and analyze them through complex pipelines, this is the framework for you.
Stars: ✭ 19 (-32.14%)
Mutual labels:  parquet
tooltip
[DEPRECATED] The tooltip that has all the right moves
Stars: ✭ 133 (+375%)
Mutual labels:  arrow

GraphQL service for arrow tables and parquet data sets. The schema for a query API is derived automatically.

Usage

% env PARQUET_PATH=... uvicorn graphique.service:app

Open http://localhost:8000/graphql to try out the API in GraphiQL. There is a test fixture at ./tests/fixtures/zipcodes.parquet.
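
As a first query, something along these lines slices the table and reads two columns in parallel (the state and city column names are assumptions based on the zipcodes fixture; adjust to your schema):

  {
    slice(length: 3) {
      columns {
        state { values }
        city { values }
      }
    }
  }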

% python3 -m graphique.schema ...

outputs the GraphQL schema for a parquet data set.

Configuration

Graphique uses Starlette's config, loaded from environment variables or a .env file. The config variables are passed as arguments when reading the parquet dataset; an example follows the list below.

  • PARQUET_PATH: path to the parquet directory or file
  • INDEX = []: partition keys or names of columns which represent a sorted composite index
  • FEDERATED = '': field name to extend type Query with a federated Table
  • DEBUG = False: run service in debug mode, which includes timing
  • DICTIONARIES = []: names of columns to read as dictionaries
  • COLUMNS = []: names of columns to read at startup; * indicates all
  • FILTERS = {}: json Query input specifying which rows to read at startup
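
For example, a .env for the included test fixture might look like this (the values are only a sketch, and list-valued variables are assumed to be comma-separated, per Starlette's CommaSeparatedStrings):

  PARQUET_PATH=tests/fixtures/zipcodes.parquet
  COLUMNS=*
  INDEX=state,zipcode

The same variables can also be passed directly in the environment, as in the uvicorn invocation above.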

API

types

  • Table: an arrow Table; the primary interface.
  • Column: an arrow Column (a.k.a. ChunkedArray). Each arrow data type has a corresponding column implementation: Boolean, Int, Long, Float, Decimal, Date, DateTime, Time, Duration, Base64, String, List, Struct. All columns have a values field for their list of scalars. Additional fields vary by type.
  • Row: scalar fields. Arrow tables are column-oriented, and graphique encourages that usage for performance. A single row field is provided for convenience, but a field for a list of rows is not. Requesting parallel columns is far more efficient.
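
For example, a column fetched by name can be queried through its concrete type (this assumes column returns a Column interface, so an inline fragment selects the type-specific fields; the StringColumn type name and the state column are illustrative):

  {
    column(name: "state") {
      ... on StringColumn { values }
    }
  }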

selection

  • slice: contiguous selection of rows
  • search: binary search if the table is sorted, i.e., an INDEX is configured
  • filter: select rows from predicate functions
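
For instance, a filter over a string column might look like the following (the equal operator and state column are illustrative of the Query input syntax, not a confirmed signature):

  {
    filter(query: {state: {equal: "CA"}}) {
      columns { city { values } }
    }
  }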

projection

  • columns: provides a field for every Column in the schema
  • column: access a column of any type by name
  • row: provides a field for each scalar of a single row
  • apply: transform columns by applying a function
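
Reading a single row and a parallel column together might be sketched as follows (field names again assume the zipcodes fixture):

  {
    row { state city }
    columns { zipcode { values } }
  }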

aggregation

  • group: group by given columns, transforming the others into list columns
  • partition: partition on adjacent values in given columns, transforming the others into list columns
  • aggregate: apply reduce functions to list columns
  • tables: return a list of tables by splitting on the scalars in list columns
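
A grouping query could be sketched as follows (the by argument name is an assumption; after grouping, the remaining columns become list columns, which tables then splits back into one table per group):

  {
    group(by: ["state"]) {
      tables {
        columns { city { values } }
      }
    }
  }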

ordering

  • sort: sort table by given columns
  • min: select rows with smallest values
  • max: select rows with largest values
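
For example, sorting and then taking the first rows (argument names are assumptions in the same spirit as above):

  {
    sort(by: ["state"]) {
      slice(length: 5) {
        columns { state { values } }
      }
    }
  }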

Performance

Graphique relies on native PyArrow routines wherever possible. Otherwise it falls back to using NumPy or custom optimizations.

By default, datasets are read on-demand, with only the necessary columns selected. Additionally, filter(query: ...) is optimized to filter rows while reading the dataset. Even though graphique is a long-running service, parquet is performant at reading only a subset of data. Optionally specify COLUMNS to read a subset of columns (or * for all) at startup, trading off memory for latency. Similarly, specify FILTERS in the json format of the Query input type to read a subset of rows at startup.

Specifying an INDEX indicates the table is sorted, and enables the binary search field. Specifying INDEX without reading at startup (i.e., without FILTERS or COLUMNS) is allowed, but only recommended if the index corresponds to the partition keys. In that case, search(...) is functionally equivalent to filter(query: ...).
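
As a sketch, reading a subset of rows and columns at startup could look like this (the equal operator is again illustrative of the Query input's json format, and the elided path must be filled in):

% env PARQUET_PATH=... COLUMNS='*' FILTERS='{"state": {"equal": "CA"}}' uvicorn graphique.service:app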

Installation

% pip install graphique[server]

Dependencies

  • pyarrow >=8
  • strawberry-graphql[asgi] >=0.109
  • uvicorn (or other ASGI server)

Tests

100% branch coverage.

% pytest [--cov]

Changes

0.8

  • Pyarrow >=8 required
  • Grouping and aggregation integrated
  • AbstractTable interface renamed to Dataset
  • Binary scalar renamed to Base64

0.7

  • Pyarrow >=7 required
  • FILTERS use query syntax and trigger reading the dataset
  • FEDERATED field configuration
  • List columns support sorting and filtering
  • Group by and aggregate optimizations
  • Dataset scanning

0.6

  • Pyarrow >=6 required
  • Group by optimized and replaced unique field
  • Dictionary related optimizations
  • Null consistency with arrow count functions

0.5

  • Pyarrow >=5 required
  • Stricter validation of inputs
  • Columns can be cast to another arrow data type
  • Grouping uses large list arrays with 64-bit counts
  • Datasets are read on-demand or optionally at startup

0.4

  • Pyarrow >=4 required
  • sort updated to use new native routines
  • partition tables by adjacent values and differences
  • filter supports unknown column types using tagged union pattern
  • Groups replaced with Table.tables and Table.aggregate fields
  • Tagged unions used for filter, apply, and partition functions

0.3

  • Pyarrow >=3 required
  • any and all fields
  • String column split field

0.2

  • Pyarrow >=2 required
  • ListColumn and StructColumn types
  • Groups type with aggregate field
  • group and unique optimized
  • Statistical fields: mode, stddev, variance
  • is_in, min, and max optimized