
JDASoftwareGroup / Kartothek

License: MIT
A consistent table management library in Python

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Kartothek

graphique
GraphQL service for arrow tables and parquet data sets.
Stars: ✭ 28 (-80.56%)
Mutual labels:  arrow, parquet
Vscode Data Preview
Data Preview 🈸 extension for importing 📤 viewing 🔎 slicing 🔪 dicing 🎲 charting 📊 & exporting 📥 large JSON array/config, YAML, Apache Arrow, Avro, Parquet & Excel data files
Stars: ✭ 245 (+70.14%)
Mutual labels:  parquet, arrow
Awkward 0.x
Manipulate arrays of complex data structures as easily as Numpy.
Stars: ✭ 216 (+50%)
Mutual labels:  parquet, arrow
Roapi
Create full-fledged APIs for static datasets without writing a single line of code.
Stars: ✭ 253 (+75.69%)
Mutual labels:  parquet, arrow
Amazon S3 Find And Forget
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Stars: ✭ 115 (-20.14%)
Mutual labels:  parquet
Schemer
Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.
Stars: ✭ 97 (-32.64%)
Mutual labels:  parquet
Arrow.jl
Pure Julia implementation of the apache arrow data format (https://arrow.apache.org/)
Stars: ✭ 92 (-36.11%)
Mutual labels:  arrow
Parquet Mr
Apache Parquet
Stars: ✭ 1,278 (+787.5%)
Mutual labels:  parquet
Eel Sdk
Big Data Toolkit for the JVM
Stars: ✭ 140 (-2.78%)
Mutual labels:  parquet
Pydata Chicago2016 Ml Tutorial
Machine learning with scikit-learn tutorial at PyData Chicago 2016
Stars: ✭ 128 (-11.11%)
Mutual labels:  pydata
Leader Line
Draw a leader line in your web page.
Stars: ✭ 1,872 (+1200%)
Mutual labels:  arrow
Kglab
Graph-Based Data Science: an abstraction layer in Python for building knowledge graphs, integrated with popular graph libraries – atop Pandas, RDFlib, pySHACL, RAPIDS, NetworkX, iGraph, PyVis, pslpython, pyarrow, etc.
Stars: ✭ 98 (-31.94%)
Mutual labels:  parquet
Drill
Apache Drill is a distributed MPP query layer for self describing data
Stars: ✭ 1,619 (+1024.31%)
Mutual labels:  parquet
Pyvtreat
vtreat is a data frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. Distributed under a BSD-3-Clause license.
Stars: ✭ 92 (-36.11%)
Mutual labels:  pydata
Gaffer
A large-scale entity and relation database supporting aggregation of properties
Stars: ✭ 1,642 (+1040.28%)
Mutual labels:  parquet
Open Arrow
Open Arrow is an open-source font that contains 112 arrow symbols from U+2190 to U+21ff
Stars: ✭ 89 (-38.19%)
Mutual labels:  arrow
Blazingsql
BlazingSQL is a lightweight, GPU accelerated, SQL engine for Python. Built on RAPIDS cuDF.
Stars: ✭ 1,652 (+1047.22%)
Mutual labels:  arrow
Parquet4s
Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.
Stars: ✭ 125 (-13.19%)
Mutual labels:  parquet
Parquet Index
Spark SQL index for Parquet tables
Stars: ✭ 109 (-24.31%)
Mutual labels:  parquet
Pymapd
Python client for OmniSci GPU-accelerated SQL engine and analytics platform
Stars: ✭ 109 (-24.31%)
Mutual labels:  pydata

Kartothek


Kartothek is a Python library to manage (create, read, update, delete) large amounts of tabular data in a blob store. It stores data as datasets, which it presents to the user as pandas DataFrames. A dataset is a collection of files with the same schema that reside in a blob store. Kartothek uses a metadata definition to handle these datasets efficiently. For distributed access and manipulation of datasets, Kartothek offers a Dask interface.

Storing data distributed over multiple files in a blob store (S3, ABS, GCS, etc.) allows for a fast, cost-efficient and highly scalable data infrastructure. A downside of storing data solely in an object store is that the stores themselves offer little to no guarantees beyond the consistency of a single file. In particular, they cannot guarantee the consistency of your dataset. If we demand a consistent state of our dataset at all times, we need to track the state of the dataset ourselves. Kartothek frees us from having to do this manually.

The kartothek.io module provides building blocks to create and modify these datasets in data pipelines. Kartothek handles I/O, tracks dataset partitions and selects subsets of data transparently.

Installation

Installers for the latest released version are available on the Python Package Index (PyPI) and on conda-forge.

# Install with pip
pip install kartothek
# Install with conda
conda install -c conda-forge kartothek

What is a (real) Kartothek?

A Kartothek (or, in more modern terms, a Zettelkasten or Katalogkasten) is a card catalog: a tool for organizing high-level information extracted from a source of information.
