154 Open source projects that are alternatives of or similar to Petastorm

Spark
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning, or SQL workloads that require fast iterative access to datasets. This project will have sample programs for Spark in the Scala language.
Stars: ✭ 55 (-95.04%)
Mutual labels:  parquet
Scriptis
Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, resource management and intelligent diagnosis.
Stars: ✭ 696 (-37.18%)
Mutual labels:  pyspark
DaFlow
Apache Spark-based data flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.
Stars: ✭ 24 (-97.83%)
Mutual labels:  parquet
mmtf-workshop-2018
Structural Bioinformatics Training Workshop & Hackathon 2018
Stars: ✭ 50 (-95.49%)
Mutual labels:  pyspark
jupyterlab-sparkmonitor
JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook
Stars: ✭ 78 (-92.96%)
Mutual labels:  pyspark
Pucket
Bucketing and partitioning system for Parquet
Stars: ✭ 29 (-97.38%)
Mutual labels:  parquet
pyspark-k8s-boilerplate
Boilerplate for PySpark on Cloud Kubernetes
Stars: ✭ 24 (-97.83%)
Mutual labels:  pyspark
spark-extension
A library that provides useful extensions to Apache Spark and PySpark.
Stars: ✭ 25 (-97.74%)
Mutual labels:  pyspark
hadoop-etl-udfs
The Hadoop ETL UDFs are the main way to load data from Hadoop into EXASOL
Stars: ✭ 17 (-98.47%)
Mutual labels:  parquet
Spark Syntax
This is a repo documenting the best practices in PySpark.
Stars: ✭ 412 (-62.82%)
Mutual labels:  pyspark
pyspark-asyncactions
Asynchronous actions for PySpark
Stars: ✭ 30 (-97.29%)
Mutual labels:  pyspark
Node Parquet
NodeJS module to access apache parquet format files
Stars: ✭ 46 (-95.85%)
Mutual labels:  parquet
jobAnalytics and search
JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
Stars: ✭ 25 (-97.74%)
Mutual labels:  pyspark
columnify
Convert record-oriented data to columnar format.
Stars: ✭ 28 (-97.47%)
Mutual labels:  parquet
incubator-linkis
Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB...), exposes various interfaces (REST, JDBC, Java...), with multi-tenancy, high performance, and resource control.
Stars: ✭ 2,459 (+121.93%)
Mutual labels:  pyspark
terraform-aws-kinesis-firehose
This code creates a Kinesis Firehose in AWS to send CloudWatch log data to S3.
Stars: ✭ 25 (-97.74%)
Mutual labels:  parquet
Iceberg
Iceberg is a table format for large, slow-moving tabular data
Stars: ✭ 393 (-64.53%)
Mutual labels:  parquet
Sparkora
Powerful rapid automatic EDA and feature engineering library with a very easy to use API 🌟
Stars: ✭ 51 (-95.4%)
Mutual labels:  pyspark
HybridBackend
Efficient training of deep recommenders on cloud.
Stars: ✭ 30 (-97.29%)
Mutual labels:  parquet
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.
Stars: ✭ 39 (-96.48%)
Mutual labels:  pyspark
Sparkling Titanic
Training models with Apache Spark, PySpark for Titanic Kaggle competition
Stars: ✭ 12 (-98.92%)
Mutual labels:  pyspark
flask-spark-docker
Just a boilerplate for PySpark and Flask
Stars: ✭ 32 (-97.11%)
Mutual labels:  pyspark
data processing course
Some class materials for a data processing course using PySpark
Stars: ✭ 50 (-95.49%)
Mutual labels:  pyspark
OSCI
Open Source Contributor Index
Stars: ✭ 107 (-90.34%)
Mutual labels:  pyspark
Choetl
ETL Framework for .NET / C# (Parser / Writer for CSV, Flat, XML, JSON, Key-Value, Parquet, YAML, Avro formatted files)
Stars: ✭ 372 (-66.43%)
Mutual labels:  parquet
pyspark-algorithms
PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2
Stars: ✭ 72 (-93.5%)
Mutual labels:  pyspark
Azure-Databricks-NYC-Taxi-Workshop
An Azure Databricks workshop leveraging the New York Taxi and Limousine Commission Trip Records dataset
Stars: ✭ 71 (-93.59%)
Mutual labels:  pyspark
spark-twitter-sentiment-analysis
Sentiment Analysis of a Twitter Topic with Spark Structured Streaming
Stars: ✭ 55 (-95.04%)
Mutual labels:  pyspark
Gcs Tools
GCS support for avro-tools, parquet-tools and protobuf
Stars: ✭ 57 (-94.86%)
Mutual labels:  parquet
parquet-flinktacular
How to use Parquet in Flink
Stars: ✭ 29 (-97.38%)
Mutual labels:  parquet
centurion
Kotlin Bigdata Toolkit
Stars: ✭ 320 (-71.12%)
Mutual labels:  parquet
qsv
CSVs sliced, diced & analyzed.
Stars: ✭ 438 (-60.47%)
Mutual labels:  parquet
Parquet Cpp
Apache Parquet
Stars: ✭ 339 (-69.4%)
Mutual labels:  parquet
jgit-spark-connector
jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis.
Stars: ✭ 71 (-93.59%)
Mutual labels:  pyspark
experiments
Code examples for my blog posts
Stars: ✭ 21 (-98.1%)
Mutual labels:  parquet
pyspark-cassandra
pyspark-cassandra is a Python port of the awesome @datastax Spark Cassandra connector. Compatible w/ Spark 2.0, 2.1, 2.2, 2.3 and 2.4
Stars: ✭ 70 (-93.68%)
Mutual labels:  pyspark
Spark Tdd Example
A simple Spark TDD example
Stars: ✭ 23 (-97.92%)
Mutual labels:  pyspark
openmrs-fhir-analytics
A collection of tools for extracting FHIR resources and analytics services on top of that data.
Stars: ✭ 55 (-95.04%)
Mutual labels:  parquet
big data
A collection of tutorials on Hadoop, MapReduce, Spark, Docker
Stars: ✭ 34 (-96.93%)
Mutual labels:  pyspark
workshop-spark
Code for Spark workshops with a Docker-based development environment
Stars: ✭ 27 (-97.56%)
Mutual labels:  pyspark
Pyspark Boilerplate
A boilerplate for writing PySpark Jobs
Stars: ✭ 318 (-71.3%)
Mutual labels:  pyspark
Morphl Community Edition
MorphL Community Edition uses big data and machine learning to predict user behaviors in digital products and services with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization
Stars: ✭ 253 (-77.17%)
Mutual labels:  pyspark
machine-learning-course
Machine Learning Course @ Santa Clara University
Stars: ✭ 17 (-98.47%)
Mutual labels:  pyspark
Gimel
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Stars: ✭ 216 (-80.51%)
Mutual labels:  pyspark
Optimus
🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Stars: ✭ 986 (-11.01%)
Mutual labels:  pyspark
Spark Practice
Apache Spark (PySpark) Practice on Real Data
Stars: ✭ 200 (-81.95%)
Mutual labels:  pyspark
DataEngineering
This repo contains commands that data engineers use in day to day work.
Stars: ✭ 47 (-95.76%)
Mutual labels:  pyspark
Spark Iforest
Isolation Forest on Spark
Stars: ✭ 166 (-85.02%)
Mutual labels:  pyspark
Elasticsearch loader
A tool for batch loading data files (json, parquet, csv, tsv) into ElasticSearch
Stars: ✭ 300 (-72.92%)
Mutual labels:  parquet
Linkis
Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB...), exposes various interfaces (REST, JDBC, Java...), with multi-tenancy, high performance, and resource control.
Stars: ✭ 2,323 (+109.66%)
Mutual labels:  pyspark
graphique
GraphQL service for arrow tables and parquet data sets.
Stars: ✭ 28 (-97.47%)
Mutual labels:  parquet
Learningapachespark
LearningApacheSpark
Stars: ✭ 155 (-86.01%)
Mutual labels:  pyspark
Parquet Generator
Parquet file generator
Stars: ✭ 16 (-98.56%)
Mutual labels:  parquet
sparklanes
A lightweight data processing framework for Apache Spark
Stars: ✭ 17 (-98.47%)
Mutual labels:  pyspark
Rumble
⛈️ Rumble 1.11.0 "Banyan Tree"🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
Stars: ✭ 58 (-94.77%)
Mutual labels:  parquet
Awesome Spark
A curated list of awesome Apache Spark packages and resources.
Stars: ✭ 1,061 (-4.24%)
Mutual labels:  pyspark
Sparkmagic
Jupyter magics and kernels for working with remote Spark clusters
Stars: ✭ 954 (-13.9%)
Mutual labels:  pyspark
Parquet Format
Apache Parquet
Stars: ✭ 800 (-27.8%)
Mutual labels:  parquet
Tdigest
t-Digest data structure in Python. Useful for percentiles and quantiles, including in distributed environments like PySpark
Stars: ✭ 274 (-75.27%)
Mutual labels:  pyspark
Spark-for-data-engineers
Apache Spark for data engineers
Stars: ✭ 22 (-98.01%)
Mutual labels:  pyspark