Spark: Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that let data workers efficiently execute streaming, machine learning, or SQL workloads requiring fast iterative access to datasets. This project contains sample Spark programs written in Scala.
Stars: ✭ 55 (-95.04%)
Scriptis: Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF and function management, resource management, and intelligent diagnosis.
Stars: ✭ 696 (-37.18%)
DaFlow: An Apache Spark-based data flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.
Stars: ✭ 24 (-97.83%)
mmtf-workshop-2018: Structural Bioinformatics Training Workshop & Hackathon 2018
Stars: ✭ 50 (-95.49%)
jupyterlab-sparkmonitor: JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook
Stars: ✭ 78 (-92.96%)
Pucket: Bucketing and partitioning system for Parquet
Stars: ✭ 29 (-97.38%)
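To illustrate the bucketing idea behind a tool like Pucket (this is a generic sketch of hash bucketing in plain Python, not Pucket's actual API), records with the same key are routed to the same bucket by a stable hash, so files for matching keys can be co-located:

```python
import hashlib

def bucket_for(key: str, num_buckets: int) -> int:
    """Assign a record key to a bucket via a stable hash (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Records with the same key always land in the same bucket, so an
# equi-join on the key can proceed bucket-by-bucket without a full shuffle.
records = ["user-1", "user-2", "user-3", "user-1"]
buckets = {}
for key in records:
    buckets.setdefault(bucket_for(key, num_buckets=4), []).append(key)
```

Note the hash must be stable across processes (hence `hashlib` rather than Python's salted built-in `hash`).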
spark-extension: A library that provides useful extensions to Apache Spark and PySpark.
Stars: ✭ 25 (-97.74%)
hadoop-etl-udfs: The Hadoop ETL UDFs are the main way to load data from Hadoop into EXASOL.
Stars: ✭ 17 (-98.47%)
Spark Syntax: This is a repo documenting the best practices in PySpark.
Stars: ✭ 412 (-62.82%)
Node Parquet: Node.js module to access Apache Parquet format files
Stars: ✭ 46 (-95.85%)
jobAnalytics and search: JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
Stars: ✭ 25 (-97.74%)
columnify: Converts record-oriented data to columnar format.
Stars: ✭ 28 (-97.47%)
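The core transformation behind a record-to-columnar converter like columnify can be sketched in plain Python (this is an illustration of the general idea, not columnify's implementation): rows of field/value pairs are transposed into one value list per column.

```python
# Row-oriented records, as they might arrive from a JSONL source.
records = [
    {"id": 1, "name": "ada"},
    {"id": 2, "name": "grace"},
]

def to_columnar(rows):
    """Transpose row-oriented dicts into one list per column."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns

columnar = to_columnar(records)
# columnar == {"id": [1, 2], "name": ["ada", "grace"]}
```

Columnar layouts like this are what formats such as Parquet encode on disk, enabling per-column compression and reading only the columns a query needs.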
incubator-linkis: Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB, ...) and exposes various interfaces (REST, JDBC, Java, ...), with multi-tenancy, high performance, and resource control.
Stars: ✭ 2,459 (+121.93%)
Iceberg: Iceberg is a table format for large, slow-moving tabular data
Stars: ✭ 393 (-64.53%)
Sparkora: Powerful rapid automatic EDA and feature engineering library with a very easy to use API 🌟
Stars: ✭ 51 (-95.4%)
HybridBackend: Efficient training of deep recommenders in the cloud.
Stars: ✭ 30 (-97.29%)
datalake-etl-pipeline: Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.
Stars: ✭ 39 (-96.48%)
Sparkling Titanic: Training models with Apache Spark (PySpark) for the Titanic Kaggle competition
Stars: ✭ 12 (-98.92%)
OSCI: Open Source Contributor Index
Stars: ✭ 107 (-90.34%)
ChoETL: ETL framework for .NET / C# (parser/writer for CSV, flat, XML, JSON, key-value, Parquet, YAML, and Avro formatted files)
Stars: ✭ 372 (-66.43%)
pyspark-algorithms: PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2
Stars: ✭ 72 (-93.5%)
Gcs Tools: GCS support for avro-tools, parquet-tools, and protobuf
Stars: ✭ 57 (-94.86%)
centurion: Kotlin Big Data Toolkit
Stars: ✭ 320 (-71.12%)
qsv: CSVs sliced, diced & analyzed.
Stars: ✭ 438 (-60.47%)
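The kind of slicing a CSV toolkit like qsv performs can be approximated with Python's standard `csv` module (a minimal sketch, not qsv's CLI or behavior): keep the header and extract a window of data rows.

```python
import csv
import io

raw = "city,pop\nparis,2000000\nlyon,500000\nnice,340000\n"

def slice_csv(text: str, start: int, count: int):
    """Return the header plus `count` data rows beginning at data row `start`."""
    rows = list(csv.reader(io.StringIO(text)))
    return [rows[0]] + rows[1 + start : 1 + start + count]

# Header plus the 2nd and 3rd data rows.
sliced = slice_csv(raw, start=1, count=2)
```

A streaming tool would avoid materializing `rows` for large files; this version favors clarity over memory efficiency.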
jgit-spark-connector: jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis.
Stars: ✭ 71 (-93.59%)
experiments: Code examples for my blog posts
Stars: ✭ 21 (-98.1%)
pyspark-cassandra: pyspark-cassandra is a Python port of the awesome @datastax Spark Cassandra connector. Compatible with Spark 2.0, 2.1, 2.2, 2.3, and 2.4
Stars: ✭ 70 (-93.68%)
openmrs-fhir-analytics: A collection of tools for extracting FHIR resources and analytics services on top of that data.
Stars: ✭ 55 (-95.04%)
big data: A collection of tutorials on Hadoop, MapReduce, Spark, and Docker
Stars: ✭ 34 (-96.93%)
workshop-spark: Code for Spark workshops, with a Docker-based development environment
Stars: ✭ 27 (-97.56%)
Morphl Community Edition: MorphL Community Edition uses big data and machine learning to predict user behavior in digital products and services, with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization
Stars: ✭ 253 (-77.17%)
Gimel: Big Data Processing Framework - Unified Data API or SQL on Any Storage
Stars: ✭ 216 (-80.51%)
Optimus: 🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf, and pyspark
Stars: ✭ 986 (-11.01%)
Spark Practice: Apache Spark (PySpark) Practice on Real Data
Stars: ✭ 200 (-81.95%)
DataEngineering: This repo contains commands that data engineers use in day-to-day work.
Stars: ✭ 47 (-95.76%)
Elasticsearch loader: A tool for batch loading data files (JSON, Parquet, CSV, TSV) into Elasticsearch
Stars: ✭ 300 (-72.92%)
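Bulk loaders like this typically group documents into fixed-size batches before sending each batch in one request. A stdlib-only sketch of that batching step (illustrative only; not the tool's actual code or Elasticsearch's client API):

```python
from itertools import islice

def batched(iterable, size):
    """Yield successive lists of at most `size` items, e.g. for bulk indexing."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

docs = [{"id": i} for i in range(5)]
# Five documents in batches of two -> batch sizes 2, 2, 1.
batches = list(batched(docs, 2))
```

In a real loader, each `chunk` would be serialized and POSTed to the bulk endpoint; batching amortizes per-request overhead.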
Linkis: Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB, ...) and exposes various interfaces (REST, JDBC, Java, ...), with multi-tenancy, high performance, and resource control.
Stars: ✭ 2,323 (+109.66%)
graphique: GraphQL service for Arrow tables and Parquet data sets.
Stars: ✭ 28 (-97.47%)
sparklanes: A lightweight data processing framework for Apache Spark
Stars: ✭ 17 (-98.47%)
Rumble: ⛈️ Rumble 1.11.0 "Banyan Tree" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
Stars: ✭ 58 (-94.77%)
Awesome Spark: A curated list of awesome Apache Spark packages and resources.
Stars: ✭ 1,061 (-4.24%)
Sparkmagic: Jupyter magics and kernels for working with remote Spark clusters
Stars: ✭ 954 (-13.9%)
Tdigest: t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark
Stars: ✭ 274 (-75.27%)
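For context on what a t-Digest approximates: exact quantiles require holding (or sorting) all values, as in this stdlib baseline, whereas a t-Digest estimates the same cut points from a bounded-memory summary of a stream. This is only the exact baseline, not the Tdigest library's API:

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Exact quartiles via the stdlib (default "exclusive" method).
# A t-Digest would approximate these from a stream without
# keeping all values in memory, and digests can be merged,
# which is what makes it useful in distributed settings.
q1, median, q3 = statistics.quantiles(values, n=4)
```

Mergeability is the key distributed property: each PySpark partition can build its own digest, and the driver combines them into one estimate.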