Top 128 apache-spark open source projects

Data Accelerator for Apache Spark simplifies onboarding to streaming of big data. It offers a rich, easy-to-use experience for creating, editing, and managing Spark jobs on Azure HDInsight or Databricks while enabling the full power of the Spark engine.

✭ 247

react nodejs docker iot kafka azure spark streaming big-data apache-spark kafka-streams spark-streaming streaming-data

Mastering Spark Sql Book

The Internals of Spark SQL

✭ 234

book spark apache-spark

Pysparkling

A pure Python implementation of Apache Spark's RDD and DStream interfaces.

✭ 231

python data-science apache-spark data-processing

Awesome Ai Infrastructures

Infrastructures™ for Machine Learning Training/Inference in Production.

✭ 223

deep-learning machine-learning kubernetes awesome-list artificial-intelligence apache-spark quantization model-compression pruning

Spark Workshop

Apache Spark™ and Scala Workshops

✭ 224

html spark workshop apache-spark

Quinn

pyspark methods to enhance developer productivity 📣 👯 🎉

✭ 217

python apache-spark pyspark

Sparkrdma

RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark

✭ 215

java scala spark big-data hadoop bigdata apache-spark

Learning Apache Spark

Notes on Apache Spark (pyspark)

✭ 211

html machine-learning apache-spark

Analytics Zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray

✭ 2,448

python scala Jupyter Notebook shell java Dockerfile pytorch apache-spark keras-tensorflow bigdl distributed-deep-learning deep-neural-network analytics-zoo

Sparktorch

Train and run Pytorch models on Apache Spark.

✭ 195

python deep-learning pytorch inference distributed-computing apache-spark pipelines

Bigdata Playground

A complete example of a big data application using: Kubernetes (kops/AWS), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL

✭ 177

python typescript scala nodejs machine-learning docker angular graphql mongodb kafka big-data hadoop apache-spark twitter-api hbase avro parquet spark-streaming

Azure Cosmosdb Spark

Apache Spark Connector for Azure Cosmos DB

✭ 165

scala jupyter-notebook spark apache-spark pyspark connector

Whylogs Java

Profile and monitor your ML data pipeline end-to-end

✭ 164

java dataset spark statistics apache-spark

Spark Atlas Connector

A Spark Atlas connector to track data lineage in Apache Atlas

✭ 160

scala apache-spark

Cheatsheets.pdf

📚 Various cheatsheets in PDF

✭ 159

python r python3 jupyter-notebook deep-learning django keras jquery jupyter numpy pandas scikit-learn cheatsheet apache-spark ipython cheatsheets cheat-sheets django-framework

Spark With Python

Fundamentals of Spark with Python (using PySpark), code examples

✭ 150

python jupyter-notebook machine-learning database sql spark analytics big-data hadoop apache parallel-computing distributed-computing apache-spark dataframe pyspark hdfs

Albedo

A recommender system for discovering GitHub repos, built with Apache Spark

✭ 149

python scala machine-learning elasticsearch recommender-system apache-spark feature-engineering

Parquetviewer

Simple Windows desktop application for viewing & querying Apache Parquet files

✭ 145

big-data apache-spark dot-net parquet windows-desktop

Oryx

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

✭ 1,785

java scala shell machine-learning kafka apache-spark apache-kafka cloudera lambda-architecture oryx

Hydrograph

A visual ETL development and debugging tool for big data

✭ 144

java big-data etl apache-spark etl-framework

Scalable Data Science

Scalable Data Science: course sets in big data using Apache Spark on Databricks, and their mathematical, statistical, and computational foundations using SageMath.

✭ 142

scala html data-science apache-spark

Azure Event Hubs Spark

Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs

✭ 140

scala kafka azure spark streaming real-time stream microsoft apache bigdata apache-spark spark-streaming connector

Spark On Lambda

Apache Spark on AWS Lambda

✭ 137

scala aws serverless spark aws-lambda lambda big-data apache-spark

Spark Tpc Ds Performance Test

Use the TPC-DS benchmark to test Spark SQL performance

✭ 133

jupyter-notebook apache-spark ibmcode

Spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

Griffon Vm

Griffon Data Science Virtual Machine

✭ 128

python ruby scala r jupyter-notebook database mysql data-science elasticsearch big-data node-js virtual-machine hadoop apache-spark

Scala Spark Tutorial

Project for James' Apache Spark with Scala course

✭ 121

scala big-data apache-spark

Spark On K8s Operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

✭ 1,780

go shell kubernetes spark apache-spark kubernetes-operator kubernetes-controller kubernetes-crd google-cloud-dataproc

Splash

Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange

✭ 105

java scala spark storage bigdata apache-spark

Docker Spark

Apache Spark docker image

✭ 1,396

docker kubernetes dockerfile apache-spark

Pyspark Stubs

Apache (Py)Spark type annotations (stub files).

✭ 98

python apache-spark pyspark

Cuesheet

A framework for writing Spark 2.x applications in a pretty way

✭ 86

scala spark yarn apache-spark magic

Spark States

Custom state store providers for Apache Spark

✭ 83

scala spark apache state apache-spark spark-streaming

Mlflow

Open source platform for the machine learning lifecycle

✭ 10,898

python javascript java r scala CSS machine-learning ai ml apache-spark model-management mlflow

Awesome Pulsar

A curated list of Pulsar tools, integrations and resources.

✭ 57

spark prometheus messaging apache-spark apache-kafka grafana-dashboard pub-sub

Pulsar Spark

When Apache Pulsar meets Apache Spark

✭ 55

scala data-science spark apache-spark stream-processing flink data-processing batch-processing

Sparkit Learn

PySpark + Scikit-learn = Sparkit-learn

✭ 1,073

python machine-learning scikit-learn distributed-computing apache-spark

Awesome Spark

A curated list of awesome Apache Spark packages and resources.

✭ 1,061

awesome apache-spark pyspark

Spark Nkp

Natural Korean Processor for Apache Spark

✭ 50

scala nlp natural-language-processing spark apache-spark text-mining

Spark Sklearn

(Deprecated) Scikit-learn integration package for Apache Spark

✭ 1,055

python machine-learning scikit-learn apache-spark

Apache Spark Internals

The Internals of Apache Spark

✭ 1,045

book spark apache-spark

Spark As Service Using Embedded Server

This application comes as Spark2.1-as-Service-Provider using an embedded, Reactive-Streams-based, fully asynchronous HTTP server

✭ 46

scala rest-api spark embedded akka apache-spark akka-http

Spark Scala Maven Example

Example Maven configuration for a Spark, Scala project

✭ 45

scala maven apache-spark

Spark Tda

SparkTDA is a package for Apache Spark providing Topological Data Analysis Functionalities.

✭ 45

scala machine-learning spark ml apache-spark

Spark Examples

Spark examples

✭ 41

java spark apache-spark

Dblink

Distributed Bayesian Entity Resolution in Apache Spark

✭ 38

scala bayesian-inference apache-spark mcmc

Real Time Stream Processing Engine

This is an example of real time stream processing using Spark Streaming, Kafka & Elasticsearch.

✭ 37

scala elasticsearch kafka spark apache-spark spark-streaming

Cloud Based Sql Engine Using Spark

Cloud-based SQL engine using Spark, where data is accessible as a JDBC/ODBC data source via the Spark Thrift Server.

✭ 30

java jdbc apache-spark

Spark Flamegraph

Easy CPU Profiling for Apache Spark applications

✭ 30

shell spark apache-spark

Datahacksummit 2017

Apache Zeppelin notebooks for Recommendation Engines using Keras and Machine Learning on Apache Spark

✭ 30

jupyter-notebook deep-learning machine-learning keras apache-spark recommendation-engine

Spark Streaming Monitoring With Lightning

Plot live stats as a graph from an Apache Spark application using Lightning-viz

✭ 15

scala realtime bigdata apache-spark monitoring-tool spark-streaming

Live log analyzer spark

Spark application that analyzes Apache access logs and detects anomalies, with an accompanying Medium article.

✭ 14

python spark analytics apache-spark pyspark

Mobius

C# and F# language binding and extensions to Apache Spark

✭ 929

csharp fsharp dataset spark streaming bigdata apache-spark dataframe spark-streaming mapreduce

Goodreads etl pipeline

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

✭ 793

python spark s3 scheduler apache-spark airflow data-engineering etl-framework redshift

Sparklyr

R interface for Apache Spark

✭ 775

r machine-learning spark rstats ide distributed apache-spark dplyr

Kafka Storm Starter

Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.

✭ 728

scala kafka spark integration apache-spark avro apache-kafka storm

Dist Keras

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.

✭ 613

python deep-learning machine-learning tensorflow data-science keras hadoop apache-spark optimization-algorithms

Flintrock

A command-line tool for launching Apache Spark clusters.

✭ 568

python orchestration apache-spark ec2

Streaming Readings

Papers and reading materials on streaming systems