Top 128 apache-spark open source projects

Data Accelerator
Data Accelerator for Apache Spark simplifies onboarding to streaming of big data. It offers a rich, easy-to-use experience for creating, editing, and managing Spark jobs on Azure HDInsight or Databricks while enabling the full power of the Spark engine.
Mastering Spark Sql Book
The Internals of Spark SQL
Pysparkling
A pure Python implementation of Apache Spark's RDD and DStream interfaces.
Spark Workshop
Apache Spark™ and Scala Workshops
Quinn
pyspark methods to enhance developer productivity 📣 👯 🎉
Sparkrdma
RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Learning Apache Spark
Notes on Apache Spark (pyspark)
Bigdata Playground
A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL
Azure Cosmosdb Spark
Apache Spark Connector for Azure Cosmos DB
Whylogs Java
Profile and monitor your ML data pipeline end-to-end
Spark Atlas Connector
A Spark Atlas connector to track data lineage in Apache Atlas
Albedo
A recommender system for discovering GitHub repos, built with Apache Spark
Parquetviewer
Simple Windows desktop application for viewing & querying Apache Parquet files
Oryx
Oryx 2: Lambda architecture on Apache Spark and Apache Kafka for real-time, large-scale machine learning
Hydrograph
A visual ETL development and debugging tool for big data
Scalable Data Science
Scalable Data Science: course sets in big data using Apache Spark on Databricks, and their mathematical, statistical and computational foundations using SageMath.
Spark Tpc Ds Performance Test
Use the TPC-DS benchmark to test Spark SQL performance
Scala Spark Tutorial
Project for James' Apache Spark with Scala course
Spark On K8s Operator
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Splash
Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange
Docker Spark
Apache Spark Docker image
Pyspark Stubs
Apache (Py)Spark type annotations (stub files).
Cuesheet
A framework for writing Spark 2.x applications in a pretty way
Spark States
Custom state store providers for Apache Spark
Mlflow
Open source platform for the machine learning lifecycle
Awesome Pulsar
A curated list of Pulsar tools, integrations and resources.
Awesome Spark
A curated list of awesome Apache Spark packages and resources.
Spark Nkp
Natural Korean Processor for Apache Spark
Spark Sklearn
(Deprecated) Scikit-learn integration package for Apache Spark
Apache Spark Internals
The Internals of Apache Spark
Spark As Service Using Embedded Server
This application comes as Spark2.1-as-Service-Provider using an embedded, Reactive-Streams-based, fully asynchronous HTTP server
Spark Scala Maven Example
Example Maven configuration for a Spark, Scala project
Spark Tda
SparkTDA is a package for Apache Spark providing Topological Data Analysis Functionalities.
Dblink
Distributed Bayesian Entity Resolution in Apache Spark
Real Time Stream Processing Engine
This is an example of real time stream processing using Spark Streaming, Kafka & Elasticsearch.
Cloud Based Sql Engine Using Spark
Cloud-based SQL engine using Spark, where data is accessible as a JDBC/ODBC data source via Spark ThriftServer.
Spark Flamegraph
Easy CPU Profiling for Apache Spark applications
Datahacksummit 2017
Apache Zeppelin notebooks for Recommendation Engines using Keras and Machine Learning on Apache Spark
Spark Streaming Monitoring With Lightning
Plot live stats as a graph from an Apache Spark application using Lightning-viz
Live log analyzer spark
Spark application for analyzing Apache access logs and detecting anomalies, with an accompanying Medium article.
Mobius
C# and F# language binding and extensions to Apache Spark
Goodreads etl pipeline
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Kafka Storm Starter
Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.
Dist Keras
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
Flintrock
A command-line tool for launching Apache Spark clusters.
Openscoring
REST web service for true real-time scoring (<1 ms) of Scikit-Learn, R and Apache Spark models