Awesome Spark

A curated list of awesome Apache Spark packages and resources.

Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since (Wikipedia 2017).

Users of Apache Spark may choose between the Python, R, Scala, and Java programming languages to interface with the Apache Spark APIs.
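
As a quick illustration, a minimal PySpark program looks roughly like the sketch below; the Scala, Java, and R bindings expose the same DataFrame abstractions. The application name and sample data are placeholders.

    # Minimal PySpark sketch; the equivalent program can be written in Scala, Java, or R.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("awesome-spark-demo").getOrCreate()

    df = spark.createDataFrame([(1, "spark"), (2, "pyspark")], ["id", "name"])
    df.filter(df.id > 1).show()

    spark.stop()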

Contents

Packages

Language Bindings

Notebooks and IDEs

  • almond - A Scala kernel for Jupyter.
  • Apache Zeppelin - Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out of the box.
  • Polynote - IDE-inspired polyglot notebook originating from Netflix. It supports mixing multiple languages in one notebook and sharing data between them seamlessly, and encourages reproducible notebooks with its immutable data model.
  • Spark Notebook - Scalable and stable Scala- and Spark-focused notebook bridging the gap between the JVM and data scientists (incl. extendable, typesafe and reactive charts).
  • sparkmagic - Jupyter magics and kernels for interactively working with remote Spark clusters through Livy.

General Purpose Libraries

  • Succinct - Support for efficient queries on compressed data.
  • itachi - A library that brings useful functions from modern database management systems to Apache Spark.

SQL Data Sources

Spark SQL has several built-in data sources for files: CSV, JSON, Parquet, ORC, and Avro. It also supports JDBC databases as well as Apache Hive. Additional data sources can be added by including the packages listed below, or by writing your own.
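
For orientation, the sketch below shows how a data source is selected by format name. The file paths and the spark-xml package are assumptions used purely for illustration and are not part of this list.

    # Sketch of Spark's data source API; paths and the spark-xml package are illustrative.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Built-in file sources are selected by format name.
    csv_df = spark.read.format("csv").option("header", "true").load("data/input.csv")
    parquet_df = spark.read.parquet("data/input.parquet")

    # Third-party sources plug into the same API once their package is on the
    # classpath, e.g. spark-submit --packages com.databricks:spark-xml_2.12:0.15.0
    xml_df = spark.read.format("xml").option("rowTag", "record").load("data/input.xml")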

Storage

  • Delta Lake - Storage layer with ACID transactions.
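
A minimal sketch of Delta Lake usage follows, assuming the delta-core package is on the classpath and the session is configured with the Delta extensions; consult the Delta Lake docs for the exact coordinates for your Spark version.

    # Minimal Delta Lake sketch; the /tmp path is a placeholder and the package
    # (io.delta:delta-core) must be supplied separately, e.g. via --packages.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    df = spark.range(10)
    df.write.format("delta").mode("overwrite").save("/tmp/delta-table")  # ACID write
    spark.read.format("delta").load("/tmp/delta-table").show()           # snapshot read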

Bioinformatics

  • ADAM - Set of tools designed to analyse genomics data.
  • Hail - Genetic analysis framework.

GIS

  • Magellan - Geospatial analytics using Spark.
  • GeoSpark - Cluster computing system for processing large-scale spatial data.

Time Series Analytics

  • Spark-Timeseries - Scala / Java / Python library for interacting with time series data on Apache Spark.
  • flint - A time series library for Apache Spark.

Graph Processing

  • Mazerunner - Graph analytics platform on top of Neo4j and GraphX.
  • GraphFrames - DataFrame-based graph API (see the sketch after this list).
  • neo4j-spark-connector - Bolt protocol based, Neo4j Connector with RDD, DataFrame and GraphX / GraphFrames support.
  • SparklingGraph - Library extending GraphX features with multiple functionalities useful in graph analytics (measures, generators, link prediction etc.).
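
As a taste of the DataFrame-based graph API mentioned above, here is a minimal GraphFrames sketch; it assumes the graphframes package is available (e.g. via --packages), and the vertex and edge data are invented for illustration.

    # Minimal GraphFrames sketch; vertex and edge data are illustrative only.
    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    spark = SparkSession.builder.getOrCreate()

    vertices = spark.createDataFrame(
        [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
    edges = spark.createDataFrame(
        [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])

    g = GraphFrame(vertices, edges)
    g.inDegrees.show()                          # degree statistics as a DataFrame
    g.find("(x)-[]->(y); (y)-[]->(z)").show()   # simple motif query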

Machine Learning Extension

Middleware

  • Livy - REST server with extensive language support (Python, R, Scala), the ability to maintain interactive sessions, and object sharing (see the REST sketch after this list).
  • spark-jobserver - Simple Spark as a Service which supports object sharing using so-called named objects. JVM only.
  • Mist - Service for exposing Spark analytical jobs and machine learning models as realtime, batch or reactive web services.
  • Apache Toree - IPython protocol based middleware for interactive applications.
  • Kyuubi - Improved implementation of the Thrift JDBC/ODBC Server.
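
To make the Livy entry above concrete, the sketch below submits code to a Spark cluster over Livy's REST API. The server address is an assumption, and polling for results and error handling are omitted.

    # Minimal Livy REST sketch; http://localhost:8998 is an assumed server address.
    import requests

    livy = "http://localhost:8998"
    headers = {"Content-Type": "application/json"}

    # Start an interactive PySpark session (in practice, wait until its state is "idle").
    session = requests.post(f"{livy}/sessions",
                            json={"kind": "pyspark"}, headers=headers).json()

    # Run a statement in that session; poll GET /sessions/{id}/statements/{sid}
    # afterwards to retrieve the result.
    requests.post(f"{livy}/sessions/{session['id']}/statements",
                  json={"code": "spark.range(100).count()"}, headers=headers)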

Monitoring

Utilities

  • silex - Collection of tools varying from ML extensions to additional RDD methods.
  • sparkly - Helpers & syntactic sugar for PySpark.
  • pyspark-stubs - Static type annotations for PySpark (obsolete since Spark 3.1. See SPARK-32681).
  • Flintrock - A command-line tool for launching Spark clusters on EC2.
  • Optimus - Data cleansing and exploration utilities aimed at simplifying data cleaning.

Natural Language Processing

Streaming

  • Apache Bahir - Collection of streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).

Interfaces

  • Apache Beam - Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments.
  • Blaze - Interface for querying larger-than-memory datasets using pandas-like syntax. It supports both Spark DataFrames and RDDs.
  • Koalas - Pandas DataFrame API on top of Apache Spark.
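
A minimal Koalas sketch follows (on Spark 3.2+ the same pandas API ships with PySpark as pyspark.pandas); the sample data are invented for illustration.

    # Minimal Koalas sketch; sample data are illustrative only.
    import databricks.koalas as ks

    kdf = ks.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
    print(kdf.describe())           # pandas-style summary, computed by Spark
    print(kdf.groupby("x").sum())   # distributed group-by with pandas syntax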

Testing

  • deequ - Library built on top of Apache Spark for defining "unit tests for data" that measure data quality in large datasets.
  • spark-testing-base - Collection of base test classes.
  • spark-fast-tests - A lightweight and fast testing framework.
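
For context, the sketch below is a deliberately naive, hand-rolled DataFrame equality assertion of the kind that spark-testing-base and spark-fast-tests provide in richer, faster form; the helper name is hypothetical and not part of either library's API.

    # Hand-rolled sketch of a DataFrame equality assertion; assert_df_equal is a
    # hypothetical helper, not an API of the libraries listed above.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()

    def assert_df_equal(actual, expected):
        # Naive check: identical schema and identical rows (order-insensitive).
        assert actual.schema == expected.schema, "schemas differ"
        assert sorted(actual.collect()) == sorted(expected.collect()), "rows differ"

    result = spark.createDataFrame([(1, "a")], ["id", "tag"])
    expected = spark.createDataFrame([(1, "a")], ["id", "tag"])
    assert_df_equal(result, expected)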

Web Archives

Workflow Management

Resources

Books

Papers

MOOCS

Workshops

Projects Using Spark

  • Oryx 2 - Lambda architecture platform built on Apache Spark and Apache Kafka with specialization for real-time large scale machine learning.
  • Photon ML - A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model.
  • PredictionIO - Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.
  • Crossdata - Data integration platform with extended DataSource API and multi-user environment.

Blogs

  • Spark Technology Center - Great source of highly diverse posts related to the Spark ecosystem, from practical advice to Spark committer profiles.

Docker Images

Miscellaneous

References

Wikipedia. 2017. “Apache Spark — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=Apache_Spark&oldid=781182753.

License

Public Domain Mark
This work (Awesome Spark, by https://github.com/awesome-spark/awesome-spark), identified by Maciej Szymkiewicz, is free of known copyright restrictions.

Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation. This compilation is not endorsed by The Apache Software Foundation.

Inspired by sindresorhus/awesome.
