Awesome Spark

A curated list of awesome Apache Spark packages and resources.

Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since (Wikipedia 2017).

Users of Apache Spark may choose between the Python, R, Scala, and Java programming languages to interface with the Apache Spark APIs.
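
As a quick illustration, a minimal PySpark program looks roughly like the sketch below; the Scala, Java, and R bindings expose the same DataFrame abstractions. The application name and sample data are placeholders.

    # Minimal PySpark sketch; the equivalent program can be written in Scala, Java, or R.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("awesome-spark-demo").getOrCreate()

    df = spark.createDataFrame([(1, "spark"), (2, "pyspark")], ["id", "name"])
    df.filter(df.id > 1).show()

    spark.stop()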

Contents

Packages

Language Bindings

Notebooks and IDEs

  • almond - A Scala kernel for Jupyter.
  • Apache Zeppelin - Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out of the box.
  • Polynote - IDE-inspired polyglot notebook originating from Netflix. It supports mixing multiple languages in one notebook and sharing data between them seamlessly, and encourages reproducible notebooks with its immutable data model.
  • Spark Notebook - Scalable and stable Scala- and Spark-focused notebook bridging the gap between the JVM and data scientists (incl. extendable, typesafe and reactive charts).
  • sparkmagic - Jupyter magics and kernels for interactively working with remote Spark clusters through Livy.

General Purpose Libraries

  • Succinct - Support for efficient queries on compressed data.
  • itachi - A library that brings useful functions from modern database management systems to Apache Spark.

SQL Data Sources

Spark SQL has several built-in data sources for files: CSV, JSON, Parquet, ORC, and Avro. It also supports JDBC databases as well as Apache Hive. Additional data sources can be added by including the packages listed below, or by writing your own.
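
For orientation, the sketch below shows how a data source is selected by format name. The file paths and the spark-xml package are assumptions used purely for illustration and are not part of this list.

    # Sketch of Spark's data source API; paths and the spark-xml package are illustrative.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Built-in file sources are selected by format name.
    csv_df = spark.read.format("csv").option("header", "true").load("data/input.csv")
    parquet_df = spark.read.parquet("data/input.parquet")

    # Third-party sources plug into the same API once their package is on the
    # classpath, e.g. spark-submit --packages com.databricks:spark-xml_2.12:0.15.0
    xml_df = spark.read.format("xml").option("rowTag", "record").load("data/input.xml")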

Storage

  • Delta Lake - Storage layer with ACID transactions.
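
A minimal sketch of Delta Lake usage follows, assuming the delta-core package is on the classpath and the session is configured with the Delta extensions; consult the Delta Lake docs for the exact coordinates for your Spark version.

    # Minimal Delta Lake sketch; the /tmp path is a placeholder and the package
    # (io.delta:delta-core) must be supplied separately, e.g. via --packages.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    df = spark.range(10)
    df.write.format("delta").mode("overwrite").save("/tmp/delta-table")  # ACID write
    spark.read.format("delta").load("/tmp/delta-table").show()           # snapshot read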

Bioinformatics

  • ADAM - Set of tools designed to analyse genomics data.
  • Hail - Genetic analysis framework.

GIS

  • Magellan - Geospatial analytics using Spark.
  • GeoSpark - Cluster computing system for processing large-scale spatial data.

Time Series Analytics

  • Spark-Timeseries - Scala / Java / Python library for interacting with time series data on Apache Spark.
  • flint - A time series library for Apache Spark.

Graph Processing

  • Mazerunner - Graph analytics platform on top of Neo4j and GraphX.
  • GraphFrames - DataFrame-based graph API (see the sketch after this list).
  • neo4j-spark-connector - Bolt protocol based, Neo4j Connector with RDD, DataFrame and GraphX / GraphFrames support.
  • SparklingGraph - Library extending GraphX features with multiple functionalities useful in graph analytics (measures, generators, link prediction etc.).
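
As a taste of the DataFrame-based graph API mentioned above, here is a minimal GraphFrames sketch; it assumes the graphframes package is available (e.g. via --packages), and the vertex and edge data are invented for illustration.

    # Minimal GraphFrames sketch; vertex and edge data are illustrative only.
    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    spark = SparkSession.builder.getOrCreate()

    vertices = spark.createDataFrame(
        [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
    edges = spark.createDataFrame(
        [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])

    g = GraphFrame(vertices, edges)
    g.inDegrees.show()                          # degree statistics as a DataFrame
    g.find("(x)-[]->(y); (y)-[]->(z)").show()   # simple motif query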

Machine Learning Extension

Middleware

  • Livy - REST server with extensive language support (Python, R, Scala), the ability to maintain interactive sessions, and object sharing (see the REST sketch after this list).
  • spark-jobserver - Simple Spark as a Service which supports object sharing using so-called named objects. JVM only.
  • Mist - Service for exposing Spark analytical jobs and machine learning models as realtime, batch or reactive web services.
  • Apache Toree - IPython protocol based middleware for interactive applications.
  • Kyuubi - Improved implementation of the Thrift JDBC/ODBC Server.
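
To make the Livy entry above concrete, the sketch below submits code to a Spark cluster over Livy's REST API. The server address is an assumption, and polling for results and error handling are omitted.

    # Minimal Livy REST sketch; http://localhost:8998 is an assumed server address.
    import requests

    livy = "http://localhost:8998"
    headers = {"Content-Type": "application/json"}

    # Start an interactive PySpark session (in practice, wait until its state is "idle").
    session = requests.post(f"{livy}/sessions",
                            json={"kind": "pyspark"}, headers=headers).json()

    # Run a statement in that session; poll GET /sessions/{id}/statements/{sid}
    # afterwards to retrieve the result.
    requests.post(f"{livy}/sessions/{session['id']}/statements",
                  json={"code": "spark.range(100).count()"}, headers=headers)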

Monitoring

Utilities

  • silex - Collection of tools varying from ML extensions to additional RDD methods.
  • sparkly - Helpers & syntactic sugar for PySpark.
  • pyspark-stubs - Static type annotations for PySpark (obsolete since Spark 3.1. See SPARK-32681).
  • Flintrock - A command-line tool for launching Spark clusters on EC2.
  • Optimus - Data cleansing and exploration utilities aimed at simplifying data cleaning.

Natural Language Processing

Streaming

  • Apache Bahir - Collection of streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).

Interfaces

  • Apache Beam - Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments.
  • Blaze - Interface for querying larger-than-memory datasets using pandas-like syntax. It supports both Spark DataFrames and RDDs.
  • Koalas - Pandas DataFrame API on top of Apache Spark.
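
A minimal Koalas sketch follows (on Spark 3.2+ the same pandas API ships with PySpark as pyspark.pandas); the sample data are invented for illustration.

    # Minimal Koalas sketch; sample data are illustrative only.
    import databricks.koalas as ks

    kdf = ks.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
    print(kdf.describe())           # pandas-style summary, computed by Spark
    print(kdf.groupby("x").sum())   # distributed group-by with pandas syntax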

Testing

  • deequ - Library built on top of Apache Spark for defining "unit tests for data" that measure data quality in large datasets.
  • spark-testing-base - Collection of base test classes.
  • spark-fast-tests - A lightweight and fast testing framework.
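
For context, the sketch below is a deliberately naive, hand-rolled DataFrame equality assertion of the kind that spark-testing-base and spark-fast-tests provide in richer, faster form; the helper name is hypothetical and not part of either library's API.

    # Hand-rolled sketch of a DataFrame equality assertion; assert_df_equal is a
    # hypothetical helper, not an API of the libraries listed above.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()

    def assert_df_equal(actual, expected):
        # Naive check: identical schema and identical rows (order-insensitive).
        assert actual.schema == expected.schema, "schemas differ"
        assert sorted(actual.collect()) == sorted(expected.collect()), "rows differ"

    result = spark.createDataFrame([(1, "a")], ["id", "tag"])
    expected = spark.createDataFrame([(1, "a")], ["id", "tag"])
    assert_df_equal(result, expected)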

Web Archives

Workflow Management

Resources

Books

Papers

MOOCS

Workshops

Projects Using Spark

  • Oryx 2 - Lambda architecture platform built on Apache Spark and Apache Kafka with specialization for real-time large scale machine learning.
  • Photon ML - A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model.
  • PredictionIO - Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.
  • Crossdata - Data integration platform with extended DataSource API and multi-user environment.

Blogs

  • Spark Technology Center - Great source of highly diverse posts related to the Spark ecosystem, from practical advice to Spark committer profiles.

Docker Images

Miscellaneous

References

Wikipedia. 2017. “Apache Spark — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=Apache_Spark&oldid=781182753.

License

Public Domain Mark
This work (Awesome Spark, by https://github.com/awesome-spark/awesome-spark), identified by Maciej Szymkiewicz, is free of known copyright restrictions.

Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation. This compilation is not endorsed by The Apache Software Foundation.

Inspired by sindresorhus/awesome.
