Top 625 spark open source projects

Wedatasphere
WeDataSphere is a financial-grade, one-stop open-source suite for big data platforms. The source code of Scriptis and Linkis has already been released to the open-source community. WeDataSphere, Big Data Made Easy!
Spark Structured Streaming Book
The Internals of Spark Structured Streaming
Sparkmeasure
This is the development repository of SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task metrics data.
Sidekick
High Performance HTTP Sidecar Load Balancer
Kyuubi
Kyuubi is a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark
Metorikku
A simplified, lightweight ETL Framework based on Apache Spark
Sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
Sparkstreaming
Real-time log analysis and statistics with Spark Streaming + Flume + Kafka + HBase + Hadoop + Zookeeper; data visualization with Spring Boot + ECharts
Oap
Optimized Analytics Package for Spark* Platform
Scalnet
A Scala wrapper for Deeplearning4j, inspired by Keras. Scala + DL + Spark + GPUs
Iql
An ad hoc query service based on the Spark SQL engine.
Ytk Learn
Ytk-learn is a distributed machine learning library that implements most of the popular machine learning algorithms (GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).
Wirbelsturm
Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.
Sparklint
A tool for monitoring and tuning Spark jobs for efficiency.
Cook
Fair job scheduler on Kubernetes and Mesos for batch workloads and Spark
Clickhouse Native Jdbc
ClickHouse Native Protocol JDBC implementation
Coolplayspark
Cool Play Spark (酷玩 Spark): Spark source code analysis, Spark libraries, and more
Learningsparkv2
This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
Crayon
Simple framework agnostic UI router for SPAs
Delta
An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads.
Spline
Data Lineage Tracking And Visualization Solution
Zat
Zeek Analysis Tools (ZAT): Processing and analysis of Zeek network data with Pandas, scikit-learn, Kafka and Spark
Awesome Ada
A curated list of awesome resources related to the Ada and SPARK programming languages
Elasticluster
Create clusters of VMs on the cloud and configure them with Ansible.
Spark Hbase Connector
Connect Spark to HBase for reading and writing data with ease
Spark Notebook
Interactive and Reactive Data Science using Scala and Spark.
Spark Druid Olap
Sparkline BI Accelerator provides fast ad hoc query capability over Logical Cubes. This has been folded into our SNAP Platform (http://bit.ly/2oBJSpP), an integrated BI platform on Apache Spark.
Cloudflow
Cloudflow enables users to quickly develop, orchestrate, and operate distributed streaming applications on Kubernetes.
Hbase Rdd
Spark RDD to read, write and delete from HBase
Datavec
ETL Library for Machine Learning - data pipelines, data munging and wrangling
Docker Spark Cluster
A simple Spark standalone cluster for your testing environment purposes
Sk Dist
Distributed scikit-learn meta-estimators in PySpark
Spark Jupyter Aws
A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support
Succinct
Enabling queries on compressed data.
Big Data Rosetta Code
Code snippets for solving common big data problems in various platforms. Inspired by Rosetta Code
Ibis
A pandas-like deferred expression system, with first-class SQL support
laravel-spark-camera
Profile Photo Camera support for Laravel Spark
sparkProjectTemplate.g8
Template for Spark Projects
Book
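This project collects some good books I have read or heard about over the years. I came across them while organizing my files; deleting them seemed a pity, and keeping them took up too much space, so in the spirit of sharing I am making them available. This both provides the books to readers who need them and serves as a kind of accumulated knowledge base.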
kafka-spark-streaming-zeppelin-docker
One-click docker-compose deployment of Kafka, Spark Streaming, Zeppelin UI and Monitoring (Grafana + Kafka Manager)
spark-http-stream
Spark Structured Streaming via HTTP communication
basin
Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser
dllib
dllib is a distributed deep learning library running on Apache Spark
spark learning
Learning Spark with the latest 2019 edition of the 尚硅谷 (Atguigu) Big Data Spark course
spark-data-sources
Developing Spark External Data Sources using the V2 API
prosto
Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby
confluent-spark-avro
Spark UDFs to deserialize Avro messages with schemas stored in Schema Registry.
Covid19Tracker
A Robinhood style COVID-19 🦠 Android tracking app for the US. Open source and built with Kotlin.
SparkV
🤖⚡ | The most POWERFUL multipurpose chat/meme bot that will boost the activity in your server.
spark-extension
A library that provides useful extensions to Apache Spark and PySpark.