Top 625 spark open source projects

WeDataSphere is a financial-grade, one-stop open-source suite for big data platforms. The source code of Scriptis and Linkis has already been released to the open-source community. WeDataSphere, Big Data Made Easy!

✭ 372

kafka spark ide hadoop scheduler etl hive hbase mask portal

Spark Structured Streaming Book

The Internals of Spark Structured Streaming

✭ 371

book spark apache-spark

Sparkmeasure

This is the development repository of SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task metrics data.

✭ 368

scala spark apache-spark performance-metrics

Sidekick

High Performance HTTP Sidecar Load Balancer

✭ 366

go kubernetes proxy spark bigdata load-balancer

Kyuubi

Kyuubi is a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark

✭ 363

scala sql spark analytics yarn jdbc hive thrift sql-query multi-tenant odbc multi-tenancy

Metorikku

A simplified, lightweight ETL Framework based on Apache Spark

✭ 361

scala sql spark big-data etl distributed-computing etl-framework

Sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

✭ 362

java search spark distributed-systems big-data search-engine information-retrieval solr web-crawler

Sparkstreaming

Real-time log analysis and statistics with Spark Streaming + Flume + Kafka + HBase + Hadoop + Zookeeper; data visualization with SpringBoot + Echarts

✭ 349

java scala spark

Oap

Optimized Analytics Package for Spark* Platform

✭ 343

scala spark parquet

Sparklens

Qubole Sparklens tool for performance tuning Apache Spark

✭ 345

scala performance spark simulation cluster scheduler performance-analysis scheduling performance-metrics performance-tuning performance-visualization

Scalnet

A Scala wrapper for Deeplearning4j, inspired by Keras. Scala + DL + Spark + GPUs

✭ 342

scala spark deeplearning sbt deeplearning4j

Iql

An ad hoc query service based on the Spark SQL engine.

✭ 341

javascript spark

Ytk Learn

Ytk-learn is a distributed machine learning library that implements most popular machine learning algorithms (GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).

✭ 337

java machine-learning spark distributed hadoop logistic-regression factorization-machines

Wirbelsturm

Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.

✭ 332

shell kafka spark puppet vagrant apache-spark apache-kafka storm

Sparklint

A tool for monitoring and tuning Spark jobs for efficiency.

✭ 316

scala spark performance-analysis

Cook

Fair job scheduler on Kubernetes and Mesos for batch workloads and Spark

✭ 314

clojure kubernetes spark cluster scheduler mesos

Clickhouse Native Jdbc

ClickHouse Native Protocol JDBC implementation

✭ 310

java database spark analytics jdbc clickhouse

Coolplayspark

Cool Play Spark: Spark source code analysis, Spark libraries, and more

✭ 3,318

scala spark apache-spark spark-streaming sparkcore structured-streaming

Learningsparkv2

This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]

✭ 307

scala spark apache-spark

Crayon

Simple framework agnostic UI router for SPAs

✭ 310

typescript react vue spark router svelte

Delta

An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads.

✭ 3,903

scala python java spark analytics big-data acid

Spline

Data Lineage Tracking And Visualization Solution

✭ 306

scala visualization spark tracking hadoop bigdata

Zat

Zeek Analysis Tools (ZAT): Processing and analysis of Zeek network data with Pandas, scikit-learn, Kafka and Spark

✭ 303

python jupyter-notebook security kafka networking spark data-analysis pandas scikit-learn

Awesome Ada

A curated list of awesome resources related to the Ada and SPARK programming languages

✭ 299

awesome spark

Elasticluster

Create clusters of VMs on the cloud and configure them with Ansible.

✭ 298

python cloud ansible azure spark cluster clustering hadoop gcp hpc ec2

Spark Hbase Connector

Connect Spark to HBase for reading and writing data with ease

✭ 299

scala spark hbase

Spark Notebook

Interactive and Reactive Data Science using Scala and Spark.

✭ 3,081

javascript scala Jupyter Notebook HTML Less CSS data-science spark reactive notebook apache-spark

Spark Druid Olap

Sparkline BI Accelerator provides fast ad-hoc query capability over Logical Cubes. This has been folded into our SNAP Platform (http://bit.ly/2oBJSpP), an integrated BI platform on Apache Spark.

✭ 282

scala spark business-intelligence

Cloudflow

Cloudflow enables users to quickly develop, orchestrate, and operate distributed streaming applications on Kubernetes.

✭ 278

scala kubernetes spark akka flink streaming-data

Hbase Rdd

Spark RDD to read, write and delete from HBase

✭ 277

scala spark hbase

Datavec

ETL Library for Machine Learning - data pipelines, data munging and wrangling

✭ 272

java machine-learning spark schema pipeline formatter etl transformations

Helk

The Hunting ELK

✭ 3,097

Jupyter Notebook CSS shell jupyter-notebook docker elasticsearch spark kibana logstash threat-hunting elk elastic elk-stack dockerhub hunting hunting-platforms

Docker Spark Cluster

A simple Spark standalone cluster for your testing environment

✭ 261

dockerfile docker-compose spark developer-tools bigdata

Around Dataengineering

A Data Engineering & Machine Learning Knowledge Hub

✭ 257

machine-learning devops spark infrastructure datascience airflow data-engineering

Sk Dist

Distributed scikit-learn meta-estimators in PySpark

✭ 260

python machine-learning data-science spark scikit-learn ml

Spark Jupyter Aws

A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support

✭ 259

jupyter-notebook aws spark jupyter aws-s3 apache-spark ec2 aws-ec2

Succinct

Enabling queries on compressed data.

✭ 257

java scala spark big-data compression

Big Data Rosetta Code

Code snippets for solving common big data problems on various platforms. Inspired by Rosetta Code

✭ 254

scala spark bigdata

Ibis

A pandas-like deferred expression system, with first-class SQL support

✭ 1,630

python C++ pandas hadoop hdfs spark impala ibis

spark-structured-streaming-examples

Spark Structured Streaming examples using version 3.0.0

✭ 23

scala shell Batchfile spark apache-spark structured-streaming spark-sql spark-structured-streaming delta-lake

laravel-spark-camera

Profile Photo Camera support for Laravel Spark

✭ 30

javascript HTML PHP laravel spark camera laravel-spark

sparkProjectTemplate.g8

Template for Spark Projects

✭ 77

scala shell spark g8 apachespark

Book

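This project collects some good books that I have read or heard about over the years. I came across them while tidying up my files: deleting them seemed a pity, but keeping them wasted space. In the spirit of sharing, I am making them available, both to offer these books to readers who need them and to build up something like a knowledge base.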

✭ 47

mysql linux pdf spark spring

kafka-spark-streaming-zeppelin-docker

One click deploy docker-compose with Kafka, Spark Streaming, Zeppelin UI and Monitoring (Grafana + Kafka Manager)

✭ 82

docker streaming kafka spark docker-compose zeppelin spark-streaming-kafka spark-kafka kafka-spark kafka-spark-streaming kafka-zeppelin spark-zeppelin

spark-http-stream

Spark Structured Streaming via HTTP communication

✭ 17

scala java http spark spark-structured-streaming

basin

Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser

✭ 25

typescript Vue python javascript HTML Dockerfile shell emr spark hadoop pipeline etl pyspark informatica odi

daf-kylo

Kylo integration with PDND (previously DAF).

✭ 20

java shell Dockerfile groovy javascript Makefile docker kubernetes elasticsearch spark mariadb activemq nifi kylo daf daf-core

dllib

dllib is a distributed deep learning library running on Apache Spark

✭ 32

CSS scala HTML shell javascript spark deep-learning mllib

Spotify-Song-Recommendation-ML

UC Berkeley team's submission for RecSys Challenge 2018

✭ 70

Jupyter Notebook python spotify data-science data-mining spark collaborative-filtering data-analysis recommender-system spark-mllib song-recommender

spark learning

Atguigu (尚硅谷) Big Data Spark, latest 2019 edition: Spark learning

✭ 42

scala java spark spark-sql spark-core

spark-data-sources

Developing Spark External Data Sources using the V2 API

✭ 36

java scala spark spark-sql data-sources

prosto

Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby

✭ 54