Top 625 spark open source projects

Silex
something to help you spark
Waimak
Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
Zemberek Nlp Server
REST Docker server built on the Zemberek Turkish NLP Java library
Pyspark Examples
Code examples on Apache Spark using Python
Rumble
⛈️ Rumble 1.11.0 "Banyan Tree"🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
Docker Spark Cluster
A Spark cluster setup running on Docker containers
Model Serving Tutorial
Code and presentation for Strata Model Serving tutorial
Awesome Pulsar
A curated list of Pulsar tools, integrations and resources.
Net.jgp.labs.spark
Apache Spark examples exclusively in Java
Docker Hadoop
A Docker container with a full Hadoop cluster setup with Spark and Zeppelin
Utils4s
A collection of test cases and related materials gathered while using Scala and Spark
Spark Submit Ui
A Play Framework-based application for submitting Spark apps
Spark Nkp
Natural Korean Processor for Apache Spark
Apache Spark Internals
The Internals of Apache Spark
Awesome Recommendation Engine
A small project that pulls together what I learned in the Big Data Expert course from formacionhadoop.com. It shows how to work with Apache Spark Streaming, Kafka, MongoDB, and Spark machine learning algorithms.
Spark As Service Using Embedded Server
This application comes as Spark2.1-as-Service-Provider using an embedded, Reactive-Streams-based, fully asynchronous HTTP server
Spark Tda
SparkTDA is a package for Apache Spark providing Topological Data Analysis Functionalities.
Delta Architecture
Streaming data changes to a Data Lake with Debezium and Delta Lake pipeline
Gatk
Official code repository for GATK versions 4 and up
Azure Kusto Spark
Apache Spark Connector for Azure Kusto
Pixiedust
Python Helper library for Jupyter Notebooks
Snappydata
Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster
Optimus
🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Real Time Stream Processing Engine
This is an example of real time stream processing using Spark Streaming, Kafka & Elasticsearch.
Weblogsanalysissystem
A big data platform for analyzing web access logs
Learning Spark
Learning Spark from scratch; big data learning
Vagrant Projects
Vagrant projects for various use-cases with Spark, Zeppelin, IPython / Jupyter, SparkR
Spark Flamegraph
Easy CPU Profiling for Apache Spark applications
Sparkmagic
Jupyter magics and kernels for working with remote Spark clusters
Pucket
Bucketing and partitioning system for Parquet
Data Algorithms Book
MapReduce, Spark, Java, and Scala for Data Algorithms Book
Heracles
High performance HBase / Spark SQL engine
Spark
Apache Spark - A unified analytics engine for large-scale data processing
Interview Questions Collection
Interview questions organized by knowledge area, including C++, Java, Hadoop, machine learning, and more
Flint
A Time Series Library for Apache Spark
Tedsds
Apache Spark - Turbofan Engine Degradation Simulation Data Set example in Apache Spark
Live log analyzer spark
Spark application that analyzes Apache access logs and detects anomalies. Accompanied by a Medium article.
Urhox
Urho3D extension library
Sparkling Titanic
Training models with Apache Spark, PySpark for Titanic Kaggle competition
Mlfeature
Feature engineering toolkit for Spark MLlib.
Mare
MaRe leverages the power of Docker and Spark to run and scale your serial tools in MapReduce fashion.
Sparkjni
A heterogeneous Apache Spark framework.
Bigdata Interview
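🎯 🌟 [Big data interview questions] A collection of big data interview questions gathered from the web, along with my own answer summaries. Currently covers interview knowledge for the Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper frameworks.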
Dockerfiles
50+ DockerHub public images for Docker & Kubernetes - Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak, TeamCity and DevOps tools built on the major Linux distros: Alpine, CentOS, Debian, Fedora, Ubuntu
Tiledb Vcf
Efficient variant-call data storage and retrieval library using the TileDB storage library.
Spark Swagger
Spark (http://sparkjava.com/) support for Swagger (https://swagger.io/)
Mobius
C# and F# language binding and extensions to Apache Spark
Chronicler
Scala toolchain for InfluxDB
Kylo
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
Spark Scala Tutorial
A free tutorial for Apache Spark.
Parquet Generator
Parquet file generator