Top 625 spark open source projects

bigdata-fun
A complete (distributed) BigData stack, running in containers
Casper
A compiler for automatically re-targeting sequential Java code to Apache Spark.
smolder
HL7 Apache Spark Datasource
spark-demos
Collection of different demo applications using Apache Spark
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
tpch-spark
TPC-H queries in Apache Spark SQL using native DataFrames API
incubator-linkis
Linkis makes it easy to connect to various back-end computation/storage engines (Spark, Python, TiDB...) and exposes various interfaces (REST, JDBC, Java...), with multi-tenancy, high performance, and resource control.
Spark-PMoF
Spark Shuffle Optimization with RDMA+AEP
BigData-News
A real-time big data system project for a news website, based on Spark 2.2
leaflet heatmap
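A simple visualization of call data from Huzhou. Assuming the data volume is too large to draw a heatmap directly in the browser, the heatmap rendering step is moved to offline computation and analysis. After the data is computed in parallel with Apache Spark, Apache Spark is also used to draw the heatmap, and leafletjs then loads the OpenStreetMap layer and the heatmap layer to provide good interactivity. The rendering is currently implemented with Apache Spark; either Apache Spark is not well suited to this kind of computation or I did not design the algorithm well, because the parallel computation is slower than single-machine computation. The Apache Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git .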
docker-spark
Apache Spark docker container image (Standalone mode)
spark-acid
ACID Data Source for Apache Spark based on Hive ACID
spark-word2vec
A parallel implementation of word2vec based on Spark
Search Ads Web Service
Online search advertisement platform & Realtime Campaign Monitoring [Maybe Deprecated]
spark-gradle-template
Apache Spark in your IDE with gradle
spark-util
Low-level helpers for Apache Spark libraries and tests
openverse-catalog
Identifies and collects data on CC-licensed content across web crawl data and public APIs.
awesome-AI-kubernetes
❄️ 🐳 Awesome tools and libs for AI, Deep Learning, Machine Learning, Computer Vision, Data Science, Data Analytics and Cognitive Computing that are baked in the oven to be Native on Kubernetes and Docker with Python, R, Scala, Java, C#, Go, Julia, C++ etc
spark-druid-olap
Sparkline BI Accelerator provides fast ad-hoc query capability over Logical Cubes. This has been folded into our SNAP Platform (http://bit.ly/2oBJSpP), an integrated BI platform on Apache Spark.
ODSC India 2018
My presentation at ODSC India 2018 about Deep Learning with Apache Spark
swordfish
Open-source distributed workflow scheduling tool that also supports streaming tasks.
sparkar-volts
An extensive non-reactive TypeScript framework that eases the development experience in Spark AR
experiments
Code examples for my blog posts
fastdata-cluster
Fast Data Cluster (Apache Cassandra, Kafka, Spark, Flink, YARN and HDFS with Vagrant and VirtualBox)
splink
Implementation of Fellegi-Sunter's canonical model of record linkage in Apache Spark, including EM algorithm to estimate parameters
visualize-data-with-python
A Jupyter notebook using some standard techniques for data science and data engineering to analyze data for the 2017 flooding in Houston, TX.
spark-druid-connector
A library for querying Druid data sources with Apache Spark
ceu-cloud-class
This is the repo for the Data Engineering 3 - Cloud and Big Data Computing course delivered at the Central European University ceu.edu
microframeworks-showcase
A simple grocery list web application implemented with the microframeworks Spark Java, Jodd, Ninja, Javalite, Pippo and Ratpack
pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
pathling
Turn your FHIR data set into a powerful API that can be used to develop analytics applications and augment data science workflows.
dlsa
Distributed least squares approximation (dlsa) implemented with Apache Spark
flytekit
Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started with and learn, and highly extensible.
geotrellis-pointcloud
GeoTrellis PointCloud library to work with any pointcloud data on Spark
zoe
Zoe: Container Analytics as a Service -- mirror of https://gitlab.eurecom.fr/zoe/main/
example-health-machine-learning
This code pattern shows you how to train a machine learning model to predict type 2 diabetes using synthesized patient health records.
data-processing-pipeline
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka and Cassandra
TitanDataOperationSystem
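The best big data project. The "Titan Data Operation System" is a full-stack, closed-loop system: a web system for data visualization, log ingestion through a Flume-Kafka-Flume pipeline, a data warehouse designed in Hive, Spark code that transforms data between warehouse tables and migrates ADS-layer tables to MySQL, and Azkaban for scheduling timed jobs. Technologies used: Java/Scala, Hadoop, Spark, Hive, Kafka, Flume, Azkaban, SpringBoot, Bootstrap, Echart, etc.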
EasySparse
Sparse learning in TensorFlow using data acquired from Spark.
EngineeringTeam
A place for organizing the materials of the YBIGTA engineering team.