rubenafo / Docker Spark Cluster
A Spark cluster setup running on Docker containers
Stars: ✭ 57
Projects that are alternatives of or similar to Docker Spark Cluster
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+163.16%)
Mutual labels: spark, big-data, hadoop
Gaffer
A large-scale entity and relation database supporting aggregation of properties
Stars: ✭ 1,642 (+2780.7%)
Mutual labels: spark, big-data, hadoop
H2o 3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Stars: ✭ 5,656 (+9822.81%)
Mutual labels: spark, big-data, hadoop
Bigdl
Building Large-Scale AI Applications for Distributed Big Data
Stars: ✭ 3,813 (+6589.47%)
Mutual labels: spark, big-data, hadoop
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (+94.74%)
Mutual labels: big-data, spark, hadoop
Sparkrdma
RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Stars: ✭ 215 (+277.19%)
Mutual labels: spark, big-data, hadoop
leaflet heatmap
A simple visualization of Huzhou call data. Assuming the data volume is too large for a browser to render the heatmap directly, the rendering step is moved offline for computation and analysis. Apache Spark computes the data in parallel and then renders the heatmap; leafletjs then loads an OpenStreetMap layer plus the heatmap layer for good interactivity. With the current Spark-based rendering, parallel computation is slower than single-machine computation, perhaps because Spark is not well suited to this kind of work or because the algorithm is poorly designed. The Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git .
Stars: ✭ 13 (-77.19%)
Mutual labels: big-data, spark, hadoop
bigdata-fun
A complete (distributed) BigData stack, running in containers
Stars: ✭ 14 (-75.44%)
Mutual labels: big-data, spark, hadoop
Data Science Ipython Notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Stars: ✭ 22,048 (+38580.7%)
Mutual labels: spark, big-data, hadoop
Spark Movie Lens
An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset
Stars: ✭ 745 (+1207.02%)
Mutual labels: spark, big-data
Bigdataguide
Learn big data from scratch: learning videos and interview materials for every stage of a big data curriculum.
Stars: ✭ 817 (+1333.33%)
Mutual labels: spark, hadoop
Hadoop For Geoevent
ArcGIS GeoEvent Server sample Hadoop connector for storing GeoEvents in HDFS.
Stars: ✭ 5 (-91.23%)
Mutual labels: big-data, hadoop
Zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Stars: ✭ 5,513 (+9571.93%)
Mutual labels: spark, big-data
Dockerfiles
50+ DockerHub public images for Docker & Kubernetes - Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak, TeamCity and DevOps tools built on the major Linux distros: Alpine, CentOS, Debian, Fedora, Ubuntu
Stars: ✭ 847 (+1385.96%)
Mutual labels: spark, hadoop
Bigdata Interview
🎯 🌟 [Big Data Interview Questions] A collection of big-data interview questions gathered from around the web, together with the author's own answer summaries. Currently covers the Hadoop/Hive/Spark/Flink/HBase/Kafka/ZooKeeper frameworks.
Stars: ✭ 857 (+1403.51%)
Mutual labels: spark, hadoop
Alluxio
Alluxio, data orchestration for analytics and machine learning in the cloud
Stars: ✭ 5,379 (+9336.84%)
Mutual labels: spark, hadoop
Kylo
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
Stars: ✭ 916 (+1507.02%)
Mutual labels: spark, hadoop
docker-spark-cluster
Build your own Spark cluster setup in Docker.
A multinode Spark installation where each node of the network runs in its own separate Docker container.
The installation takes care of the Hadoop & Spark configuration, providing:
- a Debian image with Scala and Java (scalabase image)
- four fully configured Spark nodes running on Hadoop (sparkbase image):
- nodemaster (master node)
- node2 (slave)
- node3 (slave)
- node4 (slave)
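The four nodes above share a user-defined Docker network so each one gets a stable address (the admin URLs later in this README use 172.18.1.x). A hedged sketch of that layout, which prints the docker commands it would run rather than executing them; the network name, subnet, and image tag are illustrative assumptions, not the project's actual values:

```shell
#!/usr/bin/env bash
# Illustrative sketch only: one user-defined bridge network plus four
# containers, one per node. Network name, subnet, and image tag are guesses.
show_layout() {
  echo "docker network create --subnet 172.18.0.0/16 sparknet"
  for node in nodemaster node2 node3 node4; do
    echo "docker run -d --name $node --hostname $node --net sparknet sparkbase"
  done
}
show_layout
```

In the real setup, cluster.sh handles this wiring for you.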
Motivation
You can run Spark in a (boring) standalone setup or create your own network to hold a full cluster setup inside Docker instead.
I find the latter much more fun:
- experiment with a more realistic network setup
- tweak node configurations
- simulate scalability, downtime, and rebalancing by adding/removing nodes from the network automagically
There is a Medium article related to this: https://medium.com/@rubenafo/running-a-spark-cluster-setup-in-docker-containers-573c45cceabf
Installation
- Clone this repository
- cd scalabase
- ./build.sh # This builds the base Java+Scala Debian container from openjdk9
- cd ../spark
- ./build.sh # This builds the sparkbase image
- Run ./cluster.sh deploy
- The script finishes by displaying the Hadoop and Spark admin URLs:
- Hadoop info @ nodemaster: http://172.18.1.1:8088/cluster
- Spark info @ nodemaster : http://172.18.1.1:8080/
- DFS Health @ nodemaster : http://172.18.1.1:9870/dfshealth.html
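Once the deploy finishes, one way to smoke-test the cluster is to submit Spark's bundled SparkPi example through YARN from inside the master container. This is a hedged sketch: the nodemaster container name matches the node list above, but the spark-submit invocation and the examples jar path inside the image are assumptions and may differ.

```shell
#!/usr/bin/env bash
# Hypothetical smoke test: submit the bundled SparkPi example via YARN.
# The jar path and $SPARK_HOME layout are guesses about the sparkbase image.
smoke_test() {
  if command -v docker >/dev/null 2>&1 \
     && docker ps --format '{{.Names}}' 2>/dev/null | grep -qx nodemaster; then
    docker exec nodemaster bash -lc '
      spark-submit \
        --master yarn \
        --deploy-mode cluster \
        --class org.apache.spark.examples.SparkPi \
        "$SPARK_HOME"/examples/jars/spark-examples*.jar 100'
  else
    # Degrade gracefully when the cluster is not running.
    echo "nodemaster container not running; skipping smoke test"
  fi
}
smoke_test
```

If the job succeeds, it should appear as a finished application in the Hadoop UI at the nodemaster URL above.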
Options
cluster.sh stop # Stop the cluster
cluster.sh start # Start the cluster
cluster.sh info # Shows handy URLs of the running cluster
# Warning! This will remove everything from HDFS
cluster.sh deploy # Format the cluster and deploy images again
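The subcommands above follow the usual pattern of a small dispatcher that fans each action out over the four containers. A minimal sketch of that pattern, illustrative only: the real cluster.sh also formats HDFS on deploy, and this sketch prints the Docker commands instead of executing them.

```shell
#!/usr/bin/env bash
# Illustrative sketch of a cluster.sh-style dispatcher; prints the
# Docker commands it would run rather than executing them.
NODES="nodemaster node2 node3 node4"

cluster() {
  case "${1:-info}" in
    start) for n in $NODES; do echo "docker start $n"; done ;;
    stop)  for n in $NODES; do echo "docker stop $n"; done ;;
    info)  echo "Hadoop info @ nodemaster: http://172.18.1.1:8088/cluster"
           echo "Spark info  @ nodemaster: http://172.18.1.1:8080/"
           echo "DFS Health  @ nodemaster: http://172.18.1.1:9870/dfshealth.html" ;;
    *)     echo "usage: cluster {start|stop|info}" >&2; return 1 ;;
  esac
}

cluster info
```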