
mvillarrealb / Docker Spark Cluster

A simple Spark standalone cluster for your testing environment purposes

Projects that are alternatives of or similar to Docker Spark Cluster

Ecommercerecommendsystem
A real-time big-data product recommendation system. Front end: Vue + TypeScript + ElementUI; back end: Spring + Spark
Stars: ✭ 139 (-46.74%)
Mutual labels:  spark, bigdata
Big Data Rosetta Code
Code snippets for solving common big data problems in various platforms. Inspired by Rosetta Code
Stars: ✭ 254 (-2.68%)
Mutual labels:  spark, bigdata
Azure Event Hubs Spark
Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs
Stars: ✭ 140 (-46.36%)
Mutual labels:  spark, bigdata
Lambda Arch
Applying Lambda Architecture with Spark, Kafka, and Cassandra.
Stars: ✭ 111 (-57.47%)
Mutual labels:  spark, bigdata
data processing course
Some class materials for a data processing course using PySpark
Stars: ✭ 50 (-80.84%)
Mutual labels:  spark, bigdata
Hadoopcryptoledger
Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive
Stars: ✭ 126 (-51.72%)
Mutual labels:  spark, bigdata
Javaorbigdata Interview
A collection of interview knowledge points for Java developers and big data developers
Stars: ✭ 203 (-22.22%)
Mutual labels:  spark, bigdata
Bigdata Notes
A beginner's guide to big data ⭐
Stars: ✭ 10,991 (+4111.11%)
Mutual labels:  spark, bigdata
Every Single Day I Tldr
A daily digest of the articles or videos I've found interesting, that I want to share with you.
Stars: ✭ 249 (-4.6%)
Mutual labels:  spark, bigdata
Dpark
Python clone of Spark, a MapReduce-like framework in Python
Stars: ✭ 2,668 (+922.22%)
Mutual labels:  spark, bigdata
Sparktutorial
Source code for James Lee's Apache Spark with Java course
Stars: ✭ 105 (-59.77%)
Mutual labels:  spark, bigdata
leaflet heatmap
A simple visualization of Huzhou call data. Assuming the data volume is too large to render a heatmap directly in the browser, the rendering step is moved offline for computation and analysis: Apache Spark processes the data in parallel and then renders the heatmap, after which leafletjs loads the OpenStreetMap layer and the heatmap layer for a smooth interactive experience. With the current Apache Spark implementation, parallel computation is actually slower than single-machine computation, possibly because Apache Spark is not well suited to this kind of work or because my algorithm is not well designed. The Apache Spark heatmap rendering and computation code is here: https://github.com/yuanzhaokang/ParallelizeHeatmap.git .
Stars: ✭ 13 (-95.02%)
Mutual labels:  spark, bigdata
Splash
Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange
Stars: ✭ 105 (-59.77%)
Mutual labels:  spark, bigdata
Spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
Stars: ✭ 1,721 (+559.39%)
Mutual labels:  spark, bigdata
Bigdata Notebook
Stars: ✭ 100 (-61.69%)
Mutual labels:  spark, bigdata
Kotlin Spark Api
This project gives Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x
Stars: ✭ 183 (-29.89%)
Mutual labels:  spark, bigdata
Cleanframes
type-class based data cleansing library for Apache Spark SQL
Stars: ✭ 75 (-71.26%)
Mutual labels:  spark, bigdata
Spark Py Notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (+412.64%)
Mutual labels:  spark, bigdata
Sparkrdma
RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Stars: ✭ 215 (-17.62%)
Mutual labels:  spark, bigdata
yuzhouwan
Code Library for My Blog
Stars: ✭ 39 (-85.06%)
Mutual labels:  spark, bigdata

Spark Cluster with Docker & docker-compose

General

A simple Spark standalone cluster for your testing environment purposes. Your Spark development environment is just a docker-compose up away.

The Docker compose will create the following containers:

Container        IP address
spark-master     10.5.0.2
spark-worker-1   10.5.0.3
spark-worker-2   10.5.0.4
spark-worker-3   10.5.0.5
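
Once the cluster is up (see "Run the docker-compose" below), you can confirm these address assignments with a plain docker network inspect; the network name used here is the one that also appears in the spark-submit example further down.

# List every container attached to the cluster network with its IPv4 address
docker network inspect --format '{{range .Containers}}{{println .Name .IPv4Address}}{{end}}' docker-spark-cluster_spark-network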

Installation

The following steps will get your Spark cluster's containers up and running.

Prerequisites

  • Docker installed

  • Docker compose installed

  • A Spark application JAR to play with (optional)
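
If you are not sure whether both tools are present, a quick version check will tell you before you build anything:

# Confirm Docker and docker-compose are installed and on the PATH
docker --version
docker-compose --version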

Build the images

The first step to deploy the cluster is to build the custom images; these builds can be performed with the build-images.sh script.

The execution is as simple as the following steps:

chmod +x build-images.sh
./build-images.sh

This will create the following docker images:

  • spark-base:2.3.1: A base image based on java:alpine-jdk-8 which ships Scala, Python 3 and Spark 2.3.1

  • spark-master:2.3.1: An image based on the previously created Spark base image, used to create Spark master containers.

  • spark-worker:2.3.1: An image based on the previously created Spark base image, used to create Spark worker containers.

  • spark-submit:2.3.1: An image based on the previously created Spark base image, used to create spark-submit containers (run, deliver the driver, and exit gracefully).
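
To confirm the build produced these images, you can list them with a plain Docker command:

# The four images above should show up, each tagged 2.3.1
docker images | grep -E 'spark-(base|master|worker|submit)'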

Run the docker-compose

The final step to create your test cluster is to run the compose file:

docker-compose up --scale spark-worker=3
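
If you prefer to keep your terminal free, the same command works detached, and docker-compose ps lists the running containers:

# Start the cluster in the background with three workers, then list the containers
docker-compose up -d --scale spark-worker=3
docker-compose ps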

Validate your cluster

Just validate your cluster by accessing the Spark UI of the master and each worker at the URLs below.

Spark Master

http://10.5.0.2:8080/


Spark Worker 1

http://10.5.0.3:8081/


Spark Worker 2

http://10.5.0.4:8081/


Spark Worker 3

http://10.5.0.5:8081/

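If you prefer the command line, a quick reachability probe against the same addresses and ports looks like this (a simple HTTP check, not a full health check):

# Probe the master (8080) and worker (8081) web UIs listed above
for url in http://10.5.0.2:8080 http://10.5.0.3:8081 http://10.5.0.4:8081 http://10.5.0.5:8081; do
  curl -sf -o /dev/null "$url" && echo "OK   $url" || echo "DOWN $url"
done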

Resource Allocation

This cluster ships with three workers and one Spark master, each with a particular resource allocation (basically RAM and CPU core allocation).

  • The default CPU core allocation for each Spark worker is 1 core.

  • The default RAM for each Spark worker is 1024 MB.

  • The default RAM allocation for Spark executors is 256 MB.

  • The default RAM allocation for the Spark driver is 128 MB.

  • If you wish to modify these allocations, just edit the env/spark-worker.sh file (a sketch of that file follows this list).
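
As a reference, here is a sketch of what env/spark-worker.sh could look like with the defaults listed above. The variable names below are the standard Spark standalone settings plus the driver/executor memory variables; check the actual file in the repository, since its exact contents may differ.

# Hypothetical contents of env/spark-worker.sh reflecting the defaults above
SPARK_WORKER_CORES=1        # CPU cores per worker
SPARK_WORKER_MEMORY=1024m   # RAM per worker
SPARK_DRIVER_MEMORY=128m    # RAM for the Spark driver
SPARK_EXECUTOR_MEMORY=256m  # RAM per Spark executor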

Bind-Mounted Volumes

To make running apps easier, I've shipped two volume mounts, described in the following chart:

Host Mount        Container Mount     Purpose
/mnt/spark-apps   /opt/spark-apps     Makes your app's JARs available to all workers and the master
/mnt/spark-data   /opt/spark-data     Makes your app's data available to all workers and the master

This is basically a dummy DFS created from Docker volumes... (well, maybe not...)
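
If the host directories from the chart above do not exist yet, create them before starting the cluster (adjust ownership and permissions to your needs):

# Create the host-side mount points used by the chart above
sudo mkdir -p /mnt/spark-apps /mnt/spark-data
sudo chown -R "$USER":"$USER" /mnt/spark-apps /mnt/spark-data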

Run a sample application

Now let's make a wild spark-submit to validate the distributed nature of our new toy, following these steps:

Create a Scala spark app

The first thing you need to do is make a Spark application. Our spark-submit image is designed to run Scala code (PySpark support will ship soon; I guess I was just too lazy to add it).

In my case I am using an app called crimes-app. You can make or use your own Scala app; I've just used this one because I had it at hand.
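
For reference, and assuming crimes-app is a Gradle project (the build/libs path used in the next step suggests so), producing the bundle might look like the sketch below; with sbt the equivalent would be sbt package and a JAR under target/.

# Hypothetical build step for the example app; adapt to your own project layout
cd /home/workspace/crimes-app
./gradlew build   # produces build/libs/crimes-app.jar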

Ship your JAR & dependencies to the Workers and Master

A necessary step before making a spark-submit is to copy your application bundle to all workers, along with any configuration or input files you need.

Luckily for us, we are using Docker volumes, so you just have to copy your app and configs into /mnt/spark-apps, and your input files into /mnt/spark-files.

# Copy the Spark application into all workers' app folder
cp /home/workspace/crimes-app/build/libs/crimes-app.jar /mnt/spark-apps

# Copy the Spark application configs into all workers' app folder
cp -r /home/workspace/crimes-app/config /mnt/spark-apps

# Copy the file to be processed into all workers' data folder
cp /home/Crimes_-_2001_to_present.csv /mnt/spark-files

Check the successful copy of the data and app jar (Optional)

This is not a necessary step; if you are curious, you can check that your app code and files are in place before running the spark-submit.

# Worker 1 Validations
docker exec -ti spark-worker-1 ls -l /opt/spark-apps

docker exec -ti spark-worker-1 ls -l /opt/spark-data

# Worker 2 Validations
docker exec -ti spark-worker-2 ls -l /opt/spark-apps

docker exec -ti spark-worker-2 ls -l /opt/spark-data

# Worker 3 Validations
docker exec -ti spark-worker-3 ls -l /opt/spark-apps

docker exec -ti spark-worker-3 ls -l /opt/spark-data

After running one of these commands, you should see your app's JAR and files.

Use docker spark-submit

# Creating some variables to make the docker run command more readable
# App JAR environment variable used by the spark-submit image
SPARK_APPLICATION_JAR_LOCATION="/opt/spark-apps/crimes-app.jar"
# App main class environment variable used by the spark-submit image
SPARK_APPLICATION_MAIN_CLASS="org.mvb.applications.CrimesApp"
# Extra submit args used by the spark-submit image
SPARK_SUBMIT_ARGS="--conf spark.executor.extraJavaOptions='-Dconfig-path=/opt/spark-apps/dev/config.conf'"

# We have to use the same network as the Spark cluster (internally the image resolves the Spark master as spark://spark-master:7077)
docker run --network docker-spark-cluster_spark-network \
-v /mnt/spark-apps:/opt/spark-apps \
--env SPARK_APPLICATION_JAR_LOCATION=$SPARK_APPLICATION_JAR_LOCATION \
--env SPARK_APPLICATION_MAIN_CLASS=$SPARK_APPLICATION_MAIN_CLASS \
--env SPARK_SUBMIT_ARGS="$SPARK_SUBMIT_ARGS" \
spark-submit:2.3.1

After running this, you will see output pretty much like this:

Running Spark using the REST application submission protocol.
2018-09-23 15:17:52 INFO  RestSubmissionClient:54 - Submitting a request to launch an application in spark://spark-master:6066.
2018-09-23 15:17:53 INFO  RestSubmissionClient:54 - Submission successfully created as driver-20180923151753-0000. Polling submission state...
2018-09-23 15:17:53 INFO  RestSubmissionClient:54 - Submitting a request for the status of submission driver-20180923151753-0000 in spark://spark-master:6066.
2018-09-23 15:17:53 INFO  RestSubmissionClient:54 - State of driver driver-20180923151753-0000 is now RUNNING.
2018-09-23 15:17:53 INFO  RestSubmissionClient:54 - Driver is running on worker worker-20180923151711-10.5.0.4-45381 at 10.5.0.4:45381.
2018-09-23 15:17:53 INFO  RestSubmissionClient:54 - Server responded with CreateSubmissionResponse:
{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20180923151753-0000",
  "serverSparkVersion" : "2.3.1",
  "submissionId" : "driver-20180923151753-0000",
  "success" : true
}
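
Since the submission went through the standalone master's REST endpoint on port 6066, you can poll the driver's state with the same API (substitute the submissionId returned above):

# Query the status of the driver created by the submission above
curl -s http://10.5.0.2:6066/v1/submissions/status/driver-20180923151753-0000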

Summary (What have I done :O?)

  • We built the necessary Docker images to run the Spark master and worker containers.

  • We created a Spark standalone cluster with 3 worker nodes and 1 master node using docker && docker-compose.

  • We copied the resources necessary to run a sample application.

  • We submitted an application to the cluster using a spark-submit Docker image.

  • We ran a distributed application at home (you just need enough CPU cores and RAM to do so).

Why a standalone cluster?

  • This is intended for testing purposes: basically, a way of running distributed Spark apps on your laptop or desktop.

  • Right now I don't have enough resources to build a YARN, Mesos or Kubernetes based cluster :(.

  • This will be useful for building CI/CD pipelines for your Spark apps (a really difficult and hot topic).
