


Spark docker

Docker images to:

  • Set up a standalone Apache Spark cluster running one Spark master and multiple Spark workers
  • Build Spark applications in Java, Scala or Python to run on a Spark cluster
Currently supported versions:
  • Spark 3.1.1 for Hadoop 3.2 with OpenJDK 8 and Scala 2.12
  • Spark 3.1.1 for Hadoop 3.2 with OpenJDK 11 and Scala 2.12
  • Spark 3.0.2 for Hadoop 3.2 with OpenJDK 8 and Scala 2.12
  • Spark 3.0.1 for Hadoop 3.2 with OpenJDK 8 and Scala 2.12
  • Spark 3.0.0 for Hadoop 3.2 with OpenJDK 11 and Scala 2.12
  • Spark 3.0.0 for Hadoop 3.2 with OpenJDK 8 and Scala 2.12
  • Spark 2.4.5 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.4.4 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.4.3 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.4.1 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.4.0 for Hadoop 2.8 with OpenJDK 8 and Scala 2.12
  • Spark 2.4.0 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.3.2 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.3.1 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.3.1 for Hadoop 2.8 with OpenJDK 8
  • Spark 2.3.0 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.2.2 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.2.1 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.2.0 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.1.3 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.1.2 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.1.1 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.1.0 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.0.2 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.0.1 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.0.0 for Hadoop 2.7+ with Hive support and OpenJDK 8
  • Spark 2.0.0 for Hadoop 2.7+ with Hive support and OpenJDK 7
  • Spark 1.6.2 for Hadoop 2.6 and later
  • Spark 1.5.1 for Hadoop 2.6 and later
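
Image tags encode the Spark and Hadoop versions. As a quick sketch, pulling the images for the first combination listed above looks like this (other combinations appear to follow the same <spark-version>-hadoop<hadoop-version> pattern; check the bde2020 repositories on Docker Hub for the exact set of published tags):

docker pull bde2020/spark-master:3.1.1-hadoop3.2
docker pull bde2020/spark-worker:3.1.1-hadoop3.2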

Using Docker Compose

Add the following services to your docker-compose.yml to integrate a Spark master and two Spark workers into your BDE pipeline:

spark-master:
  image: bde2020/spark-master:3.1.1-hadoop3.2
  container_name: spark-master
  ports:
    - "8080:8080"
    - "7077:7077"
  environment:
    - INIT_DAEMON_STEP=setup_spark
spark-worker-1:
  image: bde2020/spark-worker:3.1.1-hadoop3.2
  container_name: spark-worker-1
  depends_on:
    - spark-master
  ports:
    - "8081:8081"
  environment:
    - "SPARK_MASTER=spark://spark-master:7077"
spark-worker-2:
  image: bde2020/spark-worker:3.1.1-hadoop3.2
  container_name: spark-worker-2
  depends_on:
    - spark-master
  ports:
    - "8081:8081"
  environment:
    - "SPARK_MASTER=spark://spark-master:7077"

Make sure to fill in the INIT_DAEMON_STEP as configured in your pipeline.
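
With these services in your docker-compose.yml, a minimal sketch for bringing the cluster up and down looks like this (the master web UI is then available on http://localhost:8080 and the first worker's UI on http://localhost:8081, per the port mappings above):

docker-compose up -d
docker-compose logs -f spark-master
docker-compose down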

Running Docker containers without the init daemon

Spark Master

To start a Spark master:

docker run --name spark-master -h spark-master -e ENABLE_INIT_DAEMON=false -d bde2020/spark-master:3.1.1-hadoop3.2
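
This command publishes no ports. If you want to reach the master web UI and cluster port from the host, a variant with the same port mappings as the Compose example above is:

docker run --name spark-master -h spark-master -e ENABLE_INIT_DAEMON=false -p 8080:8080 -p 7077:7077 -d bde2020/spark-master:3.1.1-hadoop3.2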

Spark Worker

To start a Spark worker:

docker run --name spark-worker-1 --link spark-master:spark-master -e ENABLE_INIT_DAEMON=false -d bde2020/spark-worker:3.1.1-hadoop3.2
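
Further workers can be started the same way under different container names. Note that --link is a legacy Docker feature; an equivalent sketch using a user-defined bridge network (the network name spark-net is arbitrary) would be:

docker network create spark-net
docker run --name spark-master -h spark-master --network spark-net -e ENABLE_INIT_DAEMON=false -d bde2020/spark-master:3.1.1-hadoop3.2
docker run --name spark-worker-1 --network spark-net -e ENABLE_INIT_DAEMON=false -e "SPARK_MASTER=spark://spark-master:7077" -d bde2020/spark-worker:3.1.1-hadoop3.2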

Launch a Spark application

Building and running your Spark application on top of the Spark cluster is as simple as extending a template Docker image. Check the template's README for further documentation.
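
For a quick smoke test without building an application image, you can also submit the stock SparkPi example from a throwaway container, as in the following sketch (it assumes the standard Spark distribution layout inside the image, i.e. the examples jar under ./spark/examples/jars/ named for your Spark and Scala versions, and that all containers share the default bridge network so the workers can reach back to the driver):

docker run --rm -it --link spark-master:spark-master bde2020/spark-base:3.1.1-hadoop3.2 ./spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://spark-master:7077 ./spark/examples/jars/spark-examples_2.12-3.1.1.jar 10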

Kubernetes deployment

The BDE Spark images can also be used in a Kubernetes environment.

To deploy a simple Spark standalone cluster, issue:

kubectl apply -f https://raw.githubusercontent.com/big-data-europe/docker-spark/master/k8s-spark-cluster.yaml

This will set up a Spark standalone cluster with one master and a worker on every available node, using the default namespace and resources. The master is reachable in the same namespace at spark://spark-master:7077. It also sets up a headless service so that Spark clients are reachable from the workers under the hostname spark-client.
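
A quick way to verify the deployment (assuming the manifest uses the service names mentioned above; pod names will differ):

kubectl get pods
kubectl get service spark-master spark-client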

Then, to use spark-shell, issue:

kubectl run spark-base --rm -it --labels="app=spark-client" --image bde2020/spark-base:3.1.1-hadoop3.2 -- bash ./spark/bin/spark-shell --master spark://spark-master:7077 --conf spark.driver.host=spark-client
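
Once the shell is up, a small job such as sc.parallelize(1 to 100).count() is a quick way to confirm that the workers are reachable; the running application should also show up in the master's web UI.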

To use spark-submit, issue, for example:

kubectl run spark-base --rm -it --labels="app=spark-client" --image bde2020/spark-base:3.1.1-hadoop3.2 -- bash ./spark/bin/spark-submit --class CLASS_TO_RUN --master spark://spark-master:7077 --deploy-mode client --conf spark.driver.host=spark-client URL_TO_YOUR_APP
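
CLASS_TO_RUN and URL_TO_YOUR_APP are placeholders for your application's main class and a jar location the cluster can fetch. As a concrete sketch using the SparkPi example bundled with the Spark distribution (assuming the examples jar is present in the image at the standard path; adjust the jar name to your Spark and Scala versions):

kubectl run spark-base --rm -it --labels="app=spark-client" --image bde2020/spark-base:3.1.1-hadoop3.2 -- bash ./spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://spark-master:7077 --deploy-mode client --conf spark.driver.host=spark-client ./spark/examples/jars/spark-examples_2.12-3.1.1.jar 10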

You can use your own image packed with Spark and your application, but when deployed the driver must be reachable from the workers. One way to achieve this is to create a headless service for your pod and then use --conf spark.driver.host=YOUR_HEADLESS_SERVICE whenever you submit your application.
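
A minimal sketch of that pattern with plain kubectl (the names my-spark-driver and my-spark-app are hypothetical; kubectl create service sets the selector app=<name>, so the pod's label must match it, and YOUR_IMAGE is assumed to lay out Spark the same way as the bde2020 images):

kubectl create service clusterip my-spark-driver --clusterip="None"
kubectl run my-spark-app --rm -it --labels="app=my-spark-driver" --image YOUR_IMAGE -- ./spark/bin/spark-submit --master spark://spark-master:7077 --conf spark.driver.host=my-spark-driver URL_TO_YOUR_APP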
