lynnlangit / Spark-Scala-EKS

License: Apache-2.0
Spark Scala docker container sample for AWS testing - EKS & S3

Programming Languages

HCL, Shell, Scala, Dockerfile


Scala Spark Docker Example

This example includes:

  • Two Spark/Scala ML example docker containers and sample data
    • Can be run locally (instructions below)
    • Can be run on AWS EKS (Kubernetes) & S3 (no EMR/HDFS needed)
      • AWS-Setup-Guide-Spark-EKS.md lists setup steps for Spark on AWS EKS
      • NOTE: we used the 'kops' tool for this example, as it was required by EKS at the time we wrote this example.

Updated Example Code

Derived from these sources:

  1. SGD Linear Regression Example with Apache Spark by Walker Rowe, published May 23, 2017.
  2. That example shows a linear regression and has been modified here to run as a standalone app rather than in the interactive shell. Update: the newer example, Linear Regression from the Spark documentation, adds model serialization/deserialization and a split between training and test data.
  3. Further reference: Predicting Breast Cancer Using Apache Spark Machine Learning Logistic Regression by Carol McDonald, published October 17, 2016.
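The updated flow (a train/test split plus model serialization and deserialization) can be sketched with Spark's DataFrame-based ML API as follows. This is a minimal sketch rather than this repo's exact code: the input path, model directory, and object name are illustrative assumptions.

```scala
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.sql.SparkSession

object LinearRegressionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("LinearRegressionSketch")
      .getOrCreate()

    // Load libsvm-formatted data (path is illustrative)
    val data = spark.read.format("libsvm").load("sample_linear_regression_data.txt")

    // Hold out 30% of the rows as a test set
    val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

    val model = new LinearRegression()
      .setMaxIter(10)
      .setRegParam(0.3)
      .fit(training)

    // Serialize the fitted model to disk, then load it back
    model.write.overwrite().save("/tmp/lr-model")
    val restored = LinearRegressionModel.load("/tmp/lr-model")

    // Produces label / features / prediction columns
    restored.transform(test).select("label", "features", "prediction").show()

    spark.stop()
  }
}
```

Run it the same way as the repo's example (sbt run), with the Spark dependencies from build.sbt on the classpath.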

How to Run Locally

This is an sbt project. Assuming you have working installations of Scala and sbt, execute sbt run from the project root. In addition to the log4j output from Spark, you should also see a few lines of output from our example:

18/04/17 15:15:56 INFO TaskSetManager: Finished task 0.0 in stage 14.0 (TID 17) in 21 ms on localhost (executor driver) (1/1)
18/04/17 15:15:56 INFO TaskSchedulerImpl: Removed TaskSet 14.0, whose tasks have all completed, from pool
18/04/17 15:15:56 INFO DAGScheduler: ResultStage 14 (show at Main.scala:55) finished in 0.022 s
18/04/17 15:15:56 INFO DAGScheduler: Job 12 finished: show at Main.scala:55, took 0.026455 s
18/04/17 15:15:56 INFO CodeGenerator: Code generated in 8.628902 ms
+-------------------+--------------------+-------------------+
|              label|            features|         prediction|
+-------------------+--------------------+-------------------+
|-28.571478869743427|(10,[0,1,2,3,4,5,...|-1.5332357772511678|
|-26.736207182601724|(10,[0,1,2,3,4,5,...|-3.1990639907463776|
|-22.949825936196074|(10,[0,1,2,3,4,5,...|  2.068559275392233|
|-20.212077258958672|(10,[0,1,2,3,4,5,...| 0.5963989456221626|
|-17.026492264209548|(10,[0,1,2,3,4,5,...|-0.7387387189956682|
|-15.348871155379253|(10,[0,1,2,3,4,5,...|  -1.98575929759793|
|-13.039928064104615|(10,[0,1,2,3,4,5,...| 0.5942050121612523|
| -12.92222310337042|(10,[0,1,2,3,4,5,...|  2.203905559769596|
|-12.773226999251197|(10,[0,1,2,3,4,5,...| -2.736222698097398|
|-12.558575788856189|(10,[0,1,2,3,4,5,...|0.10007973294293643|
|-12.479280211451497|(10,[0,1,2,3,4,5,...|-0.9022515201372355|
| -12.46765638103286|(10,[0,1,2,3,4,5,...|-1.4621820914334354|
|-11.904986902675114|(10,[0,1,2,3,4,5,...|-0.3122307364002444|
| -11.87816749996684|(10,[0,1,2,3,4,5,...| 0.1338819458914437|
| -11.43180236554046|(10,[0,1,2,3,4,5,...| 0.5248457739492374|
|-11.328415936777782|(10,[0,1,2,3,4,5,...| 0.1542916456260936|
|-11.039347808253828|(10,[0,1,2,3,4,5,...|-1.3518353509995789|
|-10.600130341909033|(10,[0,1,2,3,4,5,...| 0.4030016168294734|
|-10.293714040655924|(10,[0,1,2,3,4,5,...| -1.364529194363915|
| -9.892155927826222|(10,[0,1,2,3,4,5,...| -1.068980736429676|
+-------------------+--------------------+-------------------+
only showing top 20 rows

18/04/17 15:15:56 INFO SparkUI: Stopped Spark web UI at http://192.168.86.21:4040
18/04/17 15:15:56 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/04/17 15:15:56 INFO MemoryStore: MemoryStore cleared
18/04/17 15:15:56 INFO BlockManager: BlockManager stopped
18/04/17 15:15:56 INFO BlockManagerMaster: BlockManagerMaster stopped
18/04/17 15:15:56 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/04/17 15:15:56 INFO SparkContext: Successfully stopped SparkContext

Run in Docker container

Assuming you have a working local Docker installation, execute sbt docker:publishLocal to create the Docker image.

Once the command completes, execute docker images to view the image. You should see output similar to the following:

REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
sagemaker-spark     0.1                 29dc8b3b2dc8        20 seconds ago      379MB
openjdk             jre-alpine          b1bd879ca9b3        2 months ago        82MB

Now start a container from the image by executing docker run --rm sagemaker-spark:0.1. You should see output very similar to that of the local run.

Note: You may see an error related to insufficient memory, like the one shown below. If so, try increasing your Docker engine's memory allocation to 4 GB.

18/04/05 18:40:29 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: System memory 466092032 must be at least 471859200. Please increase heap size using the --driver-memory option or spark.driver.memory in Spark configuration.
      at org.apache.spark.memory.UnifiedMemoryManager$.getMaxMemory(UnifiedMemoryManager.scala:216)
      at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:198)
      at org.apache.spark.SparkEnv$.create(SparkEnv.scala:330)
      at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:174)
      at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
      at org.apache.spark.SparkContext.<init>(SparkContext.scala:432)
      at FullModel.Main$.main(Main.scala:17)
      at FullModel.Main.main(Main.scala)
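As the stack trace shows, the driver JVM's heap is below Spark's minimum (roughly 450 MB). Besides raising Docker's memory allocation, you can give the packaged app a larger heap at build time. The fragment below is a sketch, assuming sbt-native-packager forwards -J-prefixed options in the Universal scope to the generated start script as JVM flags:

```scala
// build.sbt fragment (sketch): request a 1 GB max heap for the packaged app,
// comfortably above Spark's ~450 MB minimum driver memory check
javaOptions in Universal ++= Seq("-J-Xmx1g")
```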

Setup Scala Project including Docker

Reference: Lightweight docker containers for Scala apps by Jeroen Rosenberg, published August 14, 2017

  1. Create an object Main with a main function. Applications that extend scala.App will not work correctly.
  2. Add the example code.
  3. Since the example code does not demonstrate how to establish a SparkContext, add the following (with the matching import at the top of the file):
    import org.apache.spark.{SparkConf, SparkContext}

    // Startup
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("SGD")
      .set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)
  4. Make sure to clean up the context by adding the following at the end of the example:
    // Shut down
    sc.stop()
  5. Update build.sbt to run locally
    1. Scala version must be no greater than 2.11.x
    2. Spark version should be 2.1.0; the dependencies should look like the following:
      libraryDependencies ++= {
        val sparkVer = "2.1.0"
        Seq(
          "org.apache.spark" %% "spark-core" % sparkVer,
          "org.apache.spark" %% "spark-mllib" % sparkVer
        )
      }
    3. Update project/build.properties, setting sbt.version to 0.13.17
  6. Add docker support
    1. Add project/plugins.sbt file with the following contents
      addSbtPlugin("com.typesafe.sbt" % "sbt-native-packager" % "1.2.1")
    2. Update build.sbt
      1. Add the following to the bottom of the file:
        enablePlugins(JavaAppPackaging)
        enablePlugins(DockerPlugin)
        enablePlugins(AshScriptPlugin)
        
        mainClass in Compile := Some("FullModel.Main")
        
        dockerBaseImage := "openjdk:jre-alpine"
        
        mappings in Universal += file("lpsa.data") -> "lpsa.data"
      2. The SBT commands enable plugins that:
        1. Allow our app to be packaged as one or more jars with an executable shell script to run it.
        2. Publish a Docker image with the packaged app.
        3. Use ash as the shell instead of bash (needed for Alpine-based images).
      3. Next we declare the mainClass so that the generated app script knows which class to execute.
      4. Then we instruct the Docker plugin to use a smaller Alpine-based image rather than the default Debian-based image.
      5. Finally, we provide a file mapping which instructs the packaging system to include our data file in the staging directory that is later used to construct the image.
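Putting steps 1 through 4 together, a minimal Main object might look like the sketch below. The line format of lpsa.data (a label, a comma, then space-separated features) follows Walker Rowe's article and is an assumption, as is the use of the deprecated-but-available LinearRegressionWithSGD API in Spark 2.1:

```scala
package FullModel

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Sketch: an explicit main method, not scala.App (step 1)
object Main {
  def main(args: Array[String]): Unit = {
    // Startup (step 3)
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("SGD")
      .set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)

    // Parse "label,f1 f2 ..." lines into LabeledPoints (format assumed)
    val data = sc.textFile("lpsa.data").map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble,
        Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }.cache()

    // Train a linear model with stochastic gradient descent
    val model = LinearRegressionWithSGD.train(data, 100)
    println(s"Weights: ${model.weights}")

    // Shut down (step 4)
    sc.stop()
  }
}
```

sbt run executes this main method locally, and sbt docker:publishLocal packages it behind the mainClass declared in build.sbt.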
