
tweag / Sparkle

Licence: other
Haskell on Apache Spark.

Programming Languages

haskell
3896 projects

Projects that are alternatives of or similar to Sparkle

Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-64.2%)
Mutual labels:  spark, analytics, apache-spark
Spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
Stars: ✭ 1,721 (+310.74%)
Mutual labels:  spark, analytics, apache-spark
Live log analyzer spark
Spark application for analyzing Apache access logs and detecting anomalies, along with a Medium article.
Stars: ✭ 14 (-96.66%)
Mutual labels:  spark, analytics, apache-spark
Agile data code 2
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
Stars: ✭ 413 (-1.43%)
Mutual labels:  spark, analytics, apache-spark
spark-structured-streaming-examples
Spark Structured Streaming examples using version 3.0.0
Stars: ✭ 23 (-94.51%)
Mutual labels:  spark, apache-spark
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (-73.51%)
Mutual labels:  spark, apache-spark
Spark Notebook
Interactive and Reactive Data Science using Scala and Spark.
Stars: ✭ 3,081 (+635.32%)
Mutual labels:  spark, apache-spark
Coolplayspark
CoolplaySpark (酷玩 Spark): Spark source code analysis, Spark libraries, and more
Stars: ✭ 3,318 (+691.89%)
Mutual labels:  spark, apache-spark
Data Accelerator
Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.
Stars: ✭ 247 (-41.05%)
Mutual labels:  spark, apache-spark
Delta
An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads.
Stars: ✭ 3,903 (+831.5%)
Mutual labels:  spark, analytics
Clickhouse Native Jdbc
ClickHouse Native Protocol JDBC implementation
Stars: ✭ 310 (-26.01%)
Mutual labels:  spark, analytics
leaflet heatmap
A simple visualization of Huzhou call data. Assuming the data volume is too large to render a heatmap directly in the browser, the heatmap rendering step is moved offline for computation and analysis. Apache Spark computes the data in parallel and renders the heatmap, and leafletjs then loads the OpenStreetMap layer together with the heatmap layer for good interactivity. In the current Apache Spark implementation the parallel computation is slower than a single machine, possibly because Spark is not well suited to this kind of computation or the algorithm is not well designed. The Apache Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git .
Stars: ✭ 13 (-96.9%)
Mutual labels:  spark, apache-spark
spark-gradle-template
Apache Spark in your IDE with gradle
Stars: ✭ 39 (-90.69%)
Mutual labels:  spark, apache-spark
Spark Jupyter Aws
A guide on how to set up Jupyter with Pyspark painlessly on AWS EC2 clusters, with S3 I/O support
Stars: ✭ 259 (-38.19%)
Mutual labels:  spark, apache-spark
awesome-AI-kubernetes
❄️ 🐳 Awesome tools and libs for AI, Deep Learning, Machine Learning, Computer Vision, Data Science, Data Analytics and Cognitive Computing that are baked in the oven to be Native on Kubernetes and Docker with Python, R, Scala, Java, C#, Go, Julia, C++ etc
Stars: ✭ 95 (-77.33%)
Mutual labels:  spark, analytics
Learningsparkv2
This is the github repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
Stars: ✭ 307 (-26.73%)
Mutual labels:  spark, apache-spark
Sparkmeasure
This is the development repository of SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task metrics data.
Stars: ✭ 368 (-12.17%)
Mutual labels:  spark, apache-spark
Kyuubi
Kyuubi is a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark
Stars: ✭ 363 (-13.37%)
Mutual labels:  spark, analytics
Spark Structured Streaming Book
The Internals of Spark Structured Streaming
Stars: ✭ 371 (-11.46%)
Mutual labels:  spark, apache-spark
Mastering Spark Sql Book
The Internals of Spark SQL
Stars: ✭ 234 (-44.15%)
Mutual labels:  spark, apache-spark

sparkle: Apache Spark applications in Haskell

Build status CircleCI

sparkle [spär′kəl]: a library for writing resilient analytics applications in Haskell that scale to thousands of nodes, using Spark and the rest of the Apache ecosystem under the hood. See this blog post for the details.

Getting started

The tl;dr, using the hello app as an example on your local machine:

$ nix-shell --pure --run "bazel build //apps/hello:sparkle-example-hello_deploy.jar"
$ nix-shell --pure --run "bazel run spark-submit -- --packages com.amazonaws:aws-java-sdk:1.11.920,org.apache.hadoop:hadoop-aws:2.8.4 $PWD/bazel-bin/apps/hello/sparkle-example-hello_deploy.jar"

You'll need nix for the above to work.

How it works

sparkle is a tool for creating self-contained Spark applications in Haskell. Spark applications are typically distributed as JAR files, so that's what sparkle creates. We embed Haskell native object code as compiled by GHC in these JAR files, along with any shared library required by this object code to run. Spark dynamically loads this object code into its address space at runtime and interacts with it via the Java Native Interface (JNI).
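
For a sense of what such an application looks like in Haskell, here is a minimal sketch in the spirit of the bundled hello app. It assumes that Control.Distributed.Spark exposes getOrCreateSparkContext, parallelize and collect as used by the examples in the apps/ folder; treat it as illustrative rather than definitive.

{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Control.Distributed.Spark as Spark

main :: IO ()
main = do
    -- Configure and obtain a Spark context on the driver.
    conf <- Spark.newSparkConf "Hello sparkle!"
    sc   <- Spark.getOrCreateSparkContext conf
    -- Distribute a small dataset and collect it back to the driver.
    rdd  <- Spark.parallelize sc ["Hello", "World!"]
    Spark.collect rdd >>= print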

How to use

To run a Spark application, the process is as follows:

  1. create an application in the apps/ folder, in-repo or as a submodule;
  2. build the app;
  3. submit it to a local or cluster deployment of Spark.

If you run into issues, read the Troubleshooting section below first.

Build

Linux

Include the following in a BUILD.bazel file next to your source code.

package(default_visibility = ["//visibility:public"])

load(
  "@rules_haskell//haskell:defs.bzl",
  "haskell_binary",
)

load("@//:sparkle.bzl", "sparkle_package")

haskell_binary(
  name = "hello-hs",
  linkstatic = False,
  compiler_flags = ["-threaded", "-pie"],
  srcs = ...,
  deps = ...,
  ...
)

sparkle_package(
  name = "sparkle-example-hello",
  src = ":hello-hs",
)

Then ask Bazel to build a deploy JAR file.

$ nix-shell --pure --run "bazel build //apps/hello:sparkle-example-hello_deploy.jar"

Other platforms

sparkle builds on Mac OS X, but running it requires installing binaries for Spark and possibly Hadoop (see .circleci/config.yml).

Alternatively, on non-Linux platforms you can build and run sparkle via Docker, using a Docker image provisioned with Nix.

Integrating sparkle in another project

As sparkle interacts with the JVM, you need to tell GHC where the JVM-specific headers and libraries are. It needs to be able to locate jni.h, jni_md.h and libjvm.so.

sparkle uses inline-java to embed fragments of Java code in Haskell modules. This requires running the javac compiler, which must be available in the PATH of the shell. Moreover, javac needs to find the Spark classes that inline-java quotations refer to, so these classes need to be added to the CLASSPATH when building sparkle. Depending on your build system, how to do this might vary. In this repo, we use gradle to install Spark, and we query gradle to get the paths we need to add to the CLASSPATH.

Additionally, the classes need to be found at runtime to load them. The main thread can find them, but other threads need to invoke initializeSparkThread or runInSparkThread from Control.Distributed.Spark.

If the main function terminates with unhandled exceptions, they can be propagated to Spark with Control.Distributed.Spark.forwardUnhandledExceptionsToSpark. This allows Spark both to report the exception and to clean up before termination.
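
Putting these pieces together, a driver program might be structured as in the following sketch. The names forwardUnhandledExceptionsToSpark and initializeSparkThread come from Control.Distributed.Spark as described above; the rest of the code is illustrative only and assumes the same API as the example in "How it works".

{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import qualified Control.Distributed.Spark as Spark

main :: IO ()
main = Spark.forwardUnhandledExceptionsToSpark $ do
    conf <- Spark.newSparkConf "my sparkle app"
    sc   <- Spark.getOrCreateSparkContext conf
    done <- newEmptyMVar
    _ <- forkIO $ do
      -- Auxiliary threads must set their context class loader before
      -- making JNI calls that refer to Spark classes.
      Spark.initializeSparkThread
      rdd <- Spark.parallelize sc ["one", "two", "three"]
      Spark.collect rdd >>= print
      putMVar done ()
    takeMVar done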

Submit

Finally, to run your application, for example locally:

$ nix-shell --pure --run "bazel run spark-submit -- /path/to/$PWD/<app-target-name>_deploy.jar"

The <app-target-name> is the name of the Bazel target producing the jar file. See the apps in the apps/ folder for examples.

RTS options can be passed as a Java property

$ nix-shell --pure --run "bazel run spark-submit -- --driver-java-options=-Dghc_rts_opts='+RTS\ -s\ -RTS' <app-target-name>_deploy.jar"

or as command-line arguments

$ nix-shell --pure --run "bazel run spark-submit -- <app-target-name>_deploy.jar +RTS -s -RTS"

See here for other options, including launching a whole cluster from scratch on EC2. This blog post shows you how to get started on the Databricks hosted platform and on Amazon's Elastic MapReduce.

Troubleshooting

JNI calls in auxiliary threads fail with ClassNotFoundException

The context class loader of threads needs to be set appropriately before JNI calls can find classes in Spark. Calling initializeSparkThread or runInSparkThread from Control.Distributed.Spark should set it.
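
As a small sketch, work performed on an auxiliary thread can be wrapped so that the class loader is set before any Spark class is looked up. This assumes runInSparkThread has a type along the lines of IO a -> IO a; consult Control.Distributed.Spark for the authoritative signature.

import Control.Concurrent (forkIO)
import Control.Monad (void)
import qualified Control.Distributed.Spark as Spark

-- Fork a worker thread whose JNI calls can resolve Spark classes.
spawnSparkWorker :: IO () -> IO ()
spawnSparkWorker jniAction = void $ forkIO (Spark.runInSparkThread jniAction)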

Anonymous classes in inline-java quasiquotes fail to deserialize

When using inline-java, it is recommended to use the Kryo serializer, which is currently not the default in Spark but is faster anyway. If you don't use the Kryo serializer, objects of anonymous classes, which arise e.g. when using Java 8 function literals,

foo :: RDD Int -> IO (RDD Bool)
foo rdd = [java| $rdd.map((Integer x) -> x.equals(0)) |]

won't be deserialized properly in multi-node setups. To avoid this problem, switch to the Kryo serializer by setting the following configuration properties in your SparkConf:

do conf <- newSparkConf "some spark app"
   confSet conf "spark.serializer" "org.apache.spark.serializer.KryoSerializer"
   confSet conf "spark.kryo.registrator" "io.tweag.sparkle.kryo.InlineJavaRegistrator"

See #104 for more details.

java.lang.UnsatisfiedLinkError: /tmp/sparkle-app...: failed to map segment from shared object

sparkle unzips the Haskell binary program into a temporary location on the filesystem and then loads it from there. For loading to succeed, the temporary location must not be mounted with the noexec option. Alternatively, the temporary location can be changed with

spark-submit --driver-java-options="-Djava.io.tmpdir=..." \
             --conf "spark.executor.extraJavaOptions=-Djava.io.tmpdir=..."

java.io.IOException: No FileSystem for scheme: s3n

Spark 2.4 requires explicitly specifying extra JAR files to spark-submit in order to work with AWS. To work around this, add an additional --packages argument when submitting the job:

spark-submit --packages com.amazonaws:aws-java-sdk:1.11.920,org.apache.hadoop:hadoop-aws:2.8.4

License

Copyright (c) 2015-2016 EURL Tweag.

All rights reserved.

sparkle is free software, and may be redistributed under the terms specified in the LICENSE file.

Sponsors

Tweag I/O · LeapYear

sparkle is maintained by Tweag I/O.

Have questions? Need help? Tweet at @tweagio.
