
kakao / Cuesheet

License: Apache-2.0
A framework for writing Spark 2.x applications in a pretty way

Programming Languages

scala (5,932 projects)

Projects that are alternatives of or similar to Cuesheet

Live log analyzer spark
Spark application for analyzing Apache access logs and detecting anomalies, along with a Medium article.
Stars: ✭ 14 (-83.72%)
Mutual labels:  spark, apache-spark
Spark States
Custom state store providers for Apache Spark
Stars: ✭ 83 (-3.49%)
Mutual labels:  spark, apache-spark
Sparkmagic
Jupyter magics and kernels for working with remote Spark clusters
Stars: ✭ 954 (+1009.3%)
Mutual labels:  spark, magic
Mobius
C# and F# language binding and extensions to Apache Spark
Stars: ✭ 929 (+980.23%)
Mutual labels:  spark, apache-spark
Apache Spark Internals
The Internals of Apache Spark
Stars: ✭ 1,045 (+1115.12%)
Mutual labels:  spark, apache-spark
Bigdata Interview
🎯 🌟 [Big data interview questions] A collection of big-data interview questions gathered from around the web, together with my own summarized answers. Currently covers the Hadoop/Hive/Spark/Flink/HBase/Kafka/Zookeeper frameworks.
Stars: ✭ 857 (+896.51%)
Mutual labels:  spark, yarn
Real Time Stream Processing Engine
This is an example of real time stream processing using Spark Streaming, Kafka & Elasticsearch.
Stars: ✭ 37 (-56.98%)
Mutual labels:  spark, apache-spark
Sparkle
Haskell on Apache Spark.
Stars: ✭ 419 (+387.21%)
Mutual labels:  spark, apache-spark
Spark As Service Using Embedded Server
This application comes as Spark2.1-as-Service-Provider using an embedded, Reactive-Streams-based, fully asynchronous HTTP server
Stars: ✭ 46 (-46.51%)
Mutual labels:  spark, apache-spark
Spark Tda
SparkTDA is a package for Apache Spark providing Topological Data Analysis Functionalities.
Stars: ✭ 45 (-47.67%)
Mutual labels:  spark, apache-spark
Goodreads etl pipeline
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Stars: ✭ 793 (+822.09%)
Mutual labels:  spark, apache-spark
Pulsar Spark
When Apache Pulsar meets Apache Spark
Stars: ✭ 55 (-36.05%)
Mutual labels:  spark, apache-spark
Sparklyr
R interface for Apache Spark
Stars: ✭ 775 (+801.16%)
Mutual labels:  spark, apache-spark
Magic
CSS3 Animations with special effects
Stars: ✭ 7,253 (+8333.72%)
Mutual labels:  magic, yarn
Kafka Storm Starter
Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.
Stars: ✭ 728 (+746.51%)
Mutual labels:  spark, apache-spark
Spark Flamegraph
Easy CPU Profiling for Apache Spark applications
Stars: ✭ 30 (-65.12%)
Mutual labels:  spark, apache-spark
Enterprise gateway
A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
Stars: ✭ 412 (+379.07%)
Mutual labels:  spark, yarn
Agile data code 2
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
Stars: ✭ 413 (+380.23%)
Mutual labels:  spark, apache-spark
Spark Examples
Spark examples
Stars: ✭ 41 (-52.33%)
Mutual labels:  spark, apache-spark
Spark Nkp
Natural Korean Processor for Apache Spark
Stars: ✭ 50 (-41.86%)
Mutual labels:  spark, apache-spark

CueSheet

Gitter chat: https://gitter.im/kakao/cuesheet

CueSheet is a framework for writing Apache Spark 2.x applications more conveniently. It is designed to neatly separate the concerns of business logic and deployment environment, and to minimize the use of shell scripts, which are inconvenient to write and do not support validation. To jump-start, check out cuesheet-starter-kit, which provides the skeleton for building CueSheet applications. CueSheet was featured at Spark Summit East 2017.

An example of a CueSheet application is shown below. Any Scala object extending CueSheet becomes a CueSheet application; the object body can then use variables like sc, sqlContext, and spark to write the business logic, as if it were inside spark-shell:

import com.kakao.cuesheet.CueSheet

object Example extends CueSheet {{
  // sc is provided by CueSheet, just as in spark-shell; the doubled braces
  // put the body in a block, so rdd stays a local value rather than a field
  val rdd = sc.parallelize(1 to 100)
  println(s"sum = ${rdd.sum()}")              // 5050
  println(s"sum2 = ${rdd.map(_ + 1).sum()}")  // 5150
}}

CueSheet takes care of creating the SparkContext or SparkSession according to the configuration given in a separate file, so that your application code can contain just the business logic. Furthermore, CueSheet launches the application either locally or on a YARN cluster when you simply run your object as a Java application, eliminating the need for spark-submit and its accompanying shell scripts.

CueSheet also supports Spark Streaming applications via ssc. When ssc is used in the object body, the application automatically becomes a Spark Streaming application, with ssc providing access to the StreamingContext.
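
For illustration, a minimal streaming application might look like the sketch below. The socket source, host, and port are hypothetical; only the use of ssc is CueSheet-specific, and the StreamingContext lifecycle is presumably managed by CueSheet as described above:

import com.kakao.cuesheet.CueSheet

object StreamingExample extends CueSheet {{
  // read lines from a TCP socket; host and port are illustrative values
  val lines = ssc.socketTextStream("localhost", 9999)

  // word count over each batch, printed to the driver's stdout
  lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _).print()
}}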

Importing CueSheet

libraryDependencies += "com.kakao.cuesheet" %% "cuesheet" % "0.10.0"

CueSheet can be used in Scala projects by configuring SBT as above. Note that this dependency is not specified as "provided", which makes it possible to launch the application right in the IDE, and even debug using breakpoints in driver code when launched in client mode.
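
For context, a minimal build.sbt might look like the following sketch; the project name and Scala version are illustrative assumptions, so pick a Scala version that matches your Spark build:

// a minimal build.sbt sketch; everything except the cuesheet
// dependency itself is an illustrative assumption
name := "my-cuesheet-app"

scalaVersion := "2.11.8"

libraryDependencies += "com.kakao.cuesheet" %% "cuesheet" % "0.10.0"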

Configuration

Configurations for your CueSheet application, including Spark configurations and spark-submit arguments, are specified in the HOCON format. By default, the configuration is read from application.conf at your classpath root, but an alternate configuration file can be specified using -Dconfig.resource or -Dconfig.file. Below is an example configuration file.

spark {
  master = "yarn:classpath:com.kakao.cuesheet.launcher.test"
  deploy.mode = cluster

  hadoop.user.name = "cloudera"

  executor.instances = 2
  executor.memory = 1g
  driver.memory = 1g

  streaming.blockInterval = 10000
  eventLog.enabled = false
  eventLog.dir = "hdfs:///user/spark/applicationHistory"
  yarn.historyServer.address = "http://history.server:18088"

  driver.extraJavaOptions = "-XX:MaxPermSize=512m"
}

Unlike the standard Spark configuration, spark.master for YARN should include an indicator for finding the YARN/Hive/Hadoop configurations. The easiest way is to put the XML files on your classpath, usually under src/main/resources, and specify the package as the classpath indicator, as above. Alternatively, spark.master can contain a URL for downloading the configuration as a ZIP file, e.g. yarn:http://cloudera.manager/hive/configuration.zip, copied from Cloudera Manager's 'Download Client Configuration' link. The usual local or local[8] can also be used as spark.master.
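
To summarize, all three forms of spark.master mentioned above are valid; the values in this sketch are the examples from this document:

spark {
  # YARN, reading the XML configurations from this package on the classpath
  master = "yarn:classpath:com.kakao.cuesheet.launcher.test"

  # YARN, downloading the client configurations as a ZIP file
  # master = "yarn:http://cloudera.manager/hive/configuration.zip"

  # plain local mode
  # master = "local[8]"
}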

deploy.mode can be either client or cluster, and spark.hadoop.user.name should be the username to use as the Hadoop user. CueSheet assumes that this user has write permission to its home directory on HDFS.

Using HDFS

When submitting an application to YARN, CueSheet copies Spark's and CueSheet's dependency jars to HDFS. The next time you submit your application, CueSheet analyzes your classpath and assembles only the classes that are not already part of the installed jars.

One-Liner for Easy Deployment

When given a tag name via the system property cuesheet.install, CueSheet prints a rather long shell command that can launch your application from anywhere the hdfs command is available. Below is an example of the one-liner shell command that CueSheet produces when given -Dcuesheet.install=v0.0.1 as a JVM argument.

rm -rf SimpleExample_2.10-v0.0.1 && mkdir SimpleExample_2.10-v0.0.1 && cd SimpleExample_2.10-v0.0.1 &&
echo '<configuration><property><name>dfs.ha.automatic-failover.enabled</name><value>false</value></property><property><name>fs.defaultFS</name><value>hdfs://quickstart.cloudera:8020</value></property></configuration>' > core-site.xml &&
hdfs --config . dfs -get hdfs:///user/cloudera/.cuesheet/applications/com.kakao.cuesheet.SimpleExample/v0.0.1/SimpleExample_2.10.jar \!SimpleExample_2.10.jar &&
hdfs --config . dfs -get hdfs:///user/cloudera/.cuesheet/lib/0.10.0-SNAPSHOT-scala-2.10-spark-2.1.0/*.jar &&
java -classpath "*" com.kakao.cuesheet.SimpleExample "hello" "world" && cd .. && rm -rf SimpleExample_2.10-v0.0.1

This command downloads the CueSheet and Spark jars, as well as your application assembly, from HDFS, and launches the application in the same environment as when it was launched in the IDE. This way, HADOOP_CONF_DIR and SPARK_HOME do not need to be installed and set on every node, making it much easier to use distributed schedulers like Marathon, Chronos, or Aurora. These schedulers typically accept a single-line shell command as their job specification, so you can simply paste what CueSheet gives you into the scheduler's web UI.

Additional Features

Having started as a library of reusable Spark functions, CueSheet contains a number of additional features, though not in an extremely coherent manner. Many parts of CueSheet, including these features, are powered by the Mango library, another open-source project by Kakao.

One additional quirk is the "stop" tab that CueSheet adds to the Spark UI. As shown below, it features three buttons with an increasing degree of seriousness. To stop a Spark Streaming application, possibly triggering a restart by a scheduler like Marathon, either of the two left buttons will do the job. If you need to halt a Spark application ASAP, the red button immediately kills the Spark driver.

[Stop Tab screenshot]

License

This software is licensed under the Apache 2 license, quoted below.

Copyright 2017 Kakao Corp. http://www.kakaocorp.com

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this project except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
