
tupol / spark-utils

Licence: MIT
Basic framework utilities to quickly start writing production ready Apache Spark applications

Programming Languages

scala

Projects that are alternatives of or similar to spark-utils

Spark Streaming Monitoring With Lightning
Plot live-stats as graph from ApacheSpark application using Lightning-viz
Stars: ✭ 15 (-40%)
Mutual labels:  apache-spark, spark-streaming
Mobius
C# and F# language binding and extensions to Apache Spark
Stars: ✭ 929 (+3616%)
Mutual labels:  apache-spark, spark-streaming
Coolplayspark
Coolplay Spark: Spark source code analysis, Spark libraries, and more
Stars: ✭ 3,318 (+13172%)
Mutual labels:  apache-spark, spark-streaming
Bigdata Playground
A complete example of a big data application using: Kubernetes (kops/AWS), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL
Stars: ✭ 177 (+608%)
Mutual labels:  apache-spark, spark-streaming
Spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
Stars: ✭ 1,721 (+6784%)
Mutual labels:  apache-spark, spark-streaming
Real Time Stream Processing Engine
This is an example of real time stream processing using Spark Streaming, Kafka & Elasticsearch.
Stars: ✭ 37 (+48%)
Mutual labels:  apache-spark, spark-streaming
Streaming Readings
Papers and readings related to streaming systems
Stars: ✭ 554 (+2116%)
Mutual labels:  apache-spark, spark-streaming
Spark States
Custom state store providers for Apache Spark
Stars: ✭ 83 (+232%)
Mutual labels:  apache-spark, spark-streaming
Azure Event Hubs Spark
Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs
Stars: ✭ 140 (+460%)
Mutual labels:  apache-spark, spark-streaming
Data Accelerator
Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.
Stars: ✭ 247 (+888%)
Mutual labels:  apache-spark, spark-streaming
BigInsights-on-Apache-Hadoop
Example projects for 'BigInsights for Apache Hadoop' on IBM Bluemix
Stars: ✭ 21 (-16%)
Mutual labels:  spark-streaming
BigCLAM-ApacheSpark
Overlapping community detection in Large-Scale Networks using BigCLAM model build on Apache Spark
Stars: ✭ 40 (+60%)
Mutual labels:  apache-spark
hyperdrive
Extensible streaming ingestion pipeline on top of Apache Spark
Stars: ✭ 31 (+24%)
Mutual labels:  apache-spark
SparkTwitterAnalysis
An Apache Spark standalone application using the Spark API in Scala. The application uses the Simple Build Tool (SBT) for building the project.
Stars: ✭ 29 (+16%)
Mutual labels:  apache-spark
net.jgp.books.spark.ch07
Spark in Action, 2nd edition - chapter 7 - Ingestion from files
Stars: ✭ 13 (-48%)
Mutual labels:  apache-spark
jupyterlab-sparkmonitor
JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook
Stars: ✭ 78 (+212%)
Mutual labels:  apache-spark
bitnami-docker-spark
Bitnami Docker Image for Apache Spark
Stars: ✭ 239 (+856%)
Mutual labels:  spark-streaming
Sparkora
Powerful rapid automatic EDA and feature engineering library with a very easy to use API 🌟
Stars: ✭ 51 (+104%)
Mutual labels:  apache-spark
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (+56%)
Mutual labels:  apache-spark
parquet-dotnet
🐬 Apache Parquet for modern .Net
Stars: ✭ 199 (+696%)
Mutual labels:  apache-spark

Spark Utils

Badges: Maven Central · GitHub · Travis CI · Codecov · Javadocs · Gitter · Twitter

Motivation

One of the biggest challenges after taking the first steps into the world of writing Apache Spark applications in Scala is taking them to production.

An application of any kind needs to be easy to run and easy to configure.

This project tries to help developers write Spark applications by letting them focus mainly on the application logic rather than on the details of configuring the application and setting up the Spark context.

This project is also trying to create and encourage a friendly yet professional environment for developers to help each other, so please do not be shy and join in through Gitter, Twitter, issue reports or pull requests.

Description

This project contains some basic utilities that can help with setting up a Spark application project.

The main point is the simplicity of writing Apache Spark applications that focus on the logic alone, while the framework provides easy configuration and argument passing.

The code sample below shows how easy it can be to write a file format converter from any supported input type, with any supported parsing configuration options, to any supported output format.

object FormatConverterExample extends SparkApp[FormatConverterContext, DataFrame] {
  // Build the application context (input and output configuration) from the given Typesafe Config
  override def createContext(config: Config) = FormatConverterContext(config)
  // Read from the configured source and write the result to the configured sink
  override def run(implicit spark: SparkSession, context: FormatConverterContext): Try[DataFrame] = {
    val inputData = spark.source(context.input).read
    inputData.sink(context.output).write
  }
}
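
For readers unfamiliar with the source and sink decorators, the snippet below sketches roughly what the two calls above abstract over in plain Spark, assuming a hypothetical CSV input and Parquet output; in the real application the formats, options and paths come from the configuration.

// Illustrative only: a plain-Spark equivalent of spark.source(...).read and
// inputData.sink(...).write, with hypothetical formats, options and paths.
val inputData = spark.read
  .format("csv")
  .option("header", "true")
  .load("data/input")

inputData.write
  .format("parquet")
  .save("data/output")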

Creating the configuration can be as simple as defining a case class to hold it and a factory that helps extract both simple and complex data types, like input sources and output sinks.

case class FormatConverterContext(input: FormatAwareDataSourceConfiguration,
                                  output: FormatAwareDataSinkConfiguration)

object FormatConverterContext extends Configurator[FormatConverterContext] {
  import com.typesafe.config.Config
  import scalaz.ValidationNel

  def validationNel(config: Config): ValidationNel[Throwable, FormatConverterContext] = {
    import scalaz.syntax.applicative._
    // Extract the source and sink configurations, accumulating any validation errors
    config.extract[FormatAwareDataSourceConfiguration]("input") |@|
      config.extract[FormatAwareDataSinkConfiguration]("output") apply
      FormatConverterContext.apply
  }
}

Optionally, SparkFun can be used instead of SparkApp to make the code even more concise.

object FormatConverterExample extends 
          SparkFun[FormatConverterContext, DataFrame](FormatConverterContext(_).get) {
  override def run(implicit spark: SparkSession, context: FormatConverterContext): Try[DataFrame] = 
    spark.source(context.input).read.sink(context.output).write
}

For structured streaming applications the format converter might look like this:

object StreamingFormatConverterExample extends SparkApp[StreamingFormatConverterContext, DataFrame] {
  override def createContext(config: Config) = StreamingFormatConverterContext(config).get
  override def run(implicit spark: SparkSession, context: StreamingFormatConverterContext): Try[DataFrame] = {
    val inputData = spark.source(context.input).read
    // Write to the streaming sink and block until the streaming query terminates
    inputData.streamingSink(context.output).write.awaitTermination()
  }
}

The streaming configuration can be as simple as the following:

case class StreamingFormatConverterContext(input: FormatAwareStreamingSourceConfiguration,
                                           output: FormatAwareStreamingSinkConfiguration)

object StreamingFormatConverterContext extends Configurator[StreamingFormatConverterContext] {
  import com.typesafe.config.Config
  import scalaz.ValidationNel
  import scalaz.syntax.applicative._

  def validationNel(config: Config): ValidationNel[Throwable, StreamingFormatConverterContext] = {
    config.extract[FormatAwareStreamingSourceConfiguration]("input") |@|
      config.extract[FormatAwareStreamingSinkConfiguration]("output") apply
      StreamingFormatConverterContext.apply
  }
}

SparkRunnable, SparkApp and SparkFun, together with the configuration framework, make it easy to create Spark applications whose configuration can be managed through configuration files or application parameters.
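
As an illustration of the configuration-file side, the converter above could be driven by a small HOCON configuration like the one below. The exact keys accepted under input and output (format, path, per-format options and so on) depend on the configured source and sink types and on the spark-utils version, so treat the names here as assumptions.

import com.typesafe.config.ConfigFactory

// Hypothetical configuration for FormatConverterExample; the key names under
// "input" and "output" are assumptions and may vary with the library version.
val config = ConfigFactory.parseString(
  """
    |input.format  = "csv"
    |input.path    = "data/input"
    |output.format = "parquet"
    |output.path   = "data/output"
  """.stripMargin)

// The Configurator extracts the typed source and sink configurations,
// accumulating all validation errors rather than failing on the first one.
val context = FormatConverterContext(config)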

The IO frameworks for reading and writing data frames add extra convenience for setting up batch and structured streaming jobs that transform various types of files and streams.

Last but not least, there are many utility functions that provide convenience for loading resources, dealing with schemas and so on.
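
As a rough sketch of what such a schema utility can look like, the helper below loads a StructType from a JSON schema file on the classpath using plain Spark APIs only; it is not the spark-utils API itself, whose actual function names and signatures are documented in the API docs.

import scala.io.Source
import scala.util.Try
import org.apache.spark.sql.types.{DataType, StructType}

// Illustrative sketch: load a StructType from a JSON schema resource on the classpath.
// spark-utils ships its own schema utilities; names and signatures may differ.
def loadSchemaFromResource(resource: String): Try[StructType] =
  Try {
    val json = Source.fromInputStream(getClass.getResourceAsStream(resource)).mkString
    DataType.fromJson(json).asInstanceOf[StructType]
  }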

Most of the common features are also implemented as decorators of the main Spark classes, like SparkContext, DataFrame and StructType, and they are conveniently available by importing the org.tupol.spark.implicits._ package.

Documentation

The documentation for the main utilities and frameworks is available:

Latest stable API documentation is available here.

An extensive tutorial and walk-through can be found here. Extensive samples and demos can be found here.

A nice example of how this library can be used can be found in the spark-tools project, through the implementation of a generic format converter and a SQL processor for both batch and structured streams.

Prerequisites

  • Java 8 or higher
  • Scala 2.12
  • Apache Spark 3.0.X

Getting Spark Utils

Spark Utils is published to Maven Central and Spark Packages:

  • Group id / organization: org.tupol
  • Artifact id / name: spark-utils
  • Latest stable versions:
    • Spark 2.4: 0.4.2
    • Spark 3.0: 0.6.1

Usage with SBT: add a dependency on the latest version of spark-utils to your sbt build definition file:

libraryDependencies += "org.tupol" %% "spark-utils" % "0.6.2"

Include this package in your Spark applications using spark-shell or spark-submit:

$SPARK_HOME/bin/spark-shell --packages org.tupol:spark-utils_2.12:0.4.2

Starting a New spark-utils Project

The simplest way to start a new spark-utils project is to make use of the spark-apps.seed.g8 template project.

To fill in the project options manually, run

g8 tupol/spark-apps.seed.g8

The default options look like the following:

name [My Project]:
appname [My First App]:
organization [my.org]:
version [0.0.1-SNAPSHOT]:
package [my.org.my_project]:
classname [MyFirstApp]:
scriptname [my-first-app]:
scalaVersion [2.11.12]:
sparkVersion [2.4.0]:
sparkUtilsVersion [0.4.0]:

To fill in the options in advance:

g8 tupol/spark-apps.seed.g8 --name="My Project" --appname="My App" --organization="my.org" --force

What's new?

0.6.2

  • Fixed core dependency to scala-utils; now using scala-utils-core
  • Refactored the core/implicits package to make the implicits a little more explicit

For previous versions please consult the release notes.

License

This code is open source software licensed under the MIT License.
