
chermenin / Spark States

License: apache-2.0
Custom state store providers for Apache Spark

Programming Languages

scala
5932 projects

Projects that are alternatives of or similar to Spark States

Azure Event Hubs Spark
Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs
Stars: ✭ 140 (+68.67%)
Mutual labels:  spark, apache-spark, spark-streaming, apache
Real Time Stream Processing Engine
This is an example of real-time stream processing using Spark Streaming, Kafka & Elasticsearch.
Stars: ✭ 37 (-55.42%)
Mutual labels:  spark, apache-spark, spark-streaming
Spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
Stars: ✭ 1,721 (+1973.49%)
Mutual labels:  spark, apache-spark, spark-streaming
Mobius
C# and F# language binding and extensions to Apache Spark
Stars: ✭ 929 (+1019.28%)
Mutual labels:  spark, apache-spark, spark-streaming
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+80.72%)
Mutual labels:  spark, apache-spark, apache
Coolplayspark
CoolplaySpark (酷玩 Spark): Spark source code analysis, Spark libraries, and more
Stars: ✭ 3,318 (+3897.59%)
Mutual labels:  spark, apache-spark, spark-streaming
Data Accelerator
Data Accelerator for Apache Spark simplifies onboarding to streaming big data. It offers a rich, easy-to-use experience for creating, editing, and managing Spark jobs on Azure HDInsight or Databricks while enabling the full power of the Spark engine.
Stars: ✭ 247 (+197.59%)
Mutual labels:  spark, apache-spark, spark-streaming
Pyspark Examples
Code examples on Apache Spark using Python
Stars: ✭ 58 (-30.12%)
Mutual labels:  spark, spark-streaming, apache
Goodreads etl pipeline
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Stars: ✭ 793 (+855.42%)
Mutual labels:  spark, apache-spark
Live log analyzer spark
Spark application for analyzing Apache access logs and detecting anomalies, along with a Medium article.
Stars: ✭ 14 (-83.13%)
Mutual labels:  spark, apache-spark
Spark Flamegraph
Easy CPU Profiling for Apache Spark applications
Stars: ✭ 30 (-63.86%)
Mutual labels:  spark, apache-spark
Sparklyr
R interface for Apache Spark
Stars: ✭ 775 (+833.73%)
Mutual labels:  spark, apache-spark
Spark Examples
Spark examples
Stars: ✭ 41 (-50.6%)
Mutual labels:  spark, apache-spark
Spark Streaming Monitoring With Lightning
Plot live stats as graphs from an Apache Spark application using Lightning-viz
Stars: ✭ 15 (-81.93%)
Mutual labels:  apache-spark, spark-streaming
Angel
A Flexible and Powerful Parameter Server for large-scale machine learning
Stars: ✭ 6,458 (+7680.72%)
Mutual labels:  spark, spark-streaming
Kafka Storm Starter
Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, using Apache Avro as the data serialization format.
Stars: ✭ 728 (+777.11%)
Mutual labels:  spark, apache-spark
Learning Spark
Learning Spark from scratch; big data learning
Stars: ✭ 37 (-55.42%)
Mutual labels:  spark, spark-streaming
Spark Tda
SparkTDA is a package for Apache Spark providing topological data analysis functionality.
Stars: ✭ 45 (-45.78%)
Mutual labels:  spark, apache-spark
Spark Nkp
Natural Korean Processor for Apache Spark
Stars: ✭ 50 (-39.76%)
Mutual labels:  spark, apache-spark
Apache Spark Internals
The Internals of Apache Spark
Stars: ✭ 1,045 (+1159.04%)
Mutual labels:  spark, apache-spark

Custom state store providers for Apache Spark


State management extensions for Apache Spark to keep data across micro-batches during stateful stream processing.

Motivation

Out of the box, Apache Spark has only one state store provider implementation: HDFSBackedStateStoreProvider, which keeps all of the state data in memory, a very memory-consuming approach. This repository provides custom state store providers to help avoid OutOfMemory errors.

Usage

To use the custom state store provider in your pipelines, add the following configuration to the submit script or SparkConf:

--conf spark.sql.streaming.stateStore.providerClass="ru.chermenin.spark.sql.execution.streaming.state.RocksDbStateStoreProvider"

Here is some more information about it: https://docs.databricks.com/spark/latest/structured-streaming/production.html
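
Equivalently, the same property can be set when building the session. A minimal sketch, where the master and app name are illustrative assumptions:

import org.apache.spark.sql.SparkSession

// Sketch: set the provider class programmatically instead of via the submit script
val spark = SparkSession.builder()
  .master("local[*]")                 // illustrative master
  .appName("CustomStateStoreDemo")    // illustrative app name
  .config("spark.sql.streaming.stateStore.providerClass",
    "ru.chermenin.spark.sql.execution.streaming.state.RocksDbStateStoreProvider")
  .getOrCreate()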

Alternatively, you can use the useRocksDBStateStore() helper method in your application while creating the SparkSession:

import ru.chermenin.spark.sql.execution.streaming.state.implicits._

val spark = SparkSession.builder().master(...).useRocksDBStateStore().getOrCreate()

Note: For the helper methods to be available, you must import the implicits as shown above.
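
For context, here is a fuller sketch combining the helper with a simple stateful aggregation; the socket source, host, and port are illustrative assumptions, not part of this project:

import org.apache.spark.sql.SparkSession
import ru.chermenin.spark.sql.execution.streaming.state.implicits._

// Build a session backed by the RocksDB state store via the helper
val spark = SparkSession.builder()
  .master("local[*]")                 // illustrative master
  .appName("RocksDbStateStoreDemo")   // illustrative app name
  .useRocksDBStateStore()
  .getOrCreate()

import spark.implicits._

// A stateful query: running word counts are kept in the state store across micro-batches
val counts = spark.readStream
  .format("socket")                   // illustrative source
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .as[String]
  .flatMap(_.split("\\s+"))
  .groupBy("value")
  .count()

counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()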

State Timeout

With semantics similar to those of GroupState/FlatMapGroupsWithState, state timeout features have been built directly into the custom state store.

Important points to note when using state timeouts:

  • Timeouts can be set differently for each streaming query. This relies on queryName and its checkpointLocation.
  • The trigger interval set on a streaming query is independent of the state expiration and may be set to a different value.
  • Timeouts are currently based on processing time.
  • The timeout will occur once a fixed duration has elapsed after the entry's creation, after the most recent replacement (update) of its value, or after its last access (see the timeline sketch after this list).
  • Unlike GroupState, the timeout is not eventual, as it is independent of query progress.
  • Since the processing-time timeout is based on clock time, it is affected by variations in the system clock (e.g. time zone changes, clock skew).
  • Timeouts may optionally be set to strict expiration at a slight cost in memory; see spark.sql.streaming.stateStore.strictExpire below.
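
As a hedged illustration of these rules, the timestamps and the 5-second expiry below are hypothetical:

// Hypothetical timeline for a single state entry,
// assuming spark.sql.streaming.stateStore.stateExpirySecs = 5:
//
// t = 0s    entry created               -> due to expire at t = 5s
// t = 3s    entry value updated         -> expiry pushed out to t = 8s
// t = 7s    entry read (last access)    -> expiry pushed out to t = 12s
// t = 12s   untouched since t = 7s      -> entry times out and is evicted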

There are two different ways to configure the state timeout:

  1. Via additional configuration on SparkConf:

    To set a processing-time timeout for all streaming queries in strict mode:

    --conf spark.sql.streaming.stateStore.stateExpirySecs=5
    --conf spark.sql.streaming.stateStore.strictExpire=true
    

    To configure the state timeout differently for each query, the configs above can be modified to:

    --conf spark.sql.streaming.stateStore.stateExpirySecs.queryName1=5
    --conf spark.sql.streaming.stateStore.stateExpirySecs.queryName2=10
        ...
        ...
    --conf spark.sql.streaming.stateStore.strictExpire=true
    
  2. Via the stateTimeout() helper method (recommended):

    import ru.chermenin.spark.sql.execution.streaming.state.implicits._
    
    val spark: SparkSession = ...
    val streamingDF: DataFrame = ...
    
    streamingDF.writeStream
          .format(...)
          .outputMode(...)
          .trigger(Trigger.ProcessingTime(1000L))
          .queryName("myQuery1")
          .option("checkpointLocation", "chkpntloc")
          .stateTimeout(spark.conf, expirySecs = 5)
          .start()
    
    spark.streams.awaitAnyTermination()
    

    Preferably, the queryName and checkpointLocation can be set directly via the stateTimeout() method, as below:

    streamingDF.writeStream
          .format(...)
          .outputMode(...)
          .trigger(Trigger.ProcessingTime(1000L))
          .stateTimeout(spark.conf, queryName = "myQuery1", expirySecs = 5, checkpointLocation = "chkpntloc")
          .start()
    

Note: If queryName is invalid or unavailable, the streaming query will be tagged as UNNAMED, and the applicable timeout will be the value of spark.sql.streaming.stateStore.stateExpirySecs (which defaults to -1, but can be overridden via SparkConf).

Other state-timeout-related points (applicable at both the global and query level):

  • For no timeout, i.e. infinite state, set spark.sql.streaming.stateStore.stateExpirySecs=-1
  • For stateless processing, i.e. no state, set spark.sql.streaming.stateStore.stateExpirySecs=0

Contributing

You're welcome to submit pull requests with any changes for this repository at any time. I'll be very glad to see any contributions.

License

The standard Apache 2.0 license is used for this project.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].