polomarcus / Spark Structured Streaming Examples

Licence: apache-2.0
Spark Structured Streaming / Kafka / Cassandra / Elastic

Programming Languages

scala


Kafka / Cassandra / Elastic with Spark Structured Streaming

Stream the number of times Drake is broadcast on each radio station, and see how easy Spark Structured Streaming is to use via Spark SQL's DataFrame API.

Run the Project

Step 1 - Start containers

Start the ZooKeeper, Kafka, and Cassandra containers in detached mode (-d):

./start-docker-compose.sh

It runs these two commands together so you don't have to:

docker-compose up -d
# create Cassandra schema
docker-compose exec cassandra cqlsh -f /schema.cql;

# confirm schema
docker-compose exec cassandra cqlsh -e "DESCRIBE SCHEMA;"

Step 2 - Start Spark Structured Streaming

sbt run

Re-run the project

Because checkpointing gives us exactly-once processing, the checkpoint folder has to be deleted to re-run the examples:

rm -rf checkpoint/
sbt run

Monitor

docker-compose exec kafka  \
 kafka-console-consumer --bootstrap-server localhost:9092 --topic test --from-beginning

Examples:

{"radio":"nova","artist":"Drake","title":"From Time","count":18}
{"radio":"nova","artist":"Drake","title":"4pm In Calabasas","count":1}
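The counts above come from grouping plays by radio, artist, and title. A plain-Scala illustration of that aggregation (the case class and sample data here are hypothetical, mirroring the JSON shape above, not code from the project):

```scala
// Hypothetical record shape matching the JSON messages above
case class Song(radio: String, artist: String, title: String)

val plays = Seq(
  Song("nova", "Drake", "From Time"),
  Song("nova", "Drake", "From Time"),
  Song("nova", "Drake", "4pm In Calabasas")
)

// Batch equivalent of the streaming groupBy(...).count() the job performs
val counts: Map[Song, Int] =
  plays.groupBy(identity).map { case (song, occurrences) => song -> occurrences.size }

// counts(Song("nova", "Drake", "From Time")) == 2
```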

Requirements

Linux

curl -L https://github.com/docker/compose/releases/download/1.17.1/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose

MacOS

brew install docker-compose

Input data

The input data comes from radio stations and is stored in a parquet file; the stream is emulated with the .option("maxFilesPerTrigger", 1) option.

The stream is then read and sunk into Kafka, and from Kafka into Cassandra.
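The parquet-to-Kafka leg can be sketched as follows. This is a sketch, not the project's actual code: the input path, schema, and topic options are assumptions based on the description above (note that streaming file sources normally require an explicit schema):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("radio-to-kafka").master("local[*]").getOrCreate()

// Assumed schema, matching the JSON messages shown in the Monitor section
val radioSchema = StructType(Seq(
  StructField("radio", StringType),
  StructField("artist", StringType),
  StructField("title", StringType)
))

val radioEvents = spark.readStream
  .schema(radioSchema)
  .option("maxFilesPerTrigger", 1) // emulate a stream: one parquet file per micro-batch
  .parquet("input/")               // hypothetical input path

radioEvents
  .selectExpr("to_json(struct(*)) AS value") // the Kafka sink expects a `value` column
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "test")
  .option("checkpointLocation", "checkpoint/kafka")
  .start()
```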

Output data

The data is stored in both Kafka and Cassandra for example purposes only. The Cassandra sinks use the ForeachWriter and the StreamSinkProvider, so the two approaches can be compared.

One uses DataStax's saveToCassandra method. The other, messier and untyped, executes CQL in a custom foreach loop.

From Spark's doc about batch duration:

Trigger interval: Optionally, specify the trigger interval. If it is not specified, the system will check for availability of new data as soon as the previous processing has completed. If a trigger time is missed because the previous processing has not completed, then the system will attempt to trigger at the next trigger point, not immediately after the processing has completed.
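The trigger interval described above can be set explicitly on a writeStream. A minimal sketch, using Spark's built-in "rate" source for demonstration (the 10-second interval is arbitrary, not a value from the project):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("trigger-example").master("local[*]").getOrCreate()

// The built-in "rate" source generates rows continuously, handy for demonstrating triggers
val stream = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

// Fire a micro-batch every 10 seconds instead of as soon as the previous one finishes
stream.writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .format("console")
  .start()
```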

Kafka topic

One topic, test, with only one partition.

List all topics

docker-compose exec kafka  \
  kafka-topics --list --zookeeper zookeeper:32181

Send a message to be processed

docker-compose exec kafka  \
 kafka-console-producer --broker-list localhost:9092 --topic test

> {"radio":"skyrock","artist":"Drake","title":"Hold On We’Re Going Home","count":38}

Cassandra Table

There are three tables: two used as sinks, and one to save Kafka metadata. Have a look at schema.cql for all the details.

 docker-compose exec cassandra cqlsh -e "SELECT * FROM structuredstreaming.radioOtherSink;"

 radio   | title                    | artist | count
---------+--------------------------+--------+-------
 skyrock |                Controlla |  Drake |     1
 skyrock |                Fake Love |  Drake |     9
 skyrock | Hold On We’Re Going Home |  Drake |    35
 skyrock |            Hotline Bling |  Drake |  1052
 skyrock |  Started From The Bottom |  Drake |    39
    nova |         4pm In Calabasas |  Drake |     1
    nova |             Feel No Ways |  Drake |     2
    nova |                From Time |  Drake |    34
    nova |                     Hype |  Drake |     2

Kafka Metadata

@TODO Verify the information below. Cf. this SO comment.

When doing an application upgrade, we cannot use checkpointing, so we need to store our offsets in an external datasource; here Cassandra is chosen. Then, when starting our Kafka source, we need to set the "startingOffsets" option to a JSON string like:

""" {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """

Learn more in Spark's official Kafka integration documentation.

If there is no Kafka metadata stored in Cassandra, earliest is used.
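Building that JSON string from offsets read back out of Cassandra is plain string formatting. A hypothetical helper (not from the project) that also implements the earliest fallback:

```scala
// Build the "startingOffsets" value for the Kafka source from stored offsets.
// An empty map means no metadata was saved, so we fall back to "earliest".
def startingOffsets(topic: String, offsets: Map[Int, Long]): String =
  if (offsets.isEmpty) "earliest"
  else {
    val partitions = offsets.toSeq.sortBy(_._1)
      .map { case (partition, offset) => s""""$partition":$offset""" }
      .mkString(",")
    s"""{"$topic":{$partitions}}"""
  }

// startingOffsets("test", Map(0 -> 171L)) == """{"test":{"0":171}}"""
// startingOffsets("test", Map.empty)      == "earliest"
```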

docker-compose exec cassandra cqlsh -e "SELECT * FROM structuredstreaming.kafkametadata;"
 partition | offset
-----------+--------
         0 |    171

Useful links

Docker-compose

Inspired by
