Self-contained examples of Spark streaming integrated with Kafka

The goal of this project is to make it easy to experiment with Spark Streaming based on Kafka, by creating examples that run against an embedded Kafka server and an embedded Spark instance. Of course, in making everything easy to work with we also make it perform poorly. It would be a really bad idea to try to learn anything about performance from this project: it's all about functionality, although we sometimes get insight into performance issues by understanding the way the code interacts with RDD partitioning in Spark and topic partitioning in Kafka.

Related projects

This project is derived from the LearningSpark project which explores the full range of Spark APIs from the viewpoint of Scala developers. There is a corresponding, but much less comprehensive Java version at learning-spark-with-java.

The spark-data-sources project is focused on the new experimental APIs introduced in Spark 2.3.0 for developing adapters for external data sources of various kinds. This API is essentially a Java API (developed in Java) to avoid forcing developers to adopt Scala for their data source adapters. Consequently, the example data sources in this project are written in Java, but both Java and Scala usage examples are provided.

Dependencies

The project was created with IntelliJ IDEA 14 Community Edition. It is known to work with JDK 1.8, Scala 2.11.12, and Spark 2.3.0 (with its Kafka 0.10 shim library) on Ubuntu Linux.

It uses the Direct DStream package spark-streaming-kafka-0-10 for Spark Streaming integration with Kafka 0.10.0.1. The details behind this are explained in the Spark 2.3.0 documentation.

Note that, with the release of Spark 2.3.0, the formerly stable Receiver DStream APIs are now deprecated, and the formerly experimental Direct DStream APIs are now stable.
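
For orientation, the core receiving pattern used throughout these examples is the direct stream from spark-streaming-kafka-0-10. A minimal sketch follows; the broker address, topic name and group id are illustrative placeholders rather than values taken from this project:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.kafka.common.serialization.StringDeserializer

// Broker address, topic name and group id are placeholders for illustration
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val conf = new SparkConf().setAppName("SimpleStreamingSketch").setMaster("local[4]")
val ssc = new StreamingContext(conf, Seconds(5))

// Each batch interval produces an RDD of ConsumerRecord[String, String]
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("exampleTopic"), kafkaParams)
)

stream.foreachRDD { rdd =>
  println(s"*** got an RDD, size = ${rdd.count()}")
}

ssc.start()
ssc.awaitTermination()
```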

Using the deprecated (Receiver DStream) Kafka 0.8.0 APIs

I've kept the examples for the older (formerly stable, now deprecated) Kafka 0.8 integration on the kafka0.8 branch.

Structured Streaming

There's a separate set of examples for Kafka integration with the new Structured Streaming features (mainstream as of Spark 2.2).
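
For comparison, the Structured Streaming equivalent drops DStreams entirely and reads Kafka through the spark-sql-kafka-0-10 source. A minimal sketch, with placeholder connection details:

```scala
import org.apache.spark.sql.SparkSession

// Requires the spark-sql-kafka-0-10 package on the classpath
val spark = SparkSession.builder
  .master("local[*]")
  .appName("StructuredStreamingSketch")
  .getOrCreate()

// Subscribe to a Kafka topic as an unbounded DataFrame (broker/topic are placeholders)
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "exampleTopic")
  .load()

// Kafka rows expose key/value as binary columns, plus topic, partition, offset, timestamp
val messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "topic", "partition")

val query = messages.writeStream
  .format("console")
  .start()
query.awaitTermination()
```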

Utilities

| File | Purpose |
| --- | --- |
| util/DirectServerDemo.scala | Run this first as a Spark-free sanity check for the embedded server and clients. |
| util/EmbeddedKafkaServer.scala | Starting and stopping an embedded Kafka server, and creating and modifying topics. |
| util/EmbeddedZookeeper.scala | Starting and stopping an embedded Zookeeper. |
| util/PartitionMapAnalyzer.scala | Support for understanding how subscribed Kafka topics and their Kafka partitions map to partitions in the RDD that is emitted by the Spark stream. |
| util/SimpleKafkaClient.scala | Directly connect to Kafka without using Spark (see the sketch after this table). |
| util/SparkKafkaSink.scala | Support for publishing to a Kafka topic in parallel from Spark. |
| util/TemporaryDirectories.scala | Support for creating and cleaning up the temporary directories needed for the Kafka broker, Zookeeper and Spark streaming. |
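
As a point of reference for what these utilities wrap, talking to Kafka directly (the territory of SimpleKafkaClient) comes down to the standard Kafka client API. A minimal publishing sketch, with the broker address and topic as placeholders:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Publish a few records without involving Spark (broker address and topic are placeholders)
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
(1 to 10).foreach { i =>
  producer.send(new ProducerRecord[String, String]("exampleTopic", s"key_$i", s"value_$i"))
}
producer.close()
```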

Basic Examples

| File | What's Illustrated |
| --- | --- |
| SimpleStreaming.scala | A simple way to set up streaming from a Kafka topic. While this program also publishes to the topic, the publishing does not involve Spark. |
| ExceptionPropagation.scala | Shows how a call to awaitTermination() throws propagated exceptions (see the sketch after this table). |
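
A sketch of the exception-propagation pattern, assuming a `stream` and `ssc` set up as in the direct stream sketch earlier (before start() and awaitTermination() are called); the failing closure is contrived:

```scala
// A failure raised while processing a batch surfaces in awaitTermination()
stream.foreachRDD { rdd =>
  if (rdd.count() > 0) throw new RuntimeException("deliberate failure")
}

ssc.start()
try {
  ssc.awaitTermination()
} catch {
  case e: Exception =>
    println(s"*** exception propagated from the stream: $e")
    ssc.stop(stopSparkContext = false, stopGracefully = false)
}
```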

Partitioning Examples

Partitioning is an important factor in determining the scalability of Kafka-based streaming applications. This set of examples illustrates the relationships between several facets of partitioning:

  • The number of partitions in the RDD that is being published to a topic -- if indeed this involves an RDD, as the data is often published from a non-Spark application
  • The number of partitions of the topic itself (usually specified at topic creation)
  • The number of partitions in the RDDs created by the Kafka stream
  • Whether and how messages move between partitions when they are transferred

When running these examples, look for:

  • The topic partition number that is printed with each ConsumerRecord
  • After all the records are printed, the number of partitions in the resulting RDD and the size of each partition (a sketch of how to print this follows the list), for example:

*** 4 partitions
*** partition size = 253
*** partition size = 252
*** partition size = 258
*** partition size = 237
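
Output in this form can be produced with ordinary RDD operations; a sketch (not necessarily how the examples themselves compute it), assuming the `stream` from the earlier sketch:

```scala
// Print the partition structure of each RDD the stream produces
stream.foreachRDD { rdd =>
  println(s"*** got an RDD, size = ${rdd.count()}")
  println(s"*** ${rdd.getNumPartitions} partitions")
  // glom() gathers each partition into an array so its size can be reported
  rdd.glom().map(_.length).collect().foreach(size => println(s"*** partition size = $size"))
}
```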

Another way these examples differ from the basic examples above is that Spark is used to publish to the topic. Perhaps surprisingly, this is not completely straightforward, and relies on util/SparkKafkaSink.scala. An alternative approach to this can be found here.
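
The difficulty is that KafkaProducer is not serializable, so it cannot simply be created on the driver and used inside distributed operations. One widely used workaround, sketched below (and not necessarily how SparkKafkaSink works internally), is to create a producer inside each partition; the broker address and topic are placeholders:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.rdd.RDD

// Because KafkaProducer is not serializable, create one producer per RDD
// partition on the executors instead of shipping a producer from the driver.
def publish(rdd: RDD[String], topic: String, brokers: String): Unit = {
  rdd.foreachPartition { records =>
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    records.foreach(value => producer.send(new ProducerRecord[String, String](topic, value)))
    producer.close()
  }
}
```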

| File | What's Illustrated |
| --- | --- |
| SimpleStreamingFromRDD.scala | Data is published by Spark from an RDD, but is repartitioned even though the publishing RDD and the topic have the same number of partitions. |
| SendWithDifferentPartitioning.scala | Send to a topic with a different number of partitions. |
| ControlledPartitioning.scala | When publishing to the topic, explicitly assign each record to a partition. |
AddPartitionsWhileStreaming.scala

Partitions can be added to a Kafka topic dynamically. This example shows that an existing stream will not see the data published to the new partitions, and only when the existing streaming context is terminated and a new stream is started from a new context will that data be delivered.

The topic is created with three partitions, and so each RDD the stream produces has three partitions as well, even after two more partitions are added to the topic. This is what's received after the first 500 records are published to the topic while it has only three partitions:

[1] *** got an RDD, size = 500
[1] *** 3 partitions
[1] *** partition size = 155
[1] *** partition size = 173
[1] *** partition size = 172

When two partitions are added and another 500 messages are published, this is what's received (note both the number of partitions and the number of messages):

[1] *** got an RDD, size = 288
[1] *** 3 partitions
[1] *** partition size = 98
[1] *** partition size = 89
[1] *** partition size = 101

When a new stream is subsequently created, the RDDs produced have five partitions, but only two of them contain data, since the first stream has already drained all the data from the initial three partitions of the topic. Now all 500 messages (288 + 212) from the second set have been delivered.

[2] *** got an RDD, size = 212
[2] *** 5 partitions
[2] *** partition size = 0
[2] *** partition size = 0
[2] *** partition size = 0
[2] *** partition size = 112
[2] *** partition size = 100
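
For reference, growing a topic programmatically can be done with the Kafka AdminClient, though only with client versions newer than the 0.10.x line this project builds against; a hedged sketch against that newer API, with a placeholder broker and topic:

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, NewPartitions}

// Requires Kafka clients 1.0+; the 0.10.x clients used here predate AdminClient.createPartitions
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
val admin = AdminClient.create(props)

// Grow "exampleTopic" from 3 partitions to 5
admin.createPartitions(
  Collections.singletonMap("exampleTopic", NewPartitions.increaseTo(5))
).all().get()

admin.close()
```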

Other Examples

| File | What's Illustrated |
| --- | --- |
| MultipleConsumerGroups.scala | Two streams subscribing to the same topic via two consumer groups see all the same data. |
| MultipleStreams.scala | Two streams subscribing to the same topic via a single consumer group divide up the data. There's an interesting partitioning interaction here, as the streams each get data from two of the four topic partitions, and each produces RDDs with two partitions. |
MultipleTopics.scala

A single stream subscribing to two topics receives data from both of them. The partitioning behavior here is quite interesting:
  • The topics have three and six partitions respectively.
  • Each RDD has nine partitions.
  • Each RDD partition receives data from exactly one partition of one topic.
Hence the output of the PartitionMapAnalyzer:
*** got an RDD, size = 200
*** 9 partitions
*** partition 1 has 27 records
*** rdd partition = 1, topic = foo, topic partition = 0, record count = 27.
*** partition 2 has 15 records
*** rdd partition = 2, topic = bar, topic partition = 1, record count = 15.
*** partition 3 has 17 records
*** rdd partition = 3, topic = bar, topic partition = 0, record count = 17.
*** partition 4 has 39 records
*** rdd partition = 4, topic = foo, topic partition = 1, record count = 39.
*** partition 5 has 34 records
*** rdd partition = 5, topic = foo, topic partition = 2, record count = 34.
*** partition 6 has 11 records
*** rdd partition = 6, topic = bar, topic partition = 3, record count = 11.
*** partition 7 has 18 records
*** rdd partition = 7, topic = bar, topic partition = 4, record count = 18.
*** partition 8 has 20 records
*** rdd partition = 8, topic = bar, topic partition = 2, record count = 20.
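
The same mapping can also be read off the OffsetRange metadata that the direct stream attaches to each RDD, without touching the records at all. A sketch, assuming the `stream` from the earlier sketch:

```scala
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

// The direct stream gives a 1:1 mapping between RDD partitions and Kafka topic
// partitions; the OffsetRange metadata records which topic partition each RDD
// partition covers, and how many records it holds.
stream.foreachRDD { rdd =>
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.zipWithIndex.foreach { case (range, i) =>
    println(s"*** rdd partition = $i, topic = ${range.topic}, " +
      s"topic partition = ${range.partition}, record count = ${range.count()}")
  }
}
```
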
Timestamp.scala

Record timestamps were introduced into Kafka 0.10 as described in KIP-32 and KIP-33.

This example sets up two different topics that handle timestamps differently -- topic A has its timestamps set by the broker when it receives each record, while topic B passes through the timestamp provided in the record (either set programmatically when the record was created, as shown here, or otherwise set automatically by the producer).

Since each record carries information about where its timestamp originates, it's easy to subscribe to both topics in a single stream, and then examine the timestamp of every received record along with its type.
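
At the client API level this looks roughly like the following sketch (topic name, key and value are placeholders, and `stream` is assumed to be set up as in the earlier sketch):

```scala
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.record.TimestampType

// Producer side: attach an explicit timestamp to a record (send it with a
// KafkaProducer as in the earlier sketch); topic, key and value are placeholders.
val explicitTime: java.lang.Long = System.currentTimeMillis() - 60000
val timestamped = new ProducerRecord[String, String]("topicB", null, explicitTime, "key", "value")

// Consumer side: every ConsumerRecord carries a timestamp and its origin.
stream.foreachRDD { rdd =>
  rdd.foreach { record =>
    record.timestampType() match {
      case TimestampType.CREATE_TIME     => println(s"producer-supplied timestamp: ${record.timestamp()}")
      case TimestampType.LOG_APPEND_TIME => println(s"broker-assigned timestamp: ${record.timestamp()}")
      case _                             => println("no timestamp")
    }
  }
}
```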

NOTE: The use of timestamps to filter topics in the broker, as introduced in Kafka 0.10.1, is blocked on SPARK-18057.
