Ranlot / spark-streaming-visualize

Licence: other
Simple demonstration of how to build a complex real time machine learning visualization tool.

Programming Languages

Python
139335 projects - #7 most used programming language
Scala
5932 projects
Shell
77523 projects
HTML
75241 projects

Projects that are alternatives of or similar to spark-streaming-visualize

Data Accelerator
Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.
Stars: ✭ 247 (+1443.75%)
Mutual labels:  apache-spark, streaming-data
Awesome Kafka
A list about Apache Kafka
Stars: ✭ 397 (+2381.25%)
Mutual labels:  apache-spark, streaming-data
SparkProgrammingInScala
Apache Spark Course Material
Stars: ✭ 57 (+256.25%)
Mutual labels:  apache-spark
connected-component
Map Reduce Implementation of Connected Component on Apache Spark
Stars: ✭ 68 (+325%)
Mutual labels:  apache-spark
MQL5-JSON-API
Metaquotes MQL5 - JSON - API
Stars: ✭ 183 (+1043.75%)
Mutual labels:  zeromq
re-gent
A Distributed Clojure agent for running remote functions
Stars: ✭ 18 (+12.5%)
Mutual labels:  zeromq
icicle
Icicle Streaming Query Language
Stars: ✭ 16 (+0%)
Mutual labels:  streaming-data
spark-sql-internals
The Internals of Spark SQL
Stars: ✭ 331 (+1968.75%)
Mutual labels:  apache-spark
leaflet heatmap
A simple visualization of Huzhou call data. Assuming the data volume is too large to render a heatmap directly in the browser, the heatmap-rendering step is moved offline for computation and analysis. Apache Spark is used to process the data in parallel and then to render the heatmap, after which leafletjs loads the OpenStreetMap layer and the heatmap layer for good interactivity. The rendering is currently implemented with Apache Spark; perhaps Spark is not well suited to this kind of computation, or my algorithm is poorly designed, since the parallel computation is slower than single-machine computation. The Apache Spark heatmap rendering and computation code is here: https://github.com/yuanzhaokang/ParallelizeHeatmap.git .
Stars: ✭ 13 (-18.75%)
Mutual labels:  apache-spark
ZMQ.jl
Julia interface to ZMQ
Stars: ✭ 114 (+612.5%)
Mutual labels:  zeromq
OpenLogReplicator
Open Source Oracle database CDC written purely in C++. Reads transactions directly from database redo log files and streams in JSON or Protobuf format to: Kafka, RocketMQ, flat file, network stream (plain TCP/IP or ZeroMQ)
Stars: ✭ 112 (+600%)
Mutual labels:  zeromq
spark-transformers
Spark-Transformers: Library for exporting Apache Spark MLLIB models to use them in any Java application with no other dependencies.
Stars: ✭ 39 (+143.75%)
Mutual labels:  apache-spark
zerorpc-dotnet
A .NET implementation of ZeroRPC
Stars: ✭ 21 (+31.25%)
Mutual labels:  zeromq
pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Stars: ✭ 115 (+618.75%)
Mutual labels:  apache-spark
net.jgp.books.spark.ch01
Spark in Action, 2nd edition - chapter 1 - Introduction
Stars: ✭ 72 (+350%)
Mutual labels:  apache-spark
spark-gradle-template
Apache Spark in your IDE with gradle
Stars: ✭ 39 (+143.75%)
Mutual labels:  apache-spark
richflow
A Node.js and JavaScript synchronous data pipeline processing, data sharing and stream processing library. Actionable & Transformable Pipeline data processing.
Stars: ✭ 17 (+6.25%)
Mutual labels:  streaming-data
spark-operator
Operator for managing the Spark clusters on Kubernetes and OpenShift.
Stars: ✭ 129 (+706.25%)
Mutual labels:  apache-spark
transit
Massively real-time city transit streaming application
Stars: ✭ 20 (+25%)
Mutual labels:  streaming-data
pravega-samples
Sample Applications for Pravega.
Stars: ✭ 43 (+168.75%)
Mutual labels:  streaming-data

"Real-time predictive analytics" has emerged as a topic of growing interest in the data science community. One factor contributing to the appeal of statistical learning methods based on live streaming data is the ability to generate models that react and adapt themselves to non-stationary data distribution in real time (as opposed to batch processing that needs to retrain models periodically).

While numerous implementations of online machine learning algorithms are publicly available, it is not always easy to find candid demonstrations of how to incorporate them into a lightweight real-time visualization platform. Among its many applications, such a tool would offer not only deeper insight into the dynamics of the models but also the ability to be alerted quickly when models start to misbehave.

The purpose of this project is to provide a simple demonstration of how one may "hack" together such a flow of data. One should regard this as a basic do-it-yourself toy tutorial for getting started rather than a complete real-world implementation.

  • For the sake of simplicity, we prepare a synthetic data set consisting of random points (y, x1, x2) which approximately satisfy the linear relationship y = c1 x1 + c2 x2 + noise, where the coefficients (c1, c2) and the intensity of the noise serve as control parameters (a minimal sketch of this data model follows below).
  • Adopting supervised learning terminology, one may refer to y as a label and to each instance of (x1, x2) as a feature vector. The objective then becomes to recover the values of the coefficients (c1, c2) given the feature vectors and their labels.
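
The repository generates its data from a bash script, but the underlying data model is easy to illustrate in a few lines of Python. In this sketch the coefficient values, the noise level, and the helper name make_batch are illustrative assumptions, not values taken from the project:

```python
import numpy as np

# Illustrative control parameters; the repository's own values live in its scripts.
c1, c2 = 1.5, -0.8   # "true" regression coefficients to be recovered
noise_level = 0.1    # intensity of the additive noise

def make_batch(n=60):
    """Generate n random points (y, x1, x2) with y = c1*x1 + c2*x2 + noise."""
    x = np.random.uniform(-1.0, 1.0, size=(n, 2))
    y = c1 * x[:, 0] + c2 * x[:, 1] + noise_level * np.random.randn(n)
    return np.column_stack([y, x])

print(make_batch(3))  # three sample rows in (y, x1, x2) order
```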

  • In order to mimic streaming data, one can generate batches of feature vectors and labels (60 at a time in our case) and save them as new HDFS files every second or so in a directory that the Spark Streaming application uses as an input source.

(You can do this by running the bash script dataStreamer.sh directly from the command line; a Python equivalent is sketched below.)
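
The project drives this step with dataStreamer.sh; for readers who prefer Python, here is a sketch of the same idea. The directory name, file-naming scheme, and parameter values are assumptions, not taken from the script:

```python
import os
import time
import numpy as np

WATCH_DIR = "/tmp/streaming-input"     # hypothetical; must match the directory Spark watches
c1, c2, noise_level = 1.5, -0.8, 0.1   # same illustrative parameters as above

os.makedirs(WATCH_DIR, exist_ok=True)
for i in range(600):  # stream one 60-point batch per second for ~10 minutes
    x = np.random.uniform(-1.0, 1.0, size=(60, 2))
    y = c1 * x[:, 0] + c2 * x[:, 1] + noise_level * np.random.randn(60)
    # File-based streaming sources only pick up files that appear atomically,
    # so write under a hidden temporary name first and then rename into place.
    tmp = os.path.join(WATCH_DIR, f".batch-{i}.tmp")
    np.savetxt(tmp, np.column_stack([y, x]), delimiter=",")
    os.rename(tmp, os.path.join(WATCH_DIR, f"batch-{i}.csv"))
    time.sleep(1)
```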

  • Every time a new batch of data is produced, the Spark application applies a least-squares minimizer (StreamingLinearRegressionWithSGD in our case), which updates the regression coefficients (c1, c2).

(For simplicity, you can do this by running linearPublisher.scala directly from your IDE; a PySpark equivalent is sketched below.)
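
The project's trainer lives in linearPublisher.scala (Scala); as an illustration of the same MLlib call, here is a minimal PySpark sketch. The step size, iteration count, and directory path are assumptions:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD

sc = SparkContext("local[2]", "StreamingLinearRegressionDemo")
ssc = StreamingContext(sc, 1)  # one-second micro-batches

def parse(line):
    """Each line is 'y,x1,x2', as written by the data streamer above."""
    y, x1, x2 = (float(v) for v in line.split(","))
    return LabeledPoint(y, [x1, x2])

training = ssc.textFileStream("/tmp/streaming-input").map(parse)

model = StreamingLinearRegressionWithSGD(stepSize=0.5, numIterations=25)
model.setInitialWeights([0.0, 0.0])  # start the estimates of (c1, c2) at zero
model.trainOn(training)              # weights are refined on every new batch

# Print the latest (c1, c2) estimates on the driver after each batch.
training.foreachRDD(lambda rdd: print(model.latestModel().weights))

ssc.start()
ssc.awaitTermination()
```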

Of course, in a real-world scenario, generating real-time labels would probably have its own intrinsic ambiguities depending on the particular business you happen to operate in. Furthermore, the underlying data would not come from a simple bash script but from more sophisticated sources such as IoT devices or financial, weather, and social network feeds.

  • The final step consists of providing a real-time visualization of the model and of its history. This can be accomplished through the publish-subscribe messaging pattern using ZeroMQ. In our case, the Spark Streaming application acts as the publisher and communicates via a TCP socket with an HTTP web server, which acts as the subscriber and prepares a visual rendering of the dynamics of the model (localhost:5556).

(For this, you'll need to have started the Flask server by running flaskSubscriber.py; the messaging pattern is sketched below.)
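
In the project, the publisher lives inside the Scala streaming application and the subscriber inside flaskSubscriber.py; the sketch below shows the bare pyzmq pattern connecting the two ends. The JSON message format and function names are assumptions made for illustration; only the port (5556) comes from the description above:

```python
import json
import zmq

ctx = zmq.Context()

# --- publisher side (played by the Spark Streaming application) ---
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")  # the TCP port mentioned above

def publish_weights(weights):
    """Push the latest (c1, c2) estimates to all connected subscribers."""
    pub.send_string(json.dumps({"c1": weights[0], "c2": weights[1]}))

# --- subscriber side (played by the Flask web server) ---
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://localhost:5556")
sub.setsockopt_string(zmq.SUBSCRIBE, "")  # no topic filter: receive everything

def receive_weights():
    """Block until the next model update arrives, then decode it."""
    return json.loads(sub.recv_string())
```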

  • The illustration provides a cartoon summary of the flow of data described above.

Disclaimer:

As stated at the beginning, this project is intended to be a demonstration / tutorial showing that a complex visualization system requiring the wiring together of many disparate technologies can be accomplished quite simply in a few lines of code. As such, no special care has been given to "portability" or "professionalism". Rather, the whole enterprise should be considered a "hack" that may (hopefully) be a source of inspiration for others.
