IoT: Real-time Data Processing and Analytics using Apache Spark / Kafka

(Figure: Project overview)

Table of Contents

  1. Overview
  2. Format of sensor data
  3. Analysis of data
  4. Results

1. Overview

Use case
  • Analyzing U.S. nationwide temperatures from IoT sensors in real time
Project Scenario:
  • Multiple temperature sensors are deployed in each U.S. state
  • Each sensor regularly sends temperature data to a Kafka server in the AWS cloud (simulated by feeding 10,000 JSON messages through kafka-console-producer)
  • A Kafka client retrieves the streaming data every 3 seconds
  • PySpark processes and analyzes the data in real time using Spark Streaming, and shows the results (a minimal setup sketch follows the technology list below)
Key Technologies:
  • Apache Spark (Spark Streaming)
  • Apache Kafka
  • Python/PySpark
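
The end-to-end wiring of these pieces can be sketched as follows. This is a minimal outline, not the project's exact script: it assumes the spark-streaming-kafka-0-8 direct-stream API (the assembly JAR used in section 4), and the broker address and topic name mirror the spark-submit command shown there. jsonRDD is the DStream that every snippet in section 3 operates on.

import json
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    brokers, topic = sys.argv[1], sys.argv[2]   # e.g. localhost:9092 iotmsgs
    sc = SparkContext(appName="IoTDataProcessingAndAnalytics")
    ssc = StreamingContext(sc, 3)               # 3-second batch interval

    # The direct stream yields (key, value) pairs; the value is the JSON message
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    jsonRDD = kvs.map(lambda kv: json.loads(kv[1]))

    # ... the four analyses from section 3 go here, each printed with pprint() ...

    ssc.start()
    ssc.awaitTermination()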

2. Format of sensor data

I used simulated data for this project. iotsimulator.py generates JSON data in the format below.

<Example>

{
    "guid": "0-ZZZ12345678-08K",
    "destination": "0-AAA12345678",
    "state": "CA", 
    "eventTime": "2016-11-16T13:26:39.447974Z", 
    "payload": {
        "format": "urn:example:sensor:temp", 
        "data":{
            "temperature": 59.7
        }
    }
}
Field | Description
----- | -----------
guid | A globally unique identifier associated with a sensor
destination | An identifier of the destination that sensors send data to (a single fixed ID is used in this project)
state | A randomly chosen U.S. state; the same guid always has the same state
eventTime | The timestamp at which the data was generated
format | The format of the data
temperature | Calculated by continuously adding a random number (between -1.0 and 1.0) to each state's average annual temperature every time data is generated; averages are taken from https://www.currentresults.com/Weather/US/average-annual-state-temperatures.php
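
The temperature logic in the last row can be illustrated with a short sketch. This is a hypothetical simplification of iotsimulator.py, with a made-up subset of the state averages:

import random

# Hypothetical subset of average annual temperatures per state (see the URL above)
AVG_ANNUAL_TEMP = {"CA": 59.4, "FL": 70.7, "AK": 26.6}

# Each state's running temperature starts at its average annual value
current_temp = dict(AVG_ANNUAL_TEMP)

def next_temperature(state):
    # Continuously add a random number between -1.0 and 1.0 on every reading
    current_temp[state] = round(current_temp[state] + random.uniform(-1.0, 1.0), 1)
    return current_temp[state]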

To generate 10,000 sensor messages:

$ ./iotsimulator.py 10000 > testdata.txt
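
The generated file can then be fed to Kafka with kafka-console-producer, as described in the project scenario. A sketch, assuming the broker and topic from the spark-submit command in section 4:

$ kafka-console-producer.sh --broker-list localhost:9092 --topic iotmsgs < testdata.txt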

3. Analysis of data

In this project, I implemented four types of real-time analysis:

  • Average temperature by each state (Values sorted in descending order)
  • Total messages processed
  • Number of sensors by each state (Keys sorted in ascending order)
  • Total number of sensors

(1) Average temperature by each state (Values sorted in descending order)

avgTempByState = jsonRDD.map(lambda x: (x['state'], (x['payload']['data']['temperature'], 1))) \
                 .reduceByKey(lambda x,y: (x[0]+y[0], x[1]+y[1])) \
                 .map(lambda x: (x[0], x[1][0]/x[1][1])) 
sortedTemp = avgTempByState.transform(lambda x: x.sortBy(lambda y: y[1], False))
  • In the first .map operation, PySpark creates pair RDDs (k, v), where k is the value of the field state and v is the value of the field temperature paired with a count of 1
<Example>

('StateA', (50.0, 1))
('StateB', (20.0, 1))
('StateB', (21.0, 1))
('StateC', (70.0, 1))
('StateA', (52.0, 1))
('StateB', (22.0, 1))
...
  • In the next .reduceByKey operation, PySpark aggregates the values that share the same key and reduces them to a single entry per key
<Example>

('StateA', (102.0, 2))  
('StateB', (63.0, 3))  
('StateC', (70.0, 1))  
...
  • In the next .map operation, PySpark calculates the average temperature by dividing the sum of temperatures by the total count
<Example>

('StateA', 51.0)
('StateB', 21.0)
('StateC', 70.0)
...
  • Finally, PySpark sorts the entries by average temperature in descending order
<Example>

('StateC', 70.0)
('StateA', 51.0)
('StateB', 21.0)
...
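
The whole pipeline can also be verified locally on a plain RDD, outside Spark Streaming; since no DStream is involved there, sortBy is called directly instead of going through transform. A small sketch with made-up records:

from pyspark import SparkContext

sc = SparkContext("local[2]", "avg-temp-check")
records = [
    {"state": "StateA", "payload": {"data": {"temperature": 50.0}}},
    {"state": "StateB", "payload": {"data": {"temperature": 20.0}}},
    {"state": "StateB", "payload": {"data": {"temperature": 22.0}}},
]
avgTemp = sc.parallelize(records) \
            .map(lambda x: (x['state'], (x['payload']['data']['temperature'], 1))) \
            .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])) \
            .map(lambda x: (x[0], x[1][0] / x[1][1])) \
            .sortBy(lambda y: y[1], False)
print(avgTemp.collect())   # [('StateA', 50.0), ('StateB', 21.0)]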

(2) Total messages processed

from operator import add

messageCount = jsonRDD.map(lambda x: 1) \
                      .reduce(add) \
                      .map(lambda x: "Total number of messages: " + unicode(x))
  • Simply maps each message to a count of 1, then sums the counts with reduce(add) (hence the from operator import add); an alternative using the built-in count() follows
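
The same number can also be obtained with the built-in count() transformation, which emits one single-element RDD per batch; a sketch:

# Equivalent to the map/reduce(add) combination above
messageCount = jsonRDD.count() \
                      .map(lambda x: "Total number of messages: " + unicode(x))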

(3) Number of sensors by each state (Keys sorted in ascending order)

import re

numSensorsByState = jsonRDD.map(lambda x: (x['state'] + ":" + x['guid'], 1)) \
                           .reduceByKey(lambda a,b: a*b) \
                           .map(lambda x: (re.sub(r":.*", "", x[0]), x[1])) \
                           .reduceByKey(lambda a,b: a+b)
sortedSensorCount = numSensorsByState.transform(lambda x: x.sortBy(lambda y: y[0], True))
  • In the first .map operation, PySpark creates pair RDDs (k, v), where k is the values of the fields state and guid concatenated with ":", and v is a count of 1
<Example>

('StateB:0-ZZZ12345678-28F', 1)
('StateB:0-ZZZ12345678-30P', 1)
('StateA:0-ZZZ12345678-08K', 1)
('StateC:0-ZZZ12345678-60F', 1)
('StateA:0-ZZZ12345678-08K', 1)
('StateB:0-ZZZ12345678-30P', 1)
...
  • In the next .reduceByKey operation, PySpark aggregates entries with the same key and reduces them to a single entry; since every value is 1 and 1*1 = 1, the values stay 1 and the step simply deduplicates state:guid keys
<Example>

('StateB:0-ZZZ12345678-28F', 1)
('StateB:0-ZZZ12345678-30P', 1)
('StateA:0-ZZZ12345678-08K', 1)
('StateC:0-ZZZ12345678-60F', 1)
...
  • In the next .map operation, PySpark strips the ":" and the guid from each key, leaving only the state
<Example>

('StateB', 1)
('StateB', 1)
('StateA', 1)
('StateC', 1)
...
  • In the last .reduceByKey operation, PySpark aggregates entries with the same key and sums their counts into a single entry per state
<Example>

('StateB', 2)
('StateA', 1)
('StateC', 1)
...
  • Finally, PySpark sorts the entries by key (state) in ascending order
<Example>

('StateA', 1)
('StateB', 2)
('StateC', 1)
...
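
Because every value is 1 and 1*1 = 1, the reduceByKey(lambda a,b: a*b) step is purely a deduplication trick. An arguably clearer equivalent, sketched under the same assumptions, deduplicates the (state, guid) pairs explicitly with distinct():

# distinct() is an RDD operation, so it runs inside transform() on the DStream
numSensorsByState = jsonRDD.map(lambda x: (x['state'], x['guid'])) \
                           .transform(lambda rdd: rdd.distinct()) \
                           .map(lambda x: (x[0], 1)) \
                           .reduceByKey(lambda a, b: a + b)
sortedSensorCount = numSensorsByState.transform(lambda x: x.sortBy(lambda y: y[0], True))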

(4) Total number of sensors

sensorCount = jsonRDD.map(lambda x: (x['guid'], 1)) \
                     .reduceByKey(lambda a,b: a*b) \
                     .map(lambda x: 1) \
                     .reduce(add) \
                     .map(lambda x: "Total number of sensors: " + unicode(x))
  • In the first .map operation, PySpark creates pair RDDs (k, v), where k is the value of the field guid and v is a count of 1
<Example>

('0-ZZZ12345678-08K', 1)
('0-ZZZ12345678-28F', 1)
('0-ZZZ12345678-30P', 1)
('0-ZZZ12345678-60F', 1)
('0-ZZZ12345678-08K', 1)
('0-ZZZ12345678-30P', 1)
...
  • In the next .reduceByKey operation, PySpark aggregates entries with the same key and reduces them to a single entry; the values stay 1, so the step simply deduplicates guids
<Example>

('0-ZZZ12345678-08K', 1)
('0-ZZZ12345678-28F', 1)
('0-ZZZ12345678-30P', 1)
('0-ZZZ12345678-60F', 1)
...
  • In the final .map and .reduce operations, PySpark maps each remaining entry to a count of 1 and sums the counts, yielding the total number of distinct sensors (reducing the (guid, 1) tuples directly with add would concatenate the tuples rather than count them); an alternative sketch follows
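
With the same distinct() idea as in (3), the total can be written more directly; a sketch:

# Count the distinct guids per batch and format the result
sensorCount = jsonRDD.map(lambda x: x['guid']) \
                     .transform(lambda rdd: rdd.distinct()) \
                     .count() \
                     .map(lambda x: "Total number of sensors: " + unicode(x))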

4. Results

The following console output shows Spark Streaming processing and analyzing 10,000 sensor messages in real time.

[ec2-user@ip-172-31-9-184 ~]$ spark-submit --jars spark-streaming-kafka-0-8-assembly_2.11-2.0.0-preview.jar  \ 
./kafka-direct-iotmsg.py localhost:9092 iotmsgs

<snip>

-------------------------------------------
Time: 2016-11-21 13:30:06
-------------------------------------------

-------------------------------------------
Time: 2016-11-21 13:30:06
-------------------------------------------

-------------------------------------------
Time: 2016-11-21 13:30:06
-------------------------------------------

-------------------------------------------
Time: 2016-11-21 13:30:06
-------------------------------------------

-------------------------------------------
Time: 2016-11-21 13:30:09       <- Average temperature by each state (Values sorted in descending order)
-------------------------------------------
(u'FL', 70.70635838150288)
(u'HI', 70.59879999999998)
(u'LA', 67.0132911392405)
(u'TX', 64.63165467625899)
(u'GA', 64.22095808383233)
(u'AL', 63.29540229885056)
(u'MS', 62.92658730158729)
(u'SC', 62.889361702127644)
(u'AZ', 61.161951219512204)
(u'AR', 60.006074766355134)
(u'CA', 59.56944444444444)
(u'NC', 59.13968253968251)
(u'OK', 59.10108108108111)
(u'DC', 57.916810344827596)
(u'TN', 57.18434782608696)
(u'KY', 56.375510204081664)
(u'DE', 54.6767634854772)
(u'VA', 54.5506726457399)
(u'MD', 54.30196078431374)
(u'KS', 53.60306748466258)
(u'MO', 53.59634146341466)
(u'NM', 53.55384615384617)
(u'NJ', 52.90479452054793)
(u'IN', 52.55497382198954)
(u'IL', 51.9223958333333)
(u'WV', 51.89952380952379)
(u'OH', 50.52346368715085)
(u'NV', 50.38380281690144)
(u'RI', 49.90240963855423)
(u'PA', 49.61223404255321)
(u'UT', 49.00546448087432)
(u'CT', 48.47242990654204)
(u'NE', 47.96193548387097)
(u'OR', 47.908675799086716)
(u'WA', 47.88577777777777)
(u'MA', 47.81961722488036)
(u'IA', 47.54875621890548)
(u'SD', 45.449999999999996)
(u'CO', 45.16935483870966)
(u'NY', 44.81830985915495)
(u'MI', 44.58102564102565)
(u'ID', 44.56483050847461)
(u'NH', 43.39304347826085)
(u'MT', 43.05155709342561)
(u'WY', 42.9689655172414)
(u'VT', 42.668322981366465)
(u'WI', 41.81523809523809)
(u'ME', 41.695061728395046)
(u'MN', 40.348076923076924)
(u'ND', 40.23502538071064)
(u'AK', 26.85450819672129)

-------------------------------------------
Time: 2016-11-21 13:30:09   <- Total messages processed
-------------------------------------------
Total number of messages: 10000

-------------------------------------------
Time: 2016-11-21 13:30:09   <- Number of sensors by each state (Keys sorted in ascending order)
-------------------------------------------
(u'AK', 53)
(u'AL', 34)
(u'AR', 47)
(u'AZ', 40)
(u'CA', 28)
(u'CO', 37)
(u'CT', 41)
(u'DC', 44)
(u'DE', 50)
(u'FL', 39)
(u'GA', 34)
(u'HI', 50)
(u'IA', 45)
(u'ID', 41)
(u'IL', 42)
(u'IN', 41)
(u'KS', 35)
(u'KY', 42)
(u'LA', 36)
(u'MA', 44)
(u'MD', 43)
(u'ME', 38)
(u'MI', 41)
(u'MN', 42)
(u'MO', 50)
(u'MS', 50)
(u'MT', 57)
(u'NC', 41)
(u'ND', 40)
(u'NE', 33)
(u'NH', 41)
(u'NJ', 34)
(u'NM', 37)
(u'NV', 30)
(u'NY', 26)
(u'OH', 42)
(u'OK', 36)
(u'OR', 47)
(u'PA', 41)
(u'RI', 32)
(u'SC', 39)
(u'SD', 39)
(u'TN', 53)
(u'TX', 34)
(u'UT', 36)
(u'VA', 45)
(u'VT', 38)
(u'WA', 45)
(u'WI', 47)
(u'WV', 44)
(u'WY', 42)

-------------------------------------------
Time: 2016-11-21 13:30:09   <- Total number of sensors
-------------------------------------------
Total number of sensors: 2086

-------------------------------------------
Time: 2016-11-21 13:30:12
-------------------------------------------

-------------------------------------------
Time: 2016-11-21 13:30:12
-------------------------------------------

-------------------------------------------
Time: 2016-11-21 13:30:12
-------------------------------------------

-------------------------------------------
Time: 2016-11-21 13:30:12
-------------------------------------------

<snip>
