yennanliu / NYC_Taxi_Pipeline

License: other
Design/Implement stream/batch architecture on NYC taxi data | #DE

Programming Languages

Scala, Python, Shell

NYC Taxi Pipeline

INTRO

Architect batch/stream data processing systems on nyc-tlc-trip-records-data. The batch ETL process is E (extract : tlc-trip-record-data page -> S3) -> T (transform : S3 -> Spark) -> L (load : Spark -> MySQL); the stream process is Event -> Event digest -> Event storage. The system can then support calculations such as top drivers by area, orders by time window, latest top drivers, and top busy areas.

Batch data : nyc-tlc-trip-records-data

Stream data : TaxiEvent, streamed from file.

Please also check NYC_Taxi_Trip_Duration if you are interested in data science projects built on a similar taxi dataset.
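
To make the batch leg concrete, below is a minimal Scala/Spark sketch of the E -> T -> L flow described above. It is illustrative only, not the repo's actual code: the S3 path, CSV column names, and MySQL connection details are placeholder assumptions.

import org.apache.spark.sql.SparkSession

object BatchEtlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("BatchEtlSketch").getOrCreate()

    // E (extract): read raw trip records landed on S3 (placeholder path)
    val raw = spark.read
      .option("header", "true")
      .csv("s3a://some-bucket/yellow_tripdata_2017-01.csv")

    // T (transform): keep a few columns and drop bad rows (assumed column names)
    val trips = raw
      .select("VendorID", "tpep_pickup_datetime", "trip_distance", "total_amount")
      .filter("trip_distance > 0")

    // L (load): write the result to MySQL over JDBC (placeholder connection details)
    trips.write
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/nyc_taxi")
      .option("dbtable", "yellow_trips")
      .option("user", "root")
      .option("password", "<password>")
      .mode("append")
      .save()

    spark.stop()
  }
}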

Architecture

  • Architecture idea (Batch): (diagram in doc/)
  • Architecture idea (Stream): (diagram in doc/)

File structure

├── Dockerfile    : Scala spark Dockerfile
├── build.sbt     : Scala sbt build file
├── config        : configuration files for DB/Kafka/AWS..
├── data          : Raw/processed/output data (batch/stream)
├── doc           : All repo reference/doc/pic
├── elk           : ELK (Elasticsearch, Logstash, Kibana) config/scripts 
├── fluentd       : Fluentd help scripts
├── kafka         : Kafka help scripts
├── pyspark       : Legacy pipeline code (Python)
├── requirements.txt
├── script        : Help scripts (env/services) 
├── src           : Batch/stream process scripts (Scala)
└── utility       : Help scripts (pipeline)

Prerequisites

Quick start

Quick start (batch pipeline, manually)
# STEP 1) Download the dataset
bash script/download_sample_data.sh

# STEP 2) sbt build
sbt compile
sbt assembly

# STEP 3) Load data 
spark-submit \
 --class DataLoad.LoadReferenceData \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

spark-submit \
 --class DataLoad.LoadGreenTripData \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

spark-submit \
 --class DataLoad.LoadYellowTripData \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

# STEP 4) Transform data 
spark-submit \
 --class DataTransform.TransformGreenTaxiData \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

spark-submit \
 --class DataTransform.TransformYellowTaxiData \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

# STEP 5) Create view 
spark-submit \
 --class CreateView.CreateMaterializedView \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

# STEP 6) Save to JDBC (mysql)
spark-submit \
 --class SaveToDB.JDBCToMysql \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

# STEP 7) Save to Hive
spark-submit \
 --class SaveToHive.SaveMaterializedviewToHive \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar
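
For a sense of what the CreateView step produces, here is a hedged fragment of the kind of Spark SQL "materialized view" that could back the top-busy-areas calculation. The trips view, the pickup_zone column, and the output path are illustrative assumptions, not necessarily what CreateView.CreateMaterializedView uses.

// assumes a DataFrame `trips` produced by the load/transform steps above
trips.createOrReplaceTempView("trips")

// aggregate pickups per zone (assumed column name `pickup_zone`)
val topBusyAreas = spark.sql(
  """SELECT pickup_zone, COUNT(*) AS trip_cnt
    |FROM trips
    |GROUP BY pickup_zone
    |ORDER BY trip_cnt DESC
    |LIMIT 10""".stripMargin)

// persist the view so the SaveToDB / SaveToHive steps can pick it up
topBusyAreas.write.mode("overwrite").parquet("data/output/top_busy_areas")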

Quick start (stream pipeline, manually)
# STEP 1) sbt build
sbt compile
sbt assembly

# STEP 2) Create Taxi event
spark-submit \
 --class TaxiEvent.CreateBasicTaxiEvent \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

# check the event
curl localhost:44444

# STEP 3) Process Taxi event
spark-submit \
 --class EventLoad.SparkStream_demo_LoadTaxiEvent \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

# STEP 4) Send Taxi event to Kafka
# start zookeeper, kafka
brew services start zookeeper
brew services start kafka

# create kafka topic
kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic first_topic
kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic streams-taxi

# curl event to kafka producer
curl localhost:44444 | kafka-console-producer --broker-list 127.0.0.1:9092 --topic first_topic

# STEP 5) Spark process Kafka stream
spark-submit \
 --class KafkaEventLoad.LoadKafkaEventExample \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

# STEP 6) Spark process Kafka stream and write back to Kafka
spark-submit \
 --class KafkaEventLoad.LoadTaxiKafkaEventWriteToKafka \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

# STEP 7) Run Elasticsearch, Kibana, Logstash
# make sure curl localhost:44444 can get the taxi event
cd ~ 
kibana-7.6.1-darwin-x86_64/bin/kibana
elasticsearch-7.6.1/bin/elasticsearch
logstash-7.6.1/bin/logstash -f /Users/$USER/NYC_Taxi_Pipeline/elk/logstash/logstash_taxi_event_file.conf

# test insert toy data to logstash 
# (logstash config: elk/logstash.conf)
#nc 127.0.0.1 5000 < data/event_sample.json

# then visit the Kibana UI : localhost:5601
# then visit "management" -> "index_patterns" -> "Create index pattern"
# create a new index pattern : logstash-* (do not select timestamp as the filter)
# then visit the "discover" tab and check the data
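
For reference, the sketch below shows a minimal Spark Structured Streaming job of the shape STEP 5/6 run: read taxi events from the first_topic Kafka topic and write them back to the streams-taxi topic. It is an assumption about the job's shape, not the repo's KafkaEventLoad code, and it needs the spark-sql-kafka-0-10 package on the classpath.

import org.apache.spark.sql.SparkSession

object KafkaTaxiPassthroughSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("KafkaTaxiPassthroughSketch").getOrCreate()

    // read raw taxi events from the topic created in STEP 4
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "127.0.0.1:9092")
      .option("subscribe", "first_topic")
      .load()

    // Kafka delivers bytes; cast key/value to strings before processing
    val parsed = events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // write the stream back out to the streams-taxi topic
    val query = parsed.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "127.0.0.1:9092")
      .option("topic", "streams-taxi")
      .option("checkpointLocation", "/tmp/taxi-checkpoint") // required by the Kafka sink
      .start()

    query.awaitTermination()
  }
}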

Dependency

  1. Spark 2.4.3

  2. Java 8

  3. Apache Hadoop 2.7

  4. Jars

  5. build.sbt
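
An illustrative build.sbt excerpt consistent with the versions listed above; the repo's actual build.sbt may pin different artifacts and versions:

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // Spark core/SQL, matching the Spark 2.4.3 dependency above
  "org.apache.spark" %% "spark-core" % "2.4.3" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.3" % "provided",
  // Kafka source/sink used by the stream pipeline (assumed)
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.3",
  // JDBC driver for the MySQL load step (assumed version)
  "mysql" % "mysql-connector-java" % "8.0.17"
)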

Ref

  • ref.md - dataset link ref, code ref, other ref
  • doc - All ref docs

TODO

# 1. Tune the main pipeline for large-scale data (to process the whole nyc-tlc-trip dataset)
# 2. Add a front-end UI (Flask) to visualize supply & demand and surge pricing
# 3. Add tests
# 4. Dockerize the project
# 5. Tune the Spark batch/stream code
# 6. Tune the Kafka, ZooKeeper cluster settings