tikal-fuseday / Delta Architecture

Streaming data changes to a Data Lake with Debezium and Delta Lake pipeline

Projects that are alternatives of or similar to Delta Architecture

Wedatasphere
WeDataSphere is a financial-grade, one-stop open-source suite for big data platforms. Currently the source code of Scriptis and Linkis has already been released to the open-source community. WeDataSphere, Big Data Made Easy!
Stars: ✭ 372 (+765.12%)
Mutual labels:  kafka, spark
Kafka Streams
equivalent to kafka-streams 🐙 for nodejs ✨🐢🚀✨
Stars: ✭ 613 (+1325.58%)
Mutual labels:  streams, kafka
Agile data code 2
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
Stars: ✭ 413 (+860.47%)
Mutual labels:  kafka, spark
Zat
Zeek Analysis Tools (ZAT): Processing and analysis of Zeek network data with Pandas, scikit-learn, Kafka and Spark
Stars: ✭ 303 (+604.65%)
Mutual labels:  kafka, spark
Bigdataguide
Big data study materials: learn big data from scratch, with learning videos for each stage and interview resources
Stars: ✭ 817 (+1800%)
Mutual labels:  kafka, spark
Wirbelsturm
Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.
Stars: ✭ 332 (+672.09%)
Mutual labels:  kafka, spark
Sparta
Real Time Analytics and Data Pipelines based on Spark Streaming
Stars: ✭ 513 (+1093.02%)
Mutual labels:  kafka, spark
Data Accelerator
Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.
Stars: ✭ 247 (+474.42%)
Mutual labels:  kafka, spark
Kafka Storm Starter
Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.
Stars: ✭ 728 (+1593.02%)
Mutual labels:  kafka, spark
Freestyle
A cohesive & pragmatic framework of FP centric Scala libraries
Stars: ✭ 627 (+1358.14%)
Mutual labels:  kafka, spark
Kafka Ui
Open-Source Web GUI for Apache Kafka Management
Stars: ✭ 230 (+434.88%)
Mutual labels:  streams, kafka
Dockerfiles
50+ DockerHub public images for Docker & Kubernetes - Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak, TeamCity and DevOps tools built on the major Linux distros: Alpine, CentOS, Debian, Fedora, Ubuntu
Stars: ✭ 847 (+1869.77%)
Mutual labels:  kafka, spark
Kafka Book
Companion code for the book 《Kafka技术内幕》 (Kafka Technology Internals)
Stars: ✭ 175 (+306.98%)
Mutual labels:  streams, kafka
Real Time Stream Processing Engine
This is an example of real time stream processing using Spark Streaming, Kafka & Elasticsearch.
Stars: ✭ 37 (-13.95%)
Mutual labels:  kafka, spark
Every Single Day I Tldr
A daily digest of the articles or videos I've found interesting, that I want to share with you.
Stars: ✭ 249 (+479.07%)
Mutual labels:  kafka, spark
God Of Bigdata
Focused on big data study and interview preparation; the road to big data mastery starts here. Flink/Spark/Hadoop/HBase/Hive...
Stars: ✭ 6,008 (+13872.09%)
Mutual labels:  kafka, spark
Gimel
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Stars: ✭ 216 (+402.33%)
Mutual labels:  kafka, spark
Video Stream Analytics
Stars: ✭ 240 (+458.14%)
Mutual labels:  kafka, spark
Go Streams
A lightweight stream processing library for Go
Stars: ✭ 615 (+1330.23%)
Mutual labels:  streams, kafka
Szt Bigdata
A big data passenger-flow analysis system for the Shenzhen Metro 🚇🚄🌟
Stars: ✭ 826 (+1820.93%)
Mutual labels:  kafka, spark

WORK-IN-PROGRESS

delta-architecture

Streaming data changes to a Data Lake with Debezium and Delta Lake pipeline (Medium.com)

This is an example end-to-end project that demonstrates a combined Debezium and Delta Lake pipeline

See the Medium post for more details

High Level Strategy Overview

  • Debezium reads the database logs, produces JSON messages that describe the changes, and streams them to Kafka
  • Kafka streams the messages and stores them in an S3 folder. We call this the Bronze table, as it stores the raw messages
  • Using Spark with Delta Lake, we transform the messages into INSERT, UPDATE, and DELETE operations and apply them to the target data lake table. This table holds the latest state of all source databases; we call it the Silver table
  • Next, we can perform further aggregations on the Silver table for analytics; the result is the Gold table
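Each Debezium message wraps a change in an envelope whose `op` field marks the operation type ("c" create, "u" update, "d" delete, "r" snapshot read). A minimal plain-Python sketch of the Bronze-to-Silver mapping step (not the project's actual notebook code; the table and field names are illustrative):

```python
import json

# Map Debezium envelope "op" codes to SQL-style operations.
OP_MAP = {"c": "INSERT", "r": "INSERT", "u": "UPDATE", "d": "DELETE"}

def classify_change(message: str):
    """Parse a Debezium JSON message and return (operation, row)."""
    envelope = json.loads(message)["payload"]
    op = OP_MAP[envelope["op"]]
    # For deletes the new state ("after") is null; fall back to the old row.
    row = envelope["after"] if envelope["after"] is not None else envelope["before"]
    return op, row

# Example envelope for an UPDATE on a hypothetical "voters" table.
msg = json.dumps({"payload": {
    "op": "u",
    "before": {"id": 1, "name": "Ada", "poll": "A"},
    "after": {"id": 1, "name": "Ada", "poll": "B"},
}})
print(classify_change(msg))  # ('UPDATE', {'id': 1, 'name': 'Ada', 'poll': 'B'})
```

In the real pipeline this classification feeds a Delta Lake MERGE into the Silver table, so the table always reflects the latest source state.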

Components

  • compose: Docker-Compose configuration that deploys containers with the Debezium stack (Kafka, ZooKeeper and Kafka-Connect), reads changes from the source databases, and streams them to S3
  • voter-processing: Notebook with PySpark code that transforms Debezium messages into INSERT, UPDATE and DELETE operations
  • fake_it: A simulator of a voters book application's database with live input, used for the end-to-end example
  • analytics: A Spark job that simulates reading all historical versions from Delta Lake and then stores the latest data for each poll

Instructions

Start up docker compose

  • export DEBEZIUM_VERSION=1.0
  • cd compose
  • docker-compose up -d

Config Debezium connector

curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8084/connectors/ -d @debezium/config.json
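The repository's actual debezium/config.json is not shown here, but a Debezium 1.0 MySQL connector registration typically looks something like the following (hostnames, credentials, and database names below are illustrative placeholders, not the repo's real values):

```json
{
  "name": "voters-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "voters",
    "database.whitelist": "voters",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.voters"
  }
}
```

A successful registration returns HTTP 201; re-posting the same name returns 409 Conflict.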

Run spark notebook

Import the notebook file voter-processing/voter-processing.html into a Databricks Community account and follow the instructions inside the notebook

https://community.cloud.databricks.com/

TODO - To complete the end-to-end example flow

  • Change voter-processing from a notebook to a PySpark application
  • Add the PySpark application to the Docker-Compose
  • Change the configurations so that Kafka writes to the local file system instead of S3
  • Change the Spark application so that it reads Kafka's output instead of generating its own mock data
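Once Kafka's sink writes newline-delimited JSON to the local file system, the Spark application could pick up that output with logic along these lines (a plain-Python sketch of the file-scanning step; the directory layout and message fields are assumptions, not the project's implemented behavior):

```python
import json
import os
import tempfile

def read_change_messages(directory):
    """Read newline-delimited Debezium JSON files from a local folder."""
    records = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name)) as fh:
            for line in fh:
                if line.strip():
                    records.append(json.loads(line))
    return records

# Demonstrate with a throwaway directory standing in for the Kafka sink output.
with tempfile.TemporaryDirectory() as out_dir:
    with open(os.path.join(out_dir, "part-0000.json"), "w") as fh:
        fh.write(json.dumps({"payload": {"op": "c", "after": {"id": 1}}}) + "\n")
    messages = read_change_messages(out_dir)
    print(len(messages))  # 1
```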

What's Next?

Make this a configurable, generic tool that can be set up on top of any supported database
