tikal-fuseday / Delta Architecture

Streaming data changes to a Data Lake with Debezium and Delta Lake pipeline

Projects that are alternatives of or similar to Delta Architecture

Wedatasphere
WeDataSphere is a financial-grade, one-stop open-source suite for big data platforms. Currently the source code of Scriptis and Linkis has already been released to the open-source community. WeDataSphere, Big Data Made Easy!
Stars: ✭ 372 (+765.12%)
Mutual labels:  kafka, spark
Kafka Streams
equivalent to kafka-streams 🐙 for nodejs ✨🐢🚀✨
Stars: ✭ 613 (+1325.58%)
Mutual labels:  streams, kafka
Agile data code 2
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
Stars: ✭ 413 (+860.47%)
Mutual labels:  kafka, spark
Zat
Zeek Analysis Tools (ZAT): Processing and analysis of Zeek network data with Pandas, scikit-learn, Kafka and Spark
Stars: ✭ 303 (+604.65%)
Mutual labels:  kafka, spark
Bigdataguide
Big data study materials: learn big data from scratch, with learning videos for each stage and interview resources
Stars: ✭ 817 (+1800%)
Mutual labels:  kafka, spark
Wirbelsturm
Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.
Stars: ✭ 332 (+672.09%)
Mutual labels:  kafka, spark
Sparta
Real Time Analytics and Data Pipelines based on Spark Streaming
Stars: ✭ 513 (+1093.02%)
Mutual labels:  kafka, spark
Data Accelerator
Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.
Stars: ✭ 247 (+474.42%)
Mutual labels:  kafka, spark
Kafka Storm Starter
Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.
Stars: ✭ 728 (+1593.02%)
Mutual labels:  kafka, spark
Freestyle
A cohesive & pragmatic framework of FP centric Scala libraries
Stars: ✭ 627 (+1358.14%)
Mutual labels:  kafka, spark
Kafka Ui
Open-Source Web GUI for Apache Kafka Management
Stars: ✭ 230 (+434.88%)
Mutual labels:  streams, kafka
Dockerfiles
50+ DockerHub public images for Docker & Kubernetes - Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak, TeamCity and DevOps tools built on the major Linux distros: Alpine, CentOS, Debian, Fedora, Ubuntu
Stars: ✭ 847 (+1869.77%)
Mutual labels:  kafka, spark
Kafka Book
Companion code for the book 《Kafka技术内幕》 (Kafka Technology Internals)
Stars: ✭ 175 (+306.98%)
Mutual labels:  streams, kafka
Real Time Stream Processing Engine
This is an example of real time stream processing using Spark Streaming, Kafka & Elasticsearch.
Stars: ✭ 37 (-13.95%)
Mutual labels:  kafka, spark
Every Single Day I Tldr
A daily digest of the articles or videos I've found interesting, that I want to share with you.
Stars: ✭ 249 (+479.07%)
Mutual labels:  kafka, spark
God Of Bigdata
Focused on big data study and interview preparation; the road to big data mastery starts here. Flink/Spark/Hadoop/HBase/Hive...
Stars: ✭ 6,008 (+13872.09%)
Mutual labels:  kafka, spark
Gimel
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Stars: ✭ 216 (+402.33%)
Mutual labels:  kafka, spark
Video Stream Analytics
Stars: ✭ 240 (+458.14%)
Mutual labels:  kafka, spark
Go Streams
A lightweight stream processing library for Go
Stars: ✭ 615 (+1330.23%)
Mutual labels:  streams, kafka
Szt Bigdata
A big data passenger-flow analysis system for the Shenzhen Metro 🚇🚄🌟
Stars: ✭ 826 (+1820.93%)
Mutual labels:  kafka, spark

WORK-IN-PROGRESS

delta-architecture

Streaming data changes to a Data Lake with Debezium and Delta Lake pipeline (Medium.com)

This is an example end-to-end project that demonstrates a combined Debezium and Delta Lake pipeline

See the Medium post for more details

High Level Strategy Overview

  • Debezium reads the database logs, produces JSON messages that describe the changes, and streams them to Kafka
  • Kafka streams the messages and stores them in an S3 folder. We call this the Bronze table, as it stores the raw messages
  • Using Spark with Delta Lake, we transform the messages into INSERT, UPDATE, and DELETE operations and apply them to the target data lake table. This table holds the latest state of all source databases; we call it the Silver table
  • Next, we can perform further aggregations on the Silver table for analytics; the result is the Gold table
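Each Debezium message wraps a change in an envelope whose `op` field marks the operation type ("c" create, "u" update, "d" delete, "r" snapshot read). A minimal plain-Python sketch of the Bronze-to-Silver mapping step (not the project's actual notebook code; the table and field names are illustrative):

```python
import json

# Map Debezium envelope "op" codes to SQL-style operations.
OP_MAP = {"c": "INSERT", "r": "INSERT", "u": "UPDATE", "d": "DELETE"}

def classify_change(message: str):
    """Parse a Debezium JSON message and return (operation, row)."""
    envelope = json.loads(message)["payload"]
    op = OP_MAP[envelope["op"]]
    # For deletes the new state ("after") is null; fall back to the old row.
    row = envelope["after"] if envelope["after"] is not None else envelope["before"]
    return op, row

# Example envelope for an UPDATE on a hypothetical "voters" table.
msg = json.dumps({"payload": {
    "op": "u",
    "before": {"id": 1, "name": "Ada", "poll": "A"},
    "after": {"id": 1, "name": "Ada", "poll": "B"},
}})
print(classify_change(msg))  # ('UPDATE', {'id': 1, 'name': 'Ada', 'poll': 'B'})
```

In the real pipeline this classification feeds a Delta Lake MERGE into the Silver table, so the table always reflects the latest source state.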

Components

  • compose: Docker-Compose configuration that deploys containers with the Debezium stack (Kafka, ZooKeeper and Kafka-Connect), reads changes from the source databases, and streams them to S3
  • voter-processing: Notebook with PySpark code that transforms Debezium messages into INSERT, UPDATE and DELETE operations
  • fake_it: A simulator of a voters book application's database with live input, used for the end-to-end example
  • analytics: A Spark job that simulates reading all historical versions from Delta Lake and then stores the latest data for each poll

Instructions

Start up docker compose

  • export DEBEZIUM_VERSION=1.0
  • cd compose
  • docker-compose up -d

Config Debezium connector

curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8084/connectors/ -d @debezium/config.json
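The repository's actual debezium/config.json is not shown here, but a Debezium 1.0 MySQL connector registration typically looks something like the following (hostnames, credentials, and database names below are illustrative placeholders, not the repo's real values):

```json
{
  "name": "voters-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "voters",
    "database.whitelist": "voters",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.voters"
  }
}
```

A successful registration returns HTTP 201; re-posting the same name returns 409 Conflict.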

Run spark notebook

Import the notebook file voter-processing/voter-processing.html into a Databricks Community account and follow the instructions inside the notebook

https://community.cloud.databricks.com/

TODO - To complete the end-to-end example flow

  • Change voter-processing from a notebook to a PySpark application
  • Add the PySpark application to the Docker-Compose
  • Change the configurations so that Kafka writes to the local file system instead of S3
  • Change the Spark application so that it reads Kafka's output instead of generating its own mock data
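Once Kafka's sink writes newline-delimited JSON to the local file system, the Spark application could pick up that output with logic along these lines (a plain-Python sketch of the file-scanning step; the directory layout and message fields are assumptions, not the project's implemented behavior):

```python
import json
import os
import tempfile

def read_change_messages(directory):
    """Read newline-delimited Debezium JSON files from a local folder."""
    records = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name)) as fh:
            for line in fh:
                if line.strip():
                    records.append(json.loads(line))
    return records

# Demonstrate with a throwaway directory standing in for the Kafka sink output.
with tempfile.TemporaryDirectory() as out_dir:
    with open(os.path.join(out_dir, "part-0000.json"), "w") as fh:
        fh.write(json.dumps({"payload": {"op": "c", "after": {"id": 1}}}) + "\n")
    messages = read_change_messages(out_dir)
    print(len(messages))  # 1
```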

What's Next?

Make this a configurable, generic tool that can be set up on top of any supported database
