Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake, plus SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.

Stars: ✭ 39 (-64.55%)

Mutual labels: data-pipeline

qsv

CSVs sliced, diced & analyzed.

Stars: ✭ 438 (+298.18%)

Mutual labels: data-engineering

Azure-Certification-DP-200

Road to Azure Data Engineer Part-I: DP-200 - Implementing an Azure Data Solution

Stars: ✭ 54 (-50.91%)

Mutual labels: data-engineering

ob bulkstash

Bulk Stash is a Docker rclone service to sync or copy files between different storage services. For example, you can copy files to or from a remote storage service such as Amazon S3 or Google Cloud Storage, or from your laptop to remote storage.

Stars: ✭ 113 (+2.73%)

Mutual labels: data-pipeline

get smarties

Dummy variable generation with fit/transform capabilities

Stars: ✭ 23 (-79.09%)

Mutual labels: data-engineering

saisoku

Saisoku is a Python module that helps you build complex pipelines of batch file/directory transfer/sync jobs.

Stars: ✭ 40 (-63.64%)

Mutual labels: data-pipeline

awesome-bigquery-views

Useful SQL queries for Blockchain ETL datasets in BigQuery.

Stars: ✭ 325 (+195.45%)

Mutual labels: data-engineering

lrmr

Less-Resilient MapReduce framework for Go

Stars: ✭ 32 (-70.91%)

Mutual labels: data-engineering

datart

Datart is a next generation Data Visualization Open Platform

Stars: ✭ 1,042 (+847.27%)

Mutual labels: data-engineering

etl

[READ-ONLY] PHP - ETL (Extract Transform Load) data processing library

Stars: ✭ 279 (+153.64%)

Mutual labels: data-engineering

morph-kgc

Powerful RDF Knowledge Graph Generation with [R2]RML Mappings

Stars: ✭ 77 (-30%)

Mutual labels: data-engineering

contessa

Easy way to define, execute and store quality rules for your data.

Stars: ✭ 17 (-84.55%)

Mutual labels: data-engineering

Everything-Tech

A collection of online resources to help you on your Tech journey.

Stars: ✭ 396 (+260%)

Mutual labels: data-engineering

machine-learning-data-pipeline

Pipeline module for parallel real-time data processing, for machine learning model development and production use.

Stars: ✭ 22 (-80%)

Mutual labels: data-pipeline


Practical Data Engineering Project

This is a practical example of a data engineering project using real-estate data. You can find the accompanying blog post, Building a Data Engineering Project in 20 Minutes, on my website. The topics are:

Getting the Data – Scraping with BeautifulSoup
Storing on S3-MinIO
Custom Change Data Capture (CDC)
Adding Database features to S3 – Delta Lake & Spark
Machine Learning part – Jupyter Notebook
Ingesting Data Warehouse for low latency – Apache Druid
The UI with Dashboards and more – Apache Superset
Orchestrating everything together – Dagster
DevOps engine – Kubernetes
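To make the custom Change Data Capture (CDC) step more concrete, here is a minimal sketch in Python. It is not the project's actual implementation: the function names (`fingerprint`, `detect_changes`), the record fields, and the listing IDs are all hypothetical. It assumes each scraped listing has a stable ID and that a content hash per listing was stored on the previous run.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Return a stable content hash for one record.

    Keys are sorted before hashing so that field order does not matter.
    """
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def detect_changes(previous: dict, scraped: dict) -> dict:
    """Compare stored fingerprints with a fresh scrape.

    previous: {listing_id: fingerprint} saved by the last run (hypothetical layout)
    scraped:  {listing_id: record} from the current run
    """
    current = {key: fingerprint(value) for key, value in scraped.items()}
    return {
        "inserted": sorted(k for k in current if k not in previous),
        "updated": sorted(
            k for k in current if k in previous and current[k] != previous[k]
        ),
        "deleted": sorted(k for k in previous if k not in current),
        # Persist these so the next run can diff against them.
        "fingerprints": current,
    }


# Hypothetical example data: one price change, one new and one removed listing.
old = {
    "a1": fingerprint({"price": 500000, "rooms": 3}),
    "b2": fingerprint({"price": 750000, "rooms": 4}),
}
new = {
    "a1": {"price": 480000, "rooms": 3},
    "c3": {"price": 900000, "rooms": 5},
}
changes = detect_changes(old, new)
print(changes["inserted"], changes["updated"], changes["deleted"])
# ['c3'] ['a1'] ['b2']
```

Only the inserted and updated listings would then need to be downloaded and written to storage again, which is the point of adding a CDC step in front of the heavier pipeline stages.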

You can find the status of the project here.

Starting Dagster

To get MinIO, Spark, Kubernetes, etc. ready, check the respective folder here.

MinIO started
Kubernetes ready
Spark image and role and namespaces ready
cd src/pipelines/real-estate and start Dagit by running dagit
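The checklist above could be verified with a small pre-flight script before launching Dagit. The following is only a sketch under stated assumptions: the helper names (`port_open`, `preflight`) are hypothetical, MinIO is assumed to listen on localhost at its default API port 9000, and `kubectl` and `dagit` are assumed to be installed as command-line tools. It does not check the Spark image, role, or namespaces.

```python
import shutil
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(minio_host: str = "localhost", minio_port: int = 9000) -> dict:
    """Run basic checks; host and port are assumptions, adjust to your setup."""
    return {
        "minio_reachable": port_open(minio_host, minio_port),
        "kubectl_on_path": shutil.which("kubectl") is not None,
        "dagit_on_path": shutil.which("dagit") is not None,
    }


if __name__ == "__main__":
    for check, ok in preflight().items():
        print(f"{check}: {'ok' if ok else 'missing'}")
```

If every check reports ok, change into src/pipelines/real-estate and start dagit as described.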


sspaeti-com / practical-data-engineering
