Top 96 data-engineering open source projects

Every Single Day I Tldr
A daily digest of the articles or videos I've found interesting, that I want to share with you.
Elastik Nearest Neighbors
Go to: https://github.com/alexklibisz/elastiknn
Ploomber
A convention over configuration workflow orchestrator. Develop locally (Jupyter or your favorite editor), deploy to Airflow or Kubernetes.
Gspread Pandas
A package to easily open an instance of a Google spreadsheet and interact with worksheets through Pandas DataFrames.
Aws Serverless Data Lake Framework
Enterprise-grade, production-hardened, serverless data lake on AWS
Soda Sql
Metric collection, data testing, and monitoring for SQL-accessible data
Yuniql
Free and open source schema versioning and database migration made natively with .NET Core.
Data Engineering Nanodegree
Projects done in the Data Engineering Nanodegree by Udacity.com
Gcp Data Engineer Exam
Study materials for the Google Cloud Professional Data Engineering Exam
Data Engineering Howto
A list of useful resources to learn Data Engineering from scratch
Accelerator
The Accelerator is a tool for fast and reproducible processing of large amounts of data.
Airflow Autoscaling Ecs
Airflow Deployment on AWS ECS Fargate Using Cloudformation
Pipelinex
PipelineX: Python package to build ML pipelines for experimentation with Kedro, MLflow, and more
Aws Data Wrangler
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Spark Alchemy
Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive
Sayn
Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).
Ansible Playbook
Ansible playbook to deploy distributed technologies
Waimak
Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
Quilt
Quilt is a self-organizing data hub for S3
Dbt Sqlserver
dbt adapter for SQL Server and Azure SQL
Data Science On Gcp
Source code accompanying the book Data Science on the Google Cloud Platform by Valliappa Lakshmanan (O'Reilly, 2017)
Lakefs
Git-like capabilities for your object storage
Goodreads etl pipeline
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Pyjanitor
Clean APIs for data cleaning. Python implementation of R package Janitor
Pyspark Example Project
Example project implementing best practices for PySpark ETL jobs and applications.
Pointblank
Data validation and organization of metadata for data frames and database tables
Data Engineering Book
Accumulated knowledge and experience in the field of Data Engineering
Udacity Data Engineering Projects
A few projects related to Data Engineering, including Data Modeling, cloud infrastructure setup, Data Warehousing, and Data Lake development.
Active workflow
Turn complex requirements into workflows without leaving the comfort of your technology stack.
Awesome Opensource Data Engineering
An Awesome List of Open-Source Data Engineering Projects
Learn Something Every Day
📝 A compilation of everything that I learn: Computer Science, Software Development, Engineering, Math, and Coding in General.
Dataform
Dataform is a framework for managing SQL-based data operations in BigQuery, Snowflake, and Redshift
Egeria
Open Metadata and Governance
etl manager
A Python package to create a database on the platform using our moj data warehousing framework
ClassifyBot
Automate building ML classification pipelines in .NET
pangeo-forge-recipes
Python library for building Pangeo Forge recipes.
yt-channels-DS-AI-ML-CS
A comprehensive list of 180+ YouTube Channels for Data Science, Data Engineering, Machine Learning, Deep learning, Computer Science, programming, software engineering, etc.
mpc-DL-controller
Deep Neural Network architecture as a predictive optimal controller for a disturbance-afflicted {HVAC + solar cell + battery} system, compared with classic Model Predictive Control
DataEngineering
This repo contains commands that data engineers use in day-to-day work.