gunnarmorling / Awesome Opensource Data Engineering

Licence: other
An Awesome List of Open-Source Data Engineering Projects

= Awesome Open-Source Data Engineering
:toc:
:toc-placement!:

This https://github.com/topics/awesome-list[Awesome List] aims to provide an overview of https://opensource.org/licenses[open-source] projects related to data engineering. This is a community effort: please https://github.com/gunnarmorling/awesome-opensource-data-engineering/blob/master/CONTRIBUTING.md[contribute] and send your pull requests to grow this list! For a list that includes non-OSS tools, see this amazing https://github.com/igorbarinov/awesome-data-engineering[Awesome List].

toc::[]

== Analytics

  • https://spark.apache.org/[Apache Spark] - A unified analytics engine for large-scale data processing. Includes APIs in Scala, Java, Python (known as PySpark), and R (SparkR).
  • https://beam.apache.org/[Apache Beam] - An open-source unified programming model for batch and streaming data processing, originating from Google Cloud Dataflow. Pipelines run on supported execution engines (runners), including Spark, Flink, or Beam's own DirectRunner. Provides SDKs for Java, Python, and Go.
  • https://flink.apache.org/[Apache Flink] - Stateful computations over data streams.

== Business Intelligence

== Change Data Capture

== Datastores

== Data Governance and Registries

== Data Virtualization

== Data Orchestration

  • https://github.com/Alluxio/alluxio[Alluxio] - Scalable, multi-tiered distributed caching for HDFS, S3, Ceph, NFS, and related filestores. Integrates with Spark, Hive, and Presto for SQL queries through a Catalog.

== Formats

== Integration

== Messaging Infrastructure

== Specifications and Standards

== Stream Processing

== Testing

== Versioning

== Workflow Management

== Related Resources

This section covers overview content only, not specific tools.

=== Slide Decks, Recordings and Podcasts

=== Blog Posts and Articles

=== Collections

== License

The contents of this repository are licensed under the "Creative Commons Attribution-ShareAlike 4.0 International License".