ButterfreeA tool for building feature stores.
Stars: ✭ 126 (+800%)
versatile-data-kitVersatile Data Kit (VDK) is an open source framework that enables anybody with basic SQL or Python knowledge to create their own data pipelines.
Stars: ✭ 144 (+928.57%)
polygon-etlETL (extract, transform and load) tools for ingesting Polygon blockchain data to Google BigQuery and Pub/Sub
Stars: ✭ 53 (+278.57%)
AirbyteAirbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
Stars: ✭ 4,919 (+35035.71%)
hamiltonA scalable general purpose micro-framework for defining dataflows. You can use it to create dataframes, numpy matrices, python objects, ML models, etc.
Stars: ✭ 612 (+4271.43%)
gallia-coreA schema-aware Scala library for data transformation
Stars: ✭ 44 (+214.29%)
blockchain-etl-streamingStreaming Ethereum and Bitcoin blockchain data to Google Pub/Sub or Postgres in Kubernetes
Stars: ✭ 57 (+307.14%)
beneathBeneath is a serverless real-time data platform ⚡️
Stars: ✭ 65 (+364.29%)
Aws Data WranglerPandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Stars: ✭ 2,385 (+16935.71%)
morph-kgcPowerful RDF Knowledge Graph Generation with [R2]RML Mappings
Stars: ✭ 77 (+450%)
Pyspark Example ProjectExample project implementing best practices for PySpark ETL jobs and applications.
Stars: ✭ 633 (+4421.43%)
etl[READ-ONLY] PHP - ETL (Extract Transform Load) data processing library
Stars: ✭ 279 (+1892.86%)
SetlA simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (+464.29%)
DataformDataform is a framework for managing SQL based data operations in BigQuery, Snowflake, and Redshift
Stars: ✭ 342 (+2342.86%)
SaynData processing and modelling framework for automating tasks (incl. Python & SQL transformations).
Stars: ✭ 79 (+464.29%)
BenthosFancy stream processing made operationally mundane
Stars: ✭ 3,705 (+26364.29%)
AirflowETLBlog post on ETL pipelines with Airflow
Stars: ✭ 20 (+42.86%)
uptasticsearchAn Elasticsearch client tailored to data science workflows.
Stars: ✭ 47 (+235.71%)
astroAstro allows rapid and clean development of {Extract, Load, Transform} workflows using Python and SQL, powered by Apache Airflow.
Stars: ✭ 79 (+464.29%)
AddaxAddax is an open source universal ETL tool that supports most of those RDBMS and NoSQLs on the planet, helping you transfer data from any one place to another.
Stars: ✭ 615 (+4292.86%)
DataEngineeringThis repo contains commands that data engineers use in day to day work.
Stars: ✭ 47 (+235.71%)
oesophagusEnterprise Grade Single-Step Streaming Data Infrastructure Setup. (Under Development)
Stars: ✭ 12 (-14.29%)
lineageGenerate beautiful documentation for your data pipelines in markdown format
Stars: ✭ 16 (+14.29%)
sync-addonsOdoo Integration Addons
Stars: ✭ 69 (+392.86%)
gamechanger-dataGAMECHANGER aspires to be the Department’s trusted solution for evidence-based, data-driven decision-making across the universe of DoD requirements
Stars: ✭ 17 (+21.43%)
DataXServer为DataX(https://github.com/alibaba/DataX) 提供远程多语言调用(ThriftServer,HttpServer) 分布式运行(DataX on YARN) 功能
Stars: ✭ 130 (+828.57%)
sparklanesA lightweight data processing framework for Apache Spark
Stars: ✭ 17 (+21.43%)
basinBasin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser
Stars: ✭ 25 (+78.57%)
ml-in-productionThe practical use-cases of how to make your Machine Learning Pipelines robust and reliable using Apache Airflow.
Stars: ✭ 29 (+107.14%)
openrefine-dockerOpenRefine is a free, open source power tool for working with messy data and improving it. This repository contains Dockerbuild files for automated builds.
Stars: ✭ 19 (+35.71%)
kafka-connect-datagenA Kafka Connect source connector that generates data for tests
Stars: ✭ 27 (+92.86%)
cardano-pyPython3 lib and cli for operating a Cardano Passive Node and using the API's. (PRE-ALPHA)
Stars: ✭ 17 (+21.43%)
jobAnalytics and searchJobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
Stars: ✭ 25 (+78.57%)
dbddbd is a database prototyping tool that enables data analysts and engineers to quickly load and transform data in SQL databases.
Stars: ✭ 30 (+114.29%)
etlM-Lab ingestion pipeline
Stars: ✭ 15 (+7.14%)
maxwell-sinkconsume maxwell generated message from kafka,export it to another mysql.
Stars: ✭ 16 (+14.29%)
mlbgamedayMulti-core processing of 'Gameday' data from Major League Baseball Advanced Media. Additional tools to parallelize large data sets and write them to a database.
Stars: ✭ 37 (+164.29%)
persistityA persistence framework for game developers
Stars: ✭ 34 (+142.86%)
bandar-logMonitoring tool to measure flow throughput of data sources and processing components that are part of Data Ingestion and ETL pipelines.
Stars: ✭ 20 (+42.86%)
ClassifyBotAutomate building ML classification pipelines in .NET
Stars: ✭ 16 (+14.29%)
cpp-can-isotpC++ implementation of CAN ISO 15765-2 also known as CAN ISO transport protocol. CPP CAN isotp.
Stars: ✭ 14 (+0%)
dswarman open-source data management platform for knowledge workers (https://github.com/dswarm/dswarm-documentation/wiki)
Stars: ✭ 57 (+307.14%)
openrefine-clientThe OpenRefine Python Client from Paul Makepeace provides a library for communicating with an OpenRefine server. This fork extends the command line interface (CLI) and is distributed as a convenient one-file-executable (Windows, Linux, Mac). It is also available via Docker Hub, PyPI and Binder.
Stars: ✭ 67 (+378.57%)
yt-channels-DS-AI-ML-CSA comprehensive list of 180+ YouTube Channels for Data Science, Data Engineering, Machine Learning, Deep learning, Computer Science, programming, software engineering, etc.
Stars: ✭ 1,038 (+7314.29%)
dflibIn-memory Java DataFrame library
Stars: ✭ 50 (+257.14%)
TEAMThe Taxonomy for ETL Automation Metadata (TEAM) is a metadata management tool for data warehouse automation. It is part of the ecosystem for data warehouse automation, alongside the Virtual Data Warehouse pattern manager and the generic schema for Data Warehouse Automation.
Stars: ✭ 27 (+92.86%)
mydataharbor🇨🇳 MyDataHarbor是一个致力于解决任意数据源到任意数据源的分布式、高扩展性、高性能、事务级的数据同步中间件。帮助用户可靠、快速、稳定的对海量数据进行准实时增量同步或者定时全量同步,主要定位是为实时交易系统服务,亦可用于大数据的数据同步(ETL领域)。
Stars: ✭ 28 (+100%)