AirbyteAirbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
Stars: ✭ 4,919 (+22259.09%)
versatile-data-kitVersatile Data Kit (VDK) is an open source framework that enables anybody with basic SQL or Python knowledge to create their own data pipelines.
Stars: ✭ 144 (+554.55%)
astroAstro allows rapid and clean development of {Extract, Load, Transform} workflows using Python and SQL, powered by Apache Airflow.
Stars: ✭ 79 (+259.09%)
morph-kgcPowerful RDF Knowledge Graph Generation with [R2]RML Mappings
Stars: ✭ 77 (+250%)
etl[READ-ONLY] PHP - ETL (Extract Transform Load) data processing library
Stars: ✭ 279 (+1168.18%)
gallia-coreA schema-aware Scala library for data transformation
Stars: ✭ 44 (+100%)
SetlA simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (+259.09%)
beneathBeneath is a serverless real-time data platform ⚡️
Stars: ✭ 65 (+195.45%)
SaynData processing and modelling framework for automating tasks (incl. Python & SQL transformations).
Stars: ✭ 79 (+259.09%)
DataformDataform is a framework for managing SQL based data operations in BigQuery, Snowflake, and Redshift
Stars: ✭ 342 (+1454.55%)
ButterfreeA tool for building feature stores.
Stars: ✭ 126 (+472.73%)
AirflowETLBlog post on ETL pipelines with Airflow
Stars: ✭ 20 (-9.09%)
uptasticsearchAn Elasticsearch client tailored to data science workflows.
Stars: ✭ 47 (+113.64%)
BenthosFancy stream processing made operationally mundane
Stars: ✭ 3,705 (+16740.91%)
dbddbd is a database prototyping tool that enables data analysts and engineers to quickly load and transform data in SQL databases.
Stars: ✭ 30 (+36.36%)
wikirepoPython based Wikidata framework for easy dataframe extraction
Stars: ✭ 33 (+50%)
polygon-etlETL (extract, transform and load) tools for ingesting Polygon blockchain data to Google BigQuery and Pub/Sub
Stars: ✭ 53 (+140.91%)
etl managerA python package to create a database on the platform using our moj data warehousing framework
Stars: ✭ 14 (-36.36%)
Aws Data WranglerPandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Stars: ✭ 2,385 (+10740.91%)
Pyspark Example ProjectExample project implementing best practices for PySpark ETL jobs and applications.
Stars: ✭ 633 (+2777.27%)
hamiltonA scalable general purpose micro-framework for defining dataflows. You can use it to create dataframes, numpy matrices, python objects, ML models, etc.
Stars: ✭ 612 (+2681.82%)
blockchain-etl-streamingStreaming Ethereum and Bitcoin blockchain data to Google Pub/Sub or Postgres in Kubernetes
Stars: ✭ 57 (+159.09%)
kozaData transformation framework for LinkML data models
Stars: ✭ 21 (-4.55%)
persistityA persistence framework for game developers
Stars: ✭ 34 (+54.55%)
mlbgamedayMulti-core processing of 'Gameday' data from Major League Baseball Advanced Media. Additional tools to parallelize large data sets and write them to a database.
Stars: ✭ 37 (+68.18%)
es2postgresElasticSearch to PostgreSQL loader
Stars: ✭ 18 (-18.18%)
dswarman open-source data management platform for knowledge workers (https://github.com/dswarm/dswarm-documentation/wiki)
Stars: ✭ 57 (+159.09%)
oesophagusEnterprise Grade Single-Step Streaming Data Infrastructure Setup. (Under Development)
Stars: ✭ 12 (-45.45%)
gamechanger-dataGAMECHANGER aspires to be the Department’s trusted solution for evidence-based, data-driven decision-making across the universe of DoD requirements
Stars: ✭ 17 (-22.73%)
dflibIn-memory Java DataFrame library
Stars: ✭ 50 (+127.27%)
mydataharbor🇨🇳 MyDataHarbor是一个致力于解决任意数据源到任意数据源的分布式、高扩展性、高性能、事务级的数据同步中间件。帮助用户可靠、快速、稳定的对海量数据进行准实时增量同步或者定时全量同步,主要定位是为实时交易系统服务,亦可用于大数据的数据同步(ETL领域)。
Stars: ✭ 28 (+27.27%)
lineageGenerate beautiful documentation for your data pipelines in markdown format
Stars: ✭ 16 (-27.27%)
PDAP-ScrapersCode relating to scraping public police data.
Stars: ✭ 72 (+227.27%)
viewflowViewflow is an Airflow-based framework that allows data scientists to create data models without writing Airflow code.
Stars: ✭ 110 (+400%)
openrefine-clientThe OpenRefine Python Client from Paul Makepeace provides a library for communicating with an OpenRefine server. This fork extends the command line interface (CLI) and is distributed as a convenient one-file-executable (Windows, Linux, Mac). It is also available via Docker Hub, PyPI and Binder.
Stars: ✭ 67 (+204.55%)
DataEngineeringThis repo contains commands that data engineers use in day to day work.
Stars: ✭ 47 (+113.64%)
DataBridge.NETConfigurable data bridge for permanent ETL jobs
Stars: ✭ 16 (-27.27%)
go-bqloaderbqloader is a simple ETL framework to load data from Cloud Storage into BigQuery.
Stars: ✭ 16 (-27.27%)
kuwalaKuwala is the no-code data platform for BI analysts and engineers enabling you to build powerful analytics workflows. We are set out to bring state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations together in one intuitive interface built with React Flow. In addition we provide third-party data into data sc…
Stars: ✭ 474 (+2054.55%)
cobrixA COBOL parser and Mainframe/EBCDIC data source for Apache Spark
Stars: ✭ 109 (+395.45%)
cubetlCubETL - Framework and tool for data ETL (Extract, Transform and Load) in Python (PERSONAL PROJECT / SELDOM MAINTAINED)
Stars: ✭ 21 (-4.55%)
openrefine-dockerOpenRefine is a free, open source power tool for working with messy data and improving it. This repository contains Dockerbuild files for automated builds.
Stars: ✭ 19 (-13.64%)
yt-channels-DS-AI-ML-CSA comprehensive list of 180+ YouTube Channels for Data Science, Data Engineering, Machine Learning, Deep learning, Computer Science, programming, software engineering, etc.
Stars: ✭ 1,038 (+4618.18%)
DataXServer为DataX(https://github.com/alibaba/DataX) 提供远程多语言调用(ThriftServer,HttpServer) 分布式运行(DataX on YARN) 功能
Stars: ✭ 130 (+490.91%)
DaFlowApache-Spark based Data Flow(ETL) Framework which supports multiple read, write destinations of different types and also support multiple categories of transformation rules.
Stars: ✭ 24 (+9.09%)
neon-workshopA Pachyderm deep learning tutorial for conference workshops
Stars: ✭ 19 (-13.64%)
sparklanesA lightweight data processing framework for Apache Spark
Stars: ✭ 17 (-22.73%)
wrangleA data transformation package for deep learning with Autonomio, Keras and TensorFlow.
Stars: ✭ 15 (-31.82%)
mikThe Move to Islandora Kit is an extensible PHP command-line tool for converting source content and metadata into packages suitable for importing into Islandora (or other digital repository and preservations systems).
Stars: ✭ 32 (+45.45%)
TEAMThe Taxonomy for ETL Automation Metadata (TEAM) is a metadata management tool for data warehouse automation. It is part of the ecosystem for data warehouse automation, alongside the Virtual Data Warehouse pattern manager and the generic schema for Data Warehouse Automation.
Stars: ✭ 27 (+22.73%)