736 Open source projects that are alternatives to or similar to lineage

Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser.

Stars: ✭ 25 (+56.25%)

Mutual labels: pipeline, etl, pyspark

sparklanes

A lightweight data processing framework for Apache Spark

Stars: ✭ 17 (+6.25%)

Mutual labels: pipeline, etl, pyspark

Metl

mito ETL tool

Stars: ✭ 153 (+856.25%)

Mutual labels: pipeline, etl

mydataharbor

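🇨🇳 MyDataHarbor is a distributed, highly scalable, high-performance, transaction-level data synchronization middleware that aims to move data from any source to any destination. It helps users reliably, quickly, and stably synchronize massive data volumes, either incrementally in near real time or in full on a schedule. It is aimed primarily at serving real-time transaction systems, and can also be used for big data synchronization (the ETL domain).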

Stars: ✭ 28 (+75%)

Mutual labels: pipeline, etl

Go Streams

A lightweight stream processing library for Go

Stars: ✭ 615 (+3743.75%)

Mutual labels: pipeline, etl

Bulk Writer

Provides guidance for fast ETL jobs, an IDataReader implementation for SqlBulkCopy (or the MySql or Oracle equivalents) that wraps an IEnumerable, and libraries for mapping entities to table columns.

Stars: ✭ 210 (+1212.5%)

Mutual labels: pipeline, etl

python mozetl

ETL jobs for Firefox Telemetry

Stars: ✭ 25 (+56.25%)

Mutual labels: etl, pyspark

Setl

A simple Spark-powered ETL framework that just works 🍺

Stars: ✭ 79 (+393.75%)

Mutual labels: pipeline, etl

Morphl Community Edition

MorphL Community Edition uses big data and machine learning to predict user behaviors in digital products and services with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization

Stars: ✭ 253 (+1481.25%)

Mutual labels: pipeline, pyspark

Example Airflow Dags

Example DAGs using hooks and operators from Airflow Plugins

Stars: ✭ 243 (+1418.75%)

Mutual labels: etl, dag

Serving

A flexible, high-performance carrier for machine learning models (PaddlePaddle serving deployment framework)

Stars: ✭ 403 (+2418.75%)

Mutual labels: pipeline, dag

Phila Airflow

Stars: ✭ 16 (+0%)

Mutual labels: pipeline, etl

Airbyte

Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.

Stars: ✭ 4,919 (+30643.75%)

Mutual labels: pipeline, etl

etl

M-Lab ingestion pipeline

Stars: ✭ 15 (-6.25%)

Mutual labels: pipeline, etl

Datavec

ETL Library for Machine Learning - data pipelines, data munging and wrangling

Stars: ✭ 272 (+1600%)

Mutual labels: pipeline, etl

Butterfree

A tool for building feature stores.

Stars: ✭ 126 (+687.5%)

Mutual labels: etl, pyspark

naas

⚙️ Schedule notebooks, run them like APIs, securely expose your assets: Jupyter as a viable ⚡️ Production environment

Stars: ✭ 219 (+1268.75%)

Mutual labels: pipeline, etl

Mara Pipelines

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

Stars: ✭ 1,841 (+11406.25%)

Mutual labels: pipeline, etl

Aws Ecs Airflow

Run Airflow in AWS ECS (Elastic Container Service) using Fargate tasks

Stars: ✭ 107 (+568.75%)

Mutual labels: etl, dag

Pyspark Example Project

Example project implementing best practices for PySpark ETL jobs and applications.

Stars: ✭ 633 (+3856.25%)

Mutual labels: etl, pyspark

Stetl

Stetl, Streaming ETL, is a lightweight geospatial processing and ETL framework written in Python.

Stars: ✭ 64 (+300%)

Mutual labels: pipeline, etl

datalake-etl-pipeline

Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations

Stars: ✭ 39 (+143.75%)

Mutual labels: etl, pyspark

hamilton

A scalable general purpose micro-framework for defining dataflows. You can use it to create dataframes, numpy matrices, python objects, ML models, etc.

Stars: ✭ 612 (+3725%)

Mutual labels: etl, dag

cobra-policytool

Manage Apache Atlas and Ranger configuration for your Hadoop environment.

Stars: ✭ 16 (+0%)

Mutual labels: atlas

dag

Simple DSL for executing functions in Go

Stars: ✭ 85 (+431.25%)

Mutual labels: dag

bump-everywhere

🚀 Automate versioning, changelog creation, README updates and GitHub releases using GitHub Actions, npm, docker or bash.

Stars: ✭ 24 (+50%)

Mutual labels: pipeline

kubecrypt

Helper for dealing with secrets in kubernetes.

Stars: ✭ 23 (+43.75%)

Mutual labels: pipeline

pipe-trait

Make it possible to chain regular functions

Stars: ✭ 22 (+37.5%)

Mutual labels: pipeline

ruby-for-pentaho-kettle

Ruby scripting for pentaho-kettle

Stars: ✭ 42 (+162.5%)

Mutual labels: etl

textureatlas

A simple, cross-platform Python-based tool and C library for creating and using a texture atlas in your application or game. Distributed under the terms of the MIT license.

Stars: ✭ 20 (+25%)

Mutual labels: atlas

persistity

A persistence framework for game developers

Stars: ✭ 34 (+112.5%)

Mutual labels: etl

flamingo

FreeCAD - flamingo workbench

Stars: ✭ 30 (+87.5%)

Mutual labels: pipeline

dnaPipeTE

dnaPipeTE (for de-novo assembly & annotation Pipeline for Transposable Elements) is a pipeline designed to find, annotate and quantify Transposable Elements in small samples of NGS datasets. It is very useful for quantifying the proportion of TEs in newly sequenced genomes, since it does not require genome assembly and works on small datasets (< 1X).

Stars: ✭ 28 (+75%)

Mutual labels: pipeline

Spark-for-data-engineers

Apache Spark for data engineers

Stars: ✭ 22 (+37.5%)

Mutual labels: pyspark

katana-skipper

Simple and flexible ML workflow engine

Stars: ✭ 234 (+1362.5%)

Mutual labels: pipeline

kafka-connect-datagen

A Kafka Connect source connector that generates data for tests

Stars: ✭ 27 (+68.75%)

Mutual labels: etl

dswarm

an open-source data management platform for knowledge workers (https://github.com/dswarm/dswarm-documentation/wiki)

Stars: ✭ 57 (+256.25%)

Mutual labels: etl

check-engine

Data validation library for PySpark 3.0.0

Stars: ✭ 29 (+81.25%)

Mutual labels: pyspark

bacannot

Generic but comprehensive pipeline for prokaryotic genome annotation and interrogation with interactive reports and shiny app.

Stars: ✭ 51 (+218.75%)

Mutual labels: pipeline

gallia-core

A schema-aware Scala library for data transformation

Stars: ✭ 44 (+175%)

Mutual labels: etl

nwabap-ui5uploader

This module allows a developer to upload SAPUI5/OpenUI5 sources into a SAP NetWeaver ABAP system.

Stars: ✭ 15 (-6.25%)

Mutual labels: pipeline

KoELECTRA-Pipeline

Transformers Pipeline with KoELECTRA

Stars: ✭ 37 (+131.25%)

Mutual labels: pipeline

hlatyping

Precision HLA typing from next-generation sequencing data

Stars: ✭ 28 (+75%)

Mutual labels: pipeline

go-pdu

Parallel Digital Universe - A decentralized social networking service

Stars: ✭ 39 (+143.75%)

Mutual labels: dag

google classroom

Google Classroom Data Pipeline

Stars: ✭ 17 (+6.25%)

Mutual labels: pipeline

Atlas auto setline

A tool that automatically takes unusable slave nodes offline, or brings them back online, in the open-source Atlas software

Stars: ✭ 47 (+193.75%)

Mutual labels: atlas

web-click-flow

Offline log analysis of website clickstreams

Stars: ✭ 14 (-12.5%)

Mutual labels: etl

oic-options-chains

ETL for OIC Options Chains

Stars: ✭ 22 (+37.5%)

Mutual labels: etl

DataEngineering

This repo contains commands that data engineers use in day to day work.

Stars: ✭ 47 (+193.75%)

Mutual labels: pyspark

MTBseq source

MTBseq is an automated pipeline for mapping, variant calling and detection of resistance-mediating and phylogenetic variants from Illumina whole genome sequence data of Mycobacterium tuberculosis complex isolates.

Stars: ✭ 26 (+62.5%)

Mutual labels: pipeline

taxid-changelog

NCBI taxonomic identifier (taxid) changelog, including taxid deletions, additions, merges, reuse, and rank/name changes.

Stars: ✭ 13 (-18.75%)

Mutual labels: lineage

rivery cli

Rivery CLI

Stars: ✭ 16 (+0%)

Mutual labels: etl

dflib

In-memory Java DataFrame library