basinBasin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser
Stars: ✭ 25 (+56.25%)
sparklanesA lightweight data processing framework for Apache Spark
Stars: ✭ 17 (+6.25%)
Metlmito ETL tool
Stars: ✭ 153 (+856.25%)
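mydataharbor🇨🇳 MyDataHarbor is a distributed, highly scalable, high-performance, transaction-level data synchronization middleware designed to move data from any source to any destination. It helps users synchronize massive data volumes reliably, quickly, and stably, using either near-real-time incremental sync or scheduled full sync. It is aimed primarily at real-time transaction systems and can also be used for big-data synchronization (ETL).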
Stars: ✭ 28 (+75%)
Go StreamsA lightweight stream processing library for Go
Stars: ✭ 615 (+3743.75%)
Bulk WriterProvides guidance for fast ETL jobs, an IDataReader implementation for SqlBulkCopy (or the MySql or Oracle equivalents) that wraps an IEnumerable, and libraries for mapping entities to table columns.
Stars: ✭ 210 (+1212.5%)
python mozetlETL jobs for Firefox Telemetry
Stars: ✭ 25 (+56.25%)
SetlA simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (+393.75%)
Morphl Community EditionMorphL Community Edition uses big data and machine learning to predict user behaviors in digital products and services with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization
Stars: ✭ 253 (+1481.25%)
Example Airflow DagsExample DAGs using hooks and operators from Airflow Plugins
Stars: ✭ 243 (+1418.75%)
ServingA flexible, high-performance carrier for machine learning models (PaddlePaddle serving deployment framework)
Stars: ✭ 403 (+2418.75%)
AirbyteAirbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
Stars: ✭ 4,919 (+30643.75%)
etlM-Lab ingestion pipeline
Stars: ✭ 15 (-6.25%)
DatavecETL Library for Machine Learning - data pipelines, data munging and wrangling
Stars: ✭ 272 (+1600%)
ButterfreeA tool for building feature stores.
Stars: ✭ 126 (+687.5%)
naas⚙️ Schedule notebooks, run them like APIs, securely expose your assets: Jupyter as a viable ⚡️ Production environment
Stars: ✭ 219 (+1268.75%)
Mara PipelinesA lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
Stars: ✭ 1,841 (+11406.25%)
Aws Ecs AirflowRun Airflow in AWS ECS(Elastic Container Service) using Fargate tasks
Stars: ✭ 107 (+568.75%)
Pyspark Example ProjectExample project implementing best practices for PySpark ETL jobs and applications.
Stars: ✭ 633 (+3856.25%)
StetlStetl, Streaming ETL, is a lightweight geospatial processing and ETL framework written in Python.
Stars: ✭ 64 (+300%)
datalake-etl-pipelineSimplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (+143.75%)
hamiltonA scalable general purpose micro-framework for defining dataflows. You can use it to create dataframes, numpy matrices, python objects, ML models, etc.
Stars: ✭ 612 (+3725%)
cobra-policytoolManage Apache Atlas and Ranger configuration for your Hadoop environment.
Stars: ✭ 16 (+0%)
dagSimple DSL for executing functions in Go
Stars: ✭ 85 (+431.25%)
bump-everywhere🚀 Automate versioning, changelog creation, README updates and GitHub releases using GitHub Actions, npm, docker or bash.
Stars: ✭ 24 (+50%)
kubecryptHelper for dealing with secrets in kubernetes.
Stars: ✭ 23 (+43.75%)
pipe-traitMake it possible to chain regular functions
Stars: ✭ 22 (+37.5%)
textureatlasA simple, cross-platform Python-based tool and C library for creating and using a texture atlas in your application or game. Distributed under the terms of the MIT license.
Stars: ✭ 20 (+25%)
persistityA persistence framework for game developers
Stars: ✭ 34 (+112.5%)
flamingoFreeCAD - flamingo workbench
Stars: ✭ 30 (+87.5%)
dnaPipeTEdnaPipeTE (for de-novo assembly & annotation Pipeline for Transposable Elements) is a pipeline designed to find, annotate and quantify Transposable Elements in small samples of NGS datasets. It is very useful for quantifying the proportion of TEs in newly sequenced genomes since it does not require genome assembly and works on small datasets (< 1X).
Stars: ✭ 28 (+75%)
katana-skipperSimple and flexible ML workflow engine
Stars: ✭ 234 (+1362.5%)
kafka-connect-datagenA Kafka Connect source connector that generates data for tests
Stars: ✭ 27 (+68.75%)
dswarman open-source data management platform for knowledge workers (https://github.com/dswarm/dswarm-documentation/wiki)
Stars: ✭ 57 (+256.25%)
check-engineData validation library for PySpark 3.0.0
Stars: ✭ 29 (+81.25%)
bacannotGeneric but comprehensive pipeline for prokaryotic genome annotation and interrogation with interactive reports and shiny app.
Stars: ✭ 51 (+218.75%)
gallia-coreA schema-aware Scala library for data transformation
Stars: ✭ 44 (+175%)
nwabap-ui5uploaderThis module allows a developer to upload SAPUI5/OpenUI5 sources into a SAP NetWeaver ABAP system.
Stars: ✭ 15 (-6.25%)
hlatypingPrecision HLA typing from next-generation sequencing data
Stars: ✭ 28 (+75%)
go-pduParallel Digital Universe - A decentralized social networking service
Stars: ✭ 39 (+143.75%)
Atlas auto setlinea tool for automatically setting unusable slave nodes offline/online in the Atlas open source software
Stars: ✭ 47 (+193.75%)
DataEngineeringThis repo contains commands that data engineers use in day to day work.
Stars: ✭ 47 (+193.75%)
MTBseq sourceMTBseq is an automated pipeline for mapping, variant calling and detection of resistance-mediating and phylogenetic variants from Illumina whole genome sequence data of Mycobacterium tuberculosis complex isolates.
Stars: ✭ 26 (+62.5%)
taxid-changelogNCBI taxonomic identifier (taxid) changelog, including taxid deletions, additions, merges, reuse, and rank/name changes.
Stars: ✭ 13 (-18.75%)
dflibIn-memory Java DataFrame library
Stars: ✭ 50 (+212.5%)
versatile-data-kitVersatile Data Kit (VDK) is an open source framework that enables anybody with basic SQL or Python knowledge to create their own data pipelines.
Stars: ✭ 144 (+800%)
jobAnalytics and searchJobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
Stars: ✭ 25 (+56.25%)
get phylomarkersA pipeline to select optimal markers for microbial phylogenomics and species tree estimation using coalescent and concatenation approaches
Stars: ✭ 34 (+112.5%)
gunpowderA library to facilitate machine learning on multi-dimensional images.
Stars: ✭ 40 (+150%)
swarmciSwarm CI - Docker Swarm-based CI system or enhancement to existing systems.
Stars: ✭ 48 (+200%)
rnafusionRNA-seq analysis pipeline for detection of gene fusions
Stars: ✭ 72 (+350%)
SynapseMLSimple and Distributed Machine Learning
Stars: ✭ 3,355 (+20868.75%)