basinBasin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser.
Stars: ✭ 25 (+47.06%)
lineageGenerate beautiful documentation for your data pipelines in markdown format
Stars: ✭ 16 (-5.88%)
DatavecETL Library for Machine Learning - data pipelines, data munging and wrangling
Stars: ✭ 272 (+1500%)
Bulk WriterProvides guidance for fast ETL jobs, an IDataReader implementation for SqlBulkCopy (or the MySql or Oracle equivalents) that wraps an IEnumerable, and libraries for mapping entities to table columns.
Stars: ✭ 210 (+1135.29%)
machine-learning-data-pipelinePipeline module for parallel real-time data processing for machine learning models development and production purposes.
Stars: ✭ 22 (+29.41%)
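mydataharbor🇨🇳 MyDataHarbor is a distributed, highly scalable, high-performance, transaction-level data synchronization middleware dedicated to moving data from any data source to any other data source. It helps users reliably, quickly, and stably run near-real-time incremental synchronization or scheduled full synchronization over massive data sets. It is aimed primarily at real-time transaction systems and can also be used for big-data synchronization (the ETL field).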
Stars: ✭ 28 (+64.71%)
Go StreamsA lightweight stream processing library for Go
Stars: ✭ 615 (+3517.65%)
prostoProsto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby
Stars: ✭ 54 (+217.65%)
dropEstPipeline for initial analysis of droplet-based single-cell RNA-seq data
Stars: ✭ 71 (+317.65%)
StetlStetl, Streaming ETL, is a lightweight geospatial processing and ETL framework written in Python.
Stars: ✭ 64 (+276.47%)
ForteForte is a flexible and powerful NLP builder FOR TExt. This is part of the CASL project: http://casl-project.ai/
Stars: ✭ 89 (+423.53%)
AirbyteAirbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
Stars: ✭ 4,919 (+28835.29%)
Metlmito ETL tool
Stars: ✭ 153 (+800%)
python mozetlETL jobs for Firefox Telemetry
Stars: ✭ 25 (+47.06%)
Pyspark Example ProjectExample project implementing best practices for PySpark ETL jobs and applications.
Stars: ✭ 633 (+3623.53%)
skippaSciKIt-learn Pipeline in PAndas
Stars: ✭ 33 (+94.12%)
etl[READ-ONLY] PHP - ETL (Extract Transform Load) data processing library
Stars: ✭ 279 (+1541.18%)
naas⚙️ Schedule notebooks, run them like APIs, securely expose your assets: Jupyter as a viable ⚡️ Production environment
Stars: ✭ 219 (+1188.24%)
ButterfreeA tool for building feature stores.
Stars: ✭ 126 (+641.18%)
etlM-Lab ingestion pipeline
Stars: ✭ 15 (-11.76%)
SetlA simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (+364.71%)
Morphl Community EditionMorphL Community Edition uses big data and machine learning to predict user behaviors in digital products and services with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization
Stars: ✭ 253 (+1388.24%)
Mara PipelinesA lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
Stars: ✭ 1,841 (+10729.41%)
datalake-etl-pipelineSimplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (+129.41%)
SeqToolsA python library to manipulate and transform indexable data (lists, arrays, ...)
Stars: ✭ 42 (+147.06%)
dflibIn-memory Java DataFrame library
Stars: ✭ 50 (+194.12%)
tracemlEngine for ML/Data tracking, visualization, dashboards, and model UI for Polyaxon.
Stars: ✭ 445 (+2517.65%)
jobAnalytics and searchThe JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
Stars: ✭ 25 (+47.06%)
NGI-RNAseqNextflow RNA-Seq Best Practice analysis pipeline, used at the SciLifeLab National Genomics Infrastructure.
Stars: ✭ 50 (+194.12%)
versatile-data-kitVersatile Data Kit (VDK) is an open source framework that enables anybody with basic SQL or Python knowledge to create their own data pipelines.
Stars: ✭ 144 (+747.06%)
get phylomarkersA pipeline to select optimal markers for microbial phylogenomics and species tree estimation using coalescent and concatenation approaches
Stars: ✭ 34 (+100%)
emg-viral-pipelineVIRify: detection of phages and eukaryotic viruses from metagenomic and metatranscriptomic assemblies
Stars: ✭ 38 (+123.53%)
gunpowderA library to facilitate machine learning on multi-dimensional images.
Stars: ✭ 40 (+135.29%)
stargateAn Apache Pulsar client written in Elixir
Stars: ✭ 33 (+94.12%)
bacannotGeneric but comprehensive pipeline for prokaryotic genome annotation and interrogation with interactive reports and shiny app.
Stars: ✭ 51 (+200%)
cqClojure Command-line Data Processor for JSON, YAML, EDN, XML and more
Stars: ✭ 111 (+552.94%)
SynapseMLSimple and Distributed Machine Learning
Stars: ✭ 3,355 (+19635.29%)
kubecryptHelper for dealing with secrets in kubernetes.
Stars: ✭ 23 (+35.29%)
biojupiesAutomated generation of tailored bioinformatics Jupyter Notebooks via a user interface.
Stars: ✭ 96 (+464.71%)
golang-docker-exampleAn example of how to run a Golang project in Docker in a Buildkite pipeline
Stars: ✭ 18 (+5.88%)
Speech-RecognitionEnd-to-end Automatic Speech Recognition for Mandarin and English in TensorFlow
Stars: ✭ 21 (+23.53%)
bump-everywhere🚀 Automate versioning, changelog creation, README updates and GitHub releases using GitHub Actions, npm, Docker or bash.
Stars: ✭ 24 (+41.18%)
howtheydevopsA curated collection of publicly available resources on how companies around the world practice DevOps
Stars: ✭ 318 (+1770.59%)
bonobo-sqlalchemyPREVIEW - SQL databases in Bonobo, using sqlalchemy
Stars: ✭ 23 (+35.29%)
MLLabelUtils.jlUtility package for working with classification targets and label-encodings
Stars: ✭ 30 (+76.47%)
rnafusionRNA-seq analysis pipeline for detecting gene fusions
Stars: ✭ 72 (+323.53%)
persistityA persistence framework for game developers
Stars: ✭ 34 (+100%)
pipe-traitMake it possible to chain regular functions
Stars: ✭ 22 (+29.41%)
PDAP-ScrapersCode relating to scraping public police data.
Stars: ✭ 72 (+323.53%)