- Snowplow: The enterprise-grade behavioral data engine (web, mobile, server-side, webhooks), running cloud-natively on AWS and GCP.
- trembita: Model complex data transformation pipelines easily.
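To illustrate the pipeline-modelling idea behind entries like trembita, here is a minimal, generic Python sketch of composing transformation stages into one callable; the stage names and sample data are hypothetical and this is not trembita's actual API.

```python
from functools import reduce

def pipeline(*stages):
    """Compose single-argument stages into one callable, applied left to right."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)

# Hypothetical stages illustrating a small transformation pipeline.
clean = pipeline(
    lambda rows: [r.strip() for r in rows],   # normalize whitespace
    lambda rows: [r for r in rows if r],      # drop empty rows
    lambda rows: [r.lower() for r in rows],   # canonical casing
)

print(clean(["  Foo ", "", "BAR"]))  # ['foo', 'bar']
```

Because each stage is just a function, pipelines built this way compose: a `pipeline(...)` result can itself be used as a stage in a larger pipeline.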
- network-pipeline: Network traffic data pipeline for real-time predictions and for building datasets for deep neural networks.
- augraphy: Augmentation pipeline for rendering synthetic paper printing, faxing, scanning, and copy-machine processes.
- richflow: A Node.js/JavaScript library for synchronous data pipeline processing, data sharing, and stream processing, with actionable and transformable pipeline data.
- jobAnalytics: Analytics and search system that consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
- ATOM: Automated Tool for Optimized Modelling.
- datalake-etl-pipeline: Simplified ETL in Hadoop using Apache Spark, with a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.
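The extract/validate/transform/load shape that such ETL projects implement on Spark can be sketched in plain Python; this is a generic, self-contained analogue with hypothetical data and function names, not the repo's Spark API.

```python
import csv
import io

# Hypothetical source data for the sketch.
raw = "id,amount\n1,10.5\n2,\n3,7.25\n"

def extract(text):
    """Extract: parse CSV text into dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def validate(rows):
    """Validate: drop rows missing a value in the required 'amount' column."""
    return [r for r in rows if r["amount"]]

def transform(rows):
    """Transform: cast columns to their proper types."""
    return [{"id": int(r["id"]), "amount": float(r["amount"])} for r in rows]

# "Load" into an in-memory target; a real pipeline would write to the data lake.
target = transform(validate(extract(raw)))
print(target)  # [{'id': 1, 'amount': 10.5}, {'id': 3, 'amount': 7.25}]
```

In a Spark version, each step would operate on a DataFrame instead of a list, but the staged structure is the same.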
- bulkstash: Bulk Stash is a Docker rclone service to sync or copy files between different storage services. For example, you can copy files to or from a remote storage service such as Amazon S3 to Google Cloud Storage, or locally from your laptop to remote storage.
- saisoku: A Python module that helps you build complex pipelines of batch file/directory transfer and sync jobs.
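The kind of batch transfer job these tools orchestrate can be sketched with the standard library; this is a minimal one-way sync (copy files missing from the destination) using a hypothetical `sync_copy` helper, not saisoku's actual API.

```python
import shutil
import tempfile
from pathlib import Path

def sync_copy(src: Path, dst: Path):
    """Copy files from src that are missing in dst (a minimal one-way sync job)."""
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in src.iterdir():
        if f.is_file() and not (dst / f.name).exists():
            shutil.copy2(f, dst / f.name)  # preserves metadata, unlike copy()
            copied.append(f.name)
    return copied

# Demo with temporary directories.
src, dst = Path(tempfile.mkdtemp()), Path(tempfile.mkdtemp())
(src / "a.txt").write_text("hello")
(dst / "b.txt").write_text("already there")
print(sorted(sync_copy(src, dst)))  # ['a.txt']
```

A production sync job would also compare checksums or modification times to pick up changed files, not just missing ones.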
- datajob: Build and deploy a serverless data pipeline on AWS with no effort.
- aws-pdf-textract-pipeline: 🔍 Data pipeline for crawling PDFs from the web and transforming their contents into structured data using AWS Textract. Built with AWS CDK and TypeScript.
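The crawling step of a pipeline like this amounts to finding PDF links in fetched pages; here is a self-contained sketch of that step using Python's stdlib HTML parser, with a hypothetical page snippet. It is only an illustration of the idea, not the repo's TypeScript implementation.

```python
from html.parser import HTMLParser

class PdfLinkFinder(HTMLParser):
    """Collect href targets ending in .pdf: the 'crawl' step of such a pipeline."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(".pdf"):
                    self.links.append(value)

# Hypothetical fetched page.
page = '<a href="/reports/q1.pdf">Q1</a> <a href="/about">About</a>'
finder = PdfLinkFinder()
finder.feed(page)
print(finder.links)  # ['/reports/q1.pdf']
```

Each discovered link would then be downloaded and handed to Textract for text and table extraction.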