- Snowplow: The enterprise-grade behavioral data engine (web, mobile, server-side, webhooks), running cloud-natively on AWS and GCP.
- trembita: Model complex data transformation pipelines easily.
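To illustrate the pipeline-modelling idea behind entries like trembita, here is a minimal, generic Python sketch of composing transformation stages into one callable; the stage names and sample data are hypothetical and this is not trembita's actual API.

```python
from functools import reduce

def pipeline(*stages):
    """Compose single-argument stages into one callable, applied left to right."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)

# Hypothetical stages illustrating a small transformation pipeline.
clean = pipeline(
    lambda rows: [r.strip() for r in rows],   # normalize whitespace
    lambda rows: [r for r in rows if r],      # drop empty rows
    lambda rows: [r.lower() for r in rows],   # canonical casing
)

print(clean(["  Foo ", "", "BAR"]))  # ['foo', 'bar']
```

Because each stage is just a function, pipelines built this way compose: a `pipeline(...)` result can itself be used as a stage in a larger pipeline.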
- network-pipeline: Network traffic data pipeline for real-time predictions and for building datasets for deep neural networks.
- augraphy: Augmentation pipeline for rendering synthetic paper printing, faxing, scanning, and copy-machine processes.
- richflow: A Node.js/JavaScript library for synchronous data pipeline processing, data sharing, and stream processing, with actionable and transformable pipeline data.
- jobAnalytics: Analytics and search system that consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
- ATOM: Automated Tool for Optimized Modelling.
- datalake-etl-pipeline: Simplified ETL in Hadoop using Apache Spark, with a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.
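The extract/validate/transform/load shape that such ETL projects implement on Spark can be sketched in plain Python; this is a generic, self-contained analogue with hypothetical data and function names, not the repo's Spark API.

```python
import csv
import io

# Hypothetical source data for the sketch.
raw = "id,amount\n1,10.5\n2,\n3,7.25\n"

def extract(text):
    """Extract: parse CSV text into dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def validate(rows):
    """Validate: drop rows missing a value in the required 'amount' column."""
    return [r for r in rows if r["amount"]]

def transform(rows):
    """Transform: cast columns to their proper types."""
    return [{"id": int(r["id"]), "amount": float(r["amount"])} for r in rows]

# "Load" into an in-memory target; a real pipeline would write to the data lake.
target = transform(validate(extract(raw)))
print(target)  # [{'id': 1, 'amount': 10.5}, {'id': 3, 'amount': 7.25}]
```

In a Spark version, each step would operate on a DataFrame instead of a list, but the staged structure is the same.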
- bulkstash: Bulk Stash is a Docker rclone service to sync or copy files between different storage services. For example, you can copy files to or from a remote storage service such as Amazon S3 to Google Cloud Storage, or locally from your laptop to remote storage.
- saisoku: A Python module that helps you build complex pipelines of batch file/directory transfer and sync jobs.
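The kind of batch transfer job these tools orchestrate can be sketched with the standard library; this is a minimal one-way sync (copy files missing from the destination) using a hypothetical `sync_copy` helper, not saisoku's actual API.

```python
import shutil
import tempfile
from pathlib import Path

def sync_copy(src: Path, dst: Path):
    """Copy files from src that are missing in dst (a minimal one-way sync job)."""
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in src.iterdir():
        if f.is_file() and not (dst / f.name).exists():
            shutil.copy2(f, dst / f.name)  # preserves metadata, unlike copy()
            copied.append(f.name)
    return copied

# Demo with temporary directories.
src, dst = Path(tempfile.mkdtemp()), Path(tempfile.mkdtemp())
(src / "a.txt").write_text("hello")
(dst / "b.txt").write_text("already there")
print(sorted(sync_copy(src, dst)))  # ['a.txt']
```

A production sync job would also compare checksums or modification times to pick up changed files, not just missing ones.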
- datajob: Build and deploy a serverless data pipeline on AWS with no effort.
- aws-pdf-textract-pipeline: 🔍 Data pipeline for crawling PDFs from the web and transforming their contents into structured data using AWS Textract. Built with AWS CDK and TypeScript.
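The crawling step of a pipeline like this amounts to finding PDF links in fetched pages; here is a self-contained sketch of that step using Python's stdlib HTML parser, with a hypothetical page snippet. It is only an illustration of the idea, not the repo's TypeScript implementation.

```python
from html.parser import HTMLParser

class PdfLinkFinder(HTMLParser):
    """Collect href targets ending in .pdf: the 'crawl' step of such a pipeline."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(".pdf"):
                    self.links.append(value)

# Hypothetical fetched page.
page = '<a href="/reports/q1.pdf">Q1</a> <a href="/about">About</a>'
finder = PdfLinkFinder()
finder.feed(page)
print(finder.links)  # ['/reports/q1.pdf']
```

Each discovered link would then be downloaded and handed to Textract for text and table extraction.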