Data Engineering Howto
A list of useful resources to learn Data Engineering from scratch
Stars: ✭ 2,056 (+1769.09%)
AirflowETL
Blog post on ETL pipelines with Airflow
Stars: ✭ 20 (-81.82%)
jobAnalytics and search
JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
Stars: ✭ 25 (-77.27%)
Soda Sql
Metric collection, data testing and monitoring for SQL accessible data
Stars: ✭ 173 (+57.27%)
soda-spark
Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes
Stars: ✭ 58 (-47.27%)
ob bulkstash
Bulk Stash is a docker rclone service to sync, or copy, files between different storage services. For example, you can copy files from a remote storage service such as Amazon S3 to Google Cloud Storage, or from your laptop to remote storage.
Stars: ✭ 113 (+2.73%)
Spark Alchemy
Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive
Stars: ✭ 122 (+10.91%)
Applied Ml
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
Stars: ✭ 17,824 (+16103.64%)
Gspread Pandas
A package to easily open an instance of a Google spreadsheet and interact with worksheets through Pandas DataFrames.
Stars: ✭ 226 (+105.45%)
Yuniql
Free and open source schema versioning and database migration made natively with .NET Core.
Stars: ✭ 156 (+41.82%)
deordie-meetups
DE or DIE meetup made by data engineers for data engineers. Currently in Russian only.
Stars: ✭ 48 (-56.36%)
Gcp Data Engineer Exam
Study materials for the Google Cloud Professional Data Engineering Exam
Stars: ✭ 144 (+30.91%)
qsv
CSVs sliced, diced & analyzed.
Stars: ✭ 438 (+298.18%)
Butterfree
A tool for building feature stores.
Stars: ✭ 126 (+14.55%)
Just Dashboard
📊 📋 Dashboards using YAML or JSON files
Stars: ✭ 1,511 (+1273.64%)
Superset
Apache Superset is a Data Visualization and Data Exploration Platform
Stars: ✭ 42,634 (+38658.18%)
contessa
Easy way to define, execute and store quality rules for your data.
Stars: ✭ 17 (-84.55%)
Setl
A simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (-28.18%)
awesome-dbt
A curated list of awesome dbt resources
Stars: ✭ 520 (+372.73%)
Ansible Playbook
Ansible playbook to deploy distributed technologies
Stars: ✭ 61 (-44.55%)
Quilt
Quilt is a self-organizing data hub for S3
Stars: ✭ 1,007 (+815.45%)
Ploomber
A convention over configuration workflow orchestrator. Develop locally (Jupyter or your favorite editor), deploy to Airflow or Kubernetes.
Stars: ✭ 221 (+100.91%)
saisoku
Saisoku is a Python module that helps you build complex pipelines of batch file/directory transfer/sync jobs.
Stars: ✭ 40 (-63.64%)
Auptimizer
An automatic ML model optimization tool.
Stars: ✭ 166 (+50.91%)
lrmr
Less-Resilient MapReduce framework for Go
Stars: ✭ 32 (-70.91%)
Geni
A Clojure dataframe library that runs on Spark
Stars: ✭ 152 (+38.18%)
datart
Datart is a next-generation open platform for data visualization
Stars: ✭ 1,042 (+847.27%)
etl
[READ-ONLY] PHP - ETL (Extract Transform Load) data processing library
Stars: ✭ 279 (+153.64%)
airflow-dbt-python
A collection of Airflow operators, hooks, and utilities to elevate dbt to a first-class citizen of Airflow.
Stars: ✭ 111 (+0.91%)
Data Science On Gcp
Source code accompanying the book Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Stars: ✭ 864 (+685.45%)
Accelerator
The Accelerator is a tool for fast and reproducible processing of large amounts of data.
Stars: ✭ 137 (+24.55%)
machine-learning-data-pipeline
Pipeline module for parallel real-time data processing, for both development and production use of machine learning models.
Stars: ✭ 22 (-80%)
Pipelinex
PipelineX: Python package to build ML pipelines for experimentation with Kedro, MLflow, and more
Stars: ✭ 127 (+15.45%)
dc-sdk-js
A data collection SDK for the browser environment
Stars: ✭ 52 (-52.73%)
Aws Data Wrangler
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Stars: ✭ 2,385 (+2068.18%)
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Provides a complete ETL pipeline for a data lake, including SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.
Stars: ✭ 39 (-64.55%)
D6t Python
Accelerate data science
Stars: ✭ 118 (+7.27%)
datajob
Build and deploy a serverless data pipeline on AWS with no effort.
Stars: ✭ 101 (-8.18%)
get smarties
Dummy variable generation with fit/transform capabilities
Stars: ✭ 23 (-79.09%)
aws-pdf-textract-pipeline
🔍 Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS textract. Built with AWS CDK + TypeScript
Stars: ✭ 141 (+28.18%)
Sayn
Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).
Stars: ✭ 79 (-28.18%)
morph-kgc
Powerful RDF Knowledge Graph Generation with [R2]RML Mappings
Stars: ✭ 77 (-30%)
Waimak
Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
Stars: ✭ 60 (-45.45%)
scicloj.ml
A Clojure machine learning library
Stars: ✭ 152 (+38.18%)
Dbt Sqlserver
dbt adapter for SQL Server and Azure SQL
Stars: ✭ 41 (-62.73%)
Everything-Tech
A collection of online resources to help you on your Tech journey.
Stars: ✭ 396 (+260%)
Lakefs
Git-like capabilities for your object storage
Stars: ✭ 847 (+670%)
Every Single Day I Tldr
A daily digest of articles and videos I've found interesting and want to share with you.
Stars: ✭ 249 (+126.36%)
polygon-etl
ETL (extract, transform and load) tools for ingesting Polygon blockchain data to Google BigQuery and Pub/Sub
Stars: ✭ 53 (-51.82%)
prefect-saturn
Python client for using Prefect Cloud with Saturn Cloud
Stars: ✭ 15 (-86.36%)
papilo
DEPRECATED: Stream data processing micro-framework
Stars: ✭ 24 (-78.18%)