Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser.

Stars: ✭ 25 (-96.05%)

Mutual labels: spark, etl, pyspark

Spark Alchemy

Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive

Stars: ✭ 122 (-80.73%)

Mutual labels: data-science, spark, data-engineering

Aws Data Wrangler

Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

Stars: ✭ 2,385 (+276.78%)

Mutual labels: data-science, etl, data-engineering

Cql

Categorical Query Language IDE

Stars: ✭ 196 (-69.04%)

Mutual labels: data-science, etl

Gspread Pandas

A package to easily open an instance of a Google spreadsheet and interact with worksheets through Pandas DataFrames.

Stars: ✭ 226 (-64.3%)

Mutual labels: data-science, data-engineering

Mydatascienceportfolio

Applying Data Science and Machine Learning to Solve Real World Business Problems

Stars: ✭ 227 (-64.14%)

Mutual labels: data-science, spark

hive-metastore-client

A client for connecting and running DDLs on hive metastore.

Stars: ✭ 37 (-94.15%)

Mutual labels: etl, data-engineering

Ploomber

A convention over configuration workflow orchestrator. Develop locally (Jupyter or your favorite editor), deploy to Airflow or Kubernetes.

Stars: ✭ 221 (-65.09%)

Mutual labels: data-science, data-engineering

etl

[READ-ONLY] PHP - ETL (Extract Transform Load) data processing library

Stars: ✭ 279 (-55.92%)

Mutual labels: etl, data-engineering

morph-kgc

Powerful RDF Knowledge Graph Generation with [R2]RML Mappings

Stars: ✭ 77 (-87.84%)

Mutual labels: etl, data-engineering

python mozetl

ETL jobs for Firefox Telemetry

Stars: ✭ 25 (-96.05%)

Mutual labels: etl, pyspark

hamilton

A scalable general-purpose micro-framework for defining dataflows. You can use it to create dataframes, NumPy matrices, Python objects, ML models, etc.

Stars: ✭ 612 (-3.32%)

Mutual labels: etl, data-engineering

gallia-core

A schema-aware Scala library for data transformation

Stars: ✭ 44 (-93.05%)

Mutual labels: etl, data-engineering

lineage

Generate beautiful documentation for your data pipelines in markdown format

Stars: ✭ 16 (-97.47%)

Mutual labels: etl, pyspark

datalake-etl-pipeline

Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations

Stars: ✭ 39 (-93.84%)

Mutual labels: etl, pyspark

jobAnalytics and search

JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.

Stars: ✭ 25 (-96.05%)

Mutual labels: pyspark, data-engineering

ODSC India 2018

My presentation at ODSC India 2018 about Deep Learning with Apache Spark

Stars: ✭ 26 (-95.89%)

Mutual labels: spark, pyspark

pangeo-forge-recipes

Python library for building Pangeo Forge recipes.

Stars: ✭ 64 (-89.89%)

Mutual labels: etl, data-engineering

Soda Sql

Metric collection, data testing and monitoring for SQL accessible data

Stars: ✭ 173 (-72.67%)

Mutual labels: data-science, data-engineering

Elastic

R client for the Elasticsearch HTTP API

Stars: ✭ 227 (-64.14%)

Mutual labels: data-science, etl

Auptimizer

An automatic ML model optimization tool.

Stars: ✭ 166 (-73.78%)

Mutual labels: data-science, data-engineering

AirflowETL

Blog post on ETL pipelines with Airflow

Stars: ✭ 20 (-96.84%)

Mutual labels: etl, data-engineering

Koalas

Koalas: pandas API on Apache Spark

Stars: ✭ 3,044 (+380.88%)

Mutual labels: data-science, spark

soda-spark

Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes

Stars: ✭ 58 (-90.84%)

Mutual labels: pyspark, data-engineering

Scalable Data Science Platform

Content for architecting a data science platform for products using Luigi, Spark & Flask.

Stars: ✭ 158 (-75.04%)

Mutual labels: data-science, spark

uptasticsearch

An Elasticsearch client tailored to data science workflows.

Stars: ✭ 47 (-92.58%)

Mutual labels: etl, data-engineering

blockchain-etl-streaming

Streaming Ethereum and Bitcoin blockchain data to Google Pub/Sub or Postgres in Kubernetes

Stars: ✭ 57 (-91%)

Mutual labels: etl, data-engineering

versatile-data-kit

Versatile Data Kit (VDK) is an open source framework that enables anybody with basic SQL or Python knowledge to create their own data pipelines.

Stars: ✭ 144 (-77.25%)

Mutual labels: etl, data-engineering

polygon-etl

ETL (extract, transform and load) tools for ingesting Polygon blockchain data to Google BigQuery and Pub/Sub

Stars: ✭ 53 (-91.63%)

Mutual labels: etl, data-engineering

DataEngineering

This repo contains commands that data engineers use in day-to-day work.

Stars: ✭ 47 (-92.58%)

Mutual labels: pyspark, data-engineering

sparklanes

A lightweight data processing framework for Apache Spark

Stars: ✭ 17 (-97.31%)

Mutual labels: etl, pyspark

data processing course

Some class materials for a data processing course using PySpark

Stars: ✭ 50 (-92.1%)

Mutual labels: spark, pyspark

spark-extension

A library that provides useful extensions to Apache Spark and PySpark.

Stars: ✭ 25 (-96.05%)

Mutual labels: spark, pyspark

beneath

Beneath is a serverless real-time data platform ⚡️

Stars: ✭ 65 (-89.73%)

Mutual labels: etl, data-engineering

arthur-redshift-etl

ELT Code for your Data Warehouse

Stars: ✭ 22 (-96.52%)

Mutual labels: etl, data-engineering

aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

Stars: ✭ 111 (-82.46%)

Mutual labels: spark, pyspark

data-algorithms-with-spark

O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian

Stars: ✭ 34 (-94.63%)

Mutual labels: spark, pyspark

AirflowDataPipeline

Example of an ETL Pipeline using Airflow

Stars: ✭ 24 (-96.21%)

Mutual labels: etl, data-engineering

Around Dataengineering

A Data Engineering & Machine Learning Knowledge Hub

Stars: ✭ 257 (-59.4%)

Mutual labels: spark, data-engineering

Sk Dist

Distributed scikit-learn meta-estimators in PySpark

Stars: ✭ 260 (-58.93%)

Mutual labels: data-science, spark

Datavec

ETL Library for Machine Learning - data pipelines, data munging and wrangling

Stars: ✭ 272 (-57.03%)

Mutual labels: spark, etl

Spark Notebook

Interactive and Reactive Data Science using Scala and Spark.

Stars: ✭ 3,081 (+386.73%)

Mutual labels: data-science, spark

incubator-linkis

Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB...), exposes various interfaces (REST, JDBC, Java...), with multi-tenancy, high performance, and resource control.

Stars: ✭ 2,459 (+288.47%)

Mutual labels: spark, pyspark

etl manager

A Python package to create a database on the platform using our moj data warehousing framework

Stars: ✭ 14 (-97.79%)

Mutual labels: etl, data-engineering

Benthos

Fancy stream processing made operationally mundane

Stars: ✭ 3,705 (+485.31%)

Mutual labels: etl, data-engineering

H2o 3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Stars: ✭ 5,656 (+793.52%)

Mutual labels: data-science, spark

Metorikku

A simplified, lightweight ETL Framework based on Apache Spark

Stars: ✭ 361 (-42.97%)

Mutual labels: spark, etl

Learn Something Every Day

📝 A compilation of everything that I learn; Computer Science, Software Development, Engineering, Math, and Coding in General.

Stars: ✭ 362 (-42.81%)

Mutual labels: data-science, data-engineering

Wedatasphere

WeDataSphere is a financial level one-stop open-source suitcase for big data platforms. Currently the source code of Scriptis and Linkis has already been released to the open-source community. WeDataSphere, Big Data Made Easy!

Stars: ✭ 372 (-41.23%)

Mutual labels: spark, etl

Devops Python Tools

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

Stars: ✭ 406 (-35.86%)

Mutual labels: spark, pyspark

Dataform

Dataform is a framework for managing SQL based data operations in BigQuery, Snowflake, and Redshift

Stars: ✭ 342 (-45.97%)

Mutual labels: etl, data-engineering

1-60 of 1517 similar projects
