Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

Stars: ✭ 2,385 (-58.94%)

Mutual labels: data-science, data-engineering

Steppy

Lightweight Python library for fast and reproducible experimentation 🔬

Stars: ✭ 119 (-97.95%)

Mutual labels: data-science, pipeline

Butterfree

A tool for building feature stores.

Stars: ✭ 126 (-97.83%)

Mutual labels: data-science, data-engineering

Dataexplorer

Automate Data Exploration and Treatment

Stars: ✭ 362 (-93.77%)

Mutual labels: data-science, eda

Batchflow

BatchFlow helps you conveniently work with random or sequential batches of your data and define data processing and machine learning workflows even for datasets that do not fit into memory.

Stars: ✭ 156 (-97.31%)

Mutual labels: data-science, pipeline

Geni

A Clojure dataframe library that runs on Spark

Stars: ✭ 152 (-97.38%)

Mutual labels: data-science, data-engineering

Learn Something Every Day

📝 A compilation of everything that I learn: Computer Science, Software Development, Engineering, Math, and Coding in General.

Stars: ✭ 362 (-93.77%)

Mutual labels: data-science, data-engineering

Soda Sql

Metric collection, data testing and monitoring for SQL accessible data

Stars: ✭ 173 (-97.02%)

Mutual labels: data-science, data-engineering

Accelerator

The Accelerator is a tool for fast and reproducible processing of large amounts of data.

Stars: ✭ 137 (-97.64%)

Mutual labels: data-science, data-engineering

Auptimizer

An automatic ML model optimization tool.

Stars: ✭ 166 (-97.14%)

Mutual labels: data-science, data-engineering

Lightautoml

LAMA - automatic model creation framework

Stars: ✭ 196 (-96.63%)

Mutual labels: data-science, pipeline

Gspread Pandas

A package to easily open an instance of a Google spreadsheet and interact with worksheets through Pandas DataFrames.

Stars: ✭ 226 (-96.11%)

Mutual labels: data-science, data-engineering

Spark R Notebooks

R on Apache Spark (SparkR) tutorials for Big Data analysis and Machine Learning as IPython / Jupyter notebooks

Stars: ✭ 109 (-98.12%)

Mutual labels: data-science, exploratory-data-analysis

Chain.jl

A Julia package for piping a value through a series of transformation expressions using a more convenient syntax than Julia's native piping functionality.

Stars: ✭ 118 (-97.97%)

Mutual labels: data-science, pipeline

Superset

Apache Superset is a Data Visualization and Data Exploration Platform

Stars: ✭ 42,634 (+634.06%)

Mutual labels: data-science, data-engineering

Open Solution Salt Identification

Open solution to the TGS Salt Identification Challenge

Stars: ✭ 124 (-97.87%)

Mutual labels: data-science, pipeline

Spark Alchemy

Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive

Stars: ✭ 122 (-97.9%)

Mutual labels: data-science, data-engineering

contessa

Easy way to define, execute and store quality rules for your data.

Stars: ✭ 17 (-99.71%)

Mutual labels: data-engineering, data-quality

Blurr

Data transformations for the ML era

Stars: ✭ 96 (-98.35%)

Mutual labels: data-science, pipeline

Open Solution Toxic Comments

Open solution to the Toxic Comment Classification Challenge

Stars: ✭ 154 (-97.35%)

Mutual labels: data-science, pipeline

Bodywork Core

Deploy machine learning projects developed in Python to Kubernetes. Accelerated MLOps 🚀

Stars: ✭ 145 (-97.5%)

Mutual labels: data-science, pipeline

Kedro

A Python framework for creating reproducible, maintainable and modular data science code.

Stars: ✭ 4,764 (-17.98%)

Mutual labels: pipeline, mlops

dqlab-career-track

A collection of scripts written to complete DQLab Data Analyst Career Track 📊

Stars: ✭ 53 (-99.09%)

Mutual labels: exploratory-data-analysis, data-quality

Sparkora

Powerful rapid automatic EDA and feature engineering library with a very easy to use API 🌟

Stars: ✭ 51 (-99.12%)

Mutual labels: exploratory-data-analysis, eda

traceml

Engine for ML/Data tracking, visualization, dashboards, and model UI for Polyaxon.

Stars: ✭ 445 (-92.34%)

Mutual labels: data-profiling, mlops

Automlpipeline.jl

A package that makes it trivial to create and evaluate machine learning pipeline architectures.

Stars: ✭ 223 (-96.16%)

Mutual labels: data-science, pipeline

krsh

A declarative KubeFlow Management Tool

Stars: ✭ 127 (-97.81%)

Mutual labels: pipeline, mlops

skimpy

skimpy is a lightweight tool that provides summary statistics about variables in data frames within the console.

Stars: ✭ 236 (-95.94%)

Mutual labels: exploratory-data-analysis, eda

Exploratory Data Analysis Visualization Python

Data analysis and visualization with the PyData ecosystem: Pandas, Matplotlib, NumPy, and Seaborn

Stars: ✭ 78 (-98.66%)

Mutual labels: exploratory-data-analysis, eda

loon

A Toolkit for Interactive Statistical Data Visualization

Stars: ✭ 45 (-99.23%)

Mutual labels: exploratory-data-analysis, exploratory-analysis

soda-spark

Soda Spark is a PySpark library that helps you test your data in Spark DataFrames

Stars: ✭ 58 (-99%)

Mutual labels: data-engineering, data-quality

great expectations action

A GitHub Action that makes it easy to use Great Expectations to validate your data pipelines in your CI workflows.

Stars: ✭ 66 (-98.86%)

Mutual labels: data-quality, mlops

bodywork-ml-pipeline-project

Deployment template for a continuous training pipeline.

Stars: ✭ 22 (-99.62%)

Mutual labels: pipeline, mlops

versatile-data-kit

Versatile Data Kit (VDK) is an open source framework that enables anybody with basic SQL or Python knowledge to create their own data pipelines.

Stars: ✭ 144 (-97.52%)

Mutual labels: data-engineering, data-quality

popmon

Monitor the stability of a Pandas or Spark dataframe ⚙︎

Stars: ✭ 434 (-92.53%)

Mutual labels: data-profiling, mlops

beneath

Beneath is a serverless real-time data platform ⚡️

Stars: ✭ 65 (-98.88%)

Mutual labels: data-engineering, mlops

kedro

A Python framework for creating reproducible, maintainable and modular data science code.

Stars: ✭ 6,068 (+4.48%)

Mutual labels: pipeline, mlops

Polyaxon

Machine Learning Platform for Kubernetes (MLOps tools for experimentation and automation)

Stars: ✭ 2,966 (-48.93%)

Mutual labels: data-science, mlops

olliePy

OlliePy is a Python package that helps data scientists explore their data and evaluate and analyse their machine learning experiments by utilising the power and structure of modern web applications. The data scientist only needs to provide the data and any required information, and OlliePy will generate the rest.

Stars: ✭ 46 (-99.21%)

Mutual labels: exploratory-data-analysis, eda

Hub

Dataset format for AI. Build, manage, & visualize datasets for deep learning. Stream data real-time to PyTorch/TensorFlow & version-control it. https://activeloop.ai

Stars: ✭ 4,003 (-31.08%)

Mutual labels: data-science, mlops

Autoeda Resources

A list of software and papers related to automatic and fast Exploratory Data Analysis

Stars: ✭ 268 (-95.39%)

Mutual labels: exploratory-data-analysis, eda

Open Solution Mapping Challenge

Open solution to the Mapping Challenge 🌎

Stars: ✭ 291 (-94.99%)

Mutual labels: data-science, pipeline

Kaggle Competitions

There are plenty of courses and tutorials that can help you learn machine learning from scratch, but here on GitHub I want to solve some Kaggle competitions as a comprehensive workflow with Python packages. After reading, you can use this workflow as a template for solving other real problems.

Stars: ✭ 86 (-98.52%)

Mutual labels: data-science, exploratory-data-analysis

Drake

An R-focused pipeline toolkit for reproducibility and high-performance computing

Stars: ✭ 1,301 (-77.6%)

Mutual labels: data-science, pipeline

Ploomber

A convention over configuration workflow orchestrator. Develop locally (Jupyter or your favorite editor), deploy to Airflow or Kubernetes.

Stars: ✭ 221 (-96.19%)

Mutual labels: data-science, data-engineering

1-60 of 1500 similar projects
