
uzaymacar / exemplary-ml-pipeline

License: MIT
Exemplary, annotated machine learning pipeline for any tabular data problem.

Programming Languages

Jupyter Notebook
Python

Projects that are alternatives of or similar to exemplary-ml-pipeline

50-days-of-Statistics-for-Data-Science
This repository consists of a 50-day program covering all the statistics required for a complete understanding of data science.
Stars: ✭ 19 (-17.39%)
Mutual labels:  feature-selection, feature-engineering, feature-scaling
FIFA-2019-Analysis
A project based on the FIFA World Cup 2019 that analyzes the performance and efficiency of teams, players, countries, and other related factors using data analysis and data visualization.
Stars: ✭ 28 (+21.74%)
Mutual labels:  feature-selection, feature-engineering, data-cleaning
featurewiz
Use advanced feature engineering strategies and select the best features from your data set with a single line of code.
Stars: ✭ 229 (+895.65%)
Mutual labels:  feature-selection, feature-engineering, featuretools
Market-Mix-Modeling
Market Mix Modelling for an eCommerce firm to estimate the impact of various marketing levers on sales
Stars: ✭ 31 (+34.78%)
Mutual labels:  feature-selection, feature-engineering
featuretoolsOnSpark
A simplified version of featuretools for Spark
Stars: ✭ 24 (+4.35%)
Mutual labels:  feature-engineering, featuretools
Machine Learning Workflow With Python
A comprehensive walkthrough of ML techniques with Python: define the problem, specify inputs & outputs, data collection, exploratory data analysis, data preprocessing, model design, training, and evaluation.
Stars: ✭ 157 (+582.61%)
Mutual labels:  feature-engineering, data-cleaning
Auto ml
[UNMAINTAINED] Automated machine learning for analytics & production
Stars: ✭ 1,559 (+6678.26%)
Mutual labels:  feature-engineering, machine-learning-pipelines
skutil
NOTE: skutil is now deprecated. See its sister project: https://github.com/tgsmith61591/skoot. Original description: A set of scikit-learn and h2o extension classes (as well as caret classes for python). See more here: https://tgsmith61591.github.io/skutil
Stars: ✭ 29 (+26.09%)
Mutual labels:  sklearn, h2o
feature engine
Feature engineering package with sklearn like functionality
Stars: ✭ 758 (+3195.65%)
Mutual labels:  feature-selection, feature-engineering
sklearn-audio-classification
An in-depth analysis of audio classification on the RAVDESS dataset. Feature engineering, hyperparameter optimization, model evaluation, and cross-validation with a variety of ML techniques and MLP
Stars: ✭ 31 (+34.78%)
Mutual labels:  sklearn, feature-engineering
Hyperparameter hunter
Easy hyperparameter optimization and automatic result saving across machine learning algorithms and libraries
Stars: ✭ 648 (+2717.39%)
Mutual labels:  sklearn, feature-engineering
dominance-analysis
This package can be used for dominance analysis or Shapley Value Regression to find the relative importance of predictors on a given dataset. This library can be used for key driver analysis or marginal resource allocation models.
Stars: ✭ 111 (+382.61%)
Mutual labels:  feature-selection, feature-engineering
NVTabular
NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Stars: ✭ 797 (+3365.22%)
Mutual labels:  feature-selection, feature-engineering
Credit-Risk-Analysis
No description or website provided.
Stars: ✭ 29 (+26.09%)
Mutual labels:  sklearn, feature-engineering
msda
Library for multi-dimensional, multi-sensor, uni/multivariate time series data analysis, unsupervised feature selection, unsupervised deep anomaly detection, and prototype of explainable AI for anomaly detector
Stars: ✭ 80 (+247.83%)
Mutual labels:  feature-selection, feature-engineering
Remixautoml
R package for automation of machine learning, forecasting, feature engineering, model evaluation, model interpretation, data generation, and recommenders.
Stars: ✭ 159 (+591.3%)
Mutual labels:  h2o, feature-engineering
Drugs Recommendation Using Reviews
Analyzing the Drugs Descriptions, conditions, reviews and then recommending it using Deep Learning Models, for each Health Condition of a Patient.
Stars: ✭ 35 (+52.17%)
Mutual labels:  feature-engineering, data-cleaning
sklearn-feature-engineering
Feature engineering with sklearn.
Stars: ✭ 114 (+395.65%)
Mutual labels:  sklearn, feature-engineering
My Journey In The Data Science World
📢 Ready to learn or review your knowledge!
Stars: ✭ 1,175 (+5008.7%)
Mutual labels:  sklearn, data-cleaning
skrobot
skrobot is a Python module for designing, running and tracking Machine Learning experiments / tasks. It is built on top of scikit-learn framework.
Stars: ✭ 22 (-4.35%)
Mutual labels:  feature-selection, feature-engineering

Exemplary Machine Learning Pipeline

Introduction

This repository aims to serve as an exemplary data science & machine learning pipeline for any tabular data problem. Moreover, the notebooks explore two Python packages for machine learning automation: featuretools and h2o. Whereas featuretools specializes in feature engineering, h2o specializes in modelling.

Follow the notebooks in the order indicated. Broadly, here is what we cover:

  • Data Insights & Visualizations
  • Data Cleaning
  • Data Imputation
  • Manual Feature Engineering
  • Automated Feature Engineering via featuretools
  • Feature Scaling
  • Feature Selection
  • Feature Encoding
  • Modelling (Model Selection & Analysis) via h2o

There are two main arguments we can make:

  1. Currently, there is a huge gap between what we call automated machine learning and the actual machine learning workflow required to solve a real data problem. This is a recurring theme across the notebooks: we still had to impute missing values, apply feature selection, and much more in order to increase our prediction score.
  2. The existing gap is a matter of implementation rather than theory. In other words, an extensive literature (papers, workshops, experiments, examples, notebooks, etc.) has evolved around the missing pieces in this gap, and the notebooks make the appropriate references. Essentially, the hard parts are covered by packages such as h2o and featuretools, but the easier parts are not addressed in terms of automation. Note the word automation: sklearn already has fairly complete, though non-automated, implementations of the missing pieces mentioned in this repository.
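Those "easier parts" sklearn already covers can be strung together in one pipeline. The sketch below is illustrative only: the column names are hypothetical stand-ins, and the imputation/selection choices are examples, not the exact configuration used in the notebooks.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; the real notebooks operate on the Kaggle columns
numeric_features = ["age", "signup_flow"]
categorical_features = ["gender", "language"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),        # Data Imputation
        ("scale", StandardScaler()),                         # Feature Scaling
    ]), numeric_features),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),  # Feature Encoding
    ]), categorical_features),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectKBest(f_classif, k=5)),                 # Feature Selection
])

# Toy data with missing values to exercise the imputers
X = pd.DataFrame({
    "age": [25, None, 31, 40, 22, 35],
    "signup_flow": [0, 3, 0, 12, 3, 0],
    "gender": ["m", "f", None, "f", "m", "f"],
    "language": ["en", "fr", "en", "de", "fr", "en"],
})
y = [0, 1, 0, 1, 0, 1]
Xt = pipe.fit_transform(X, y)
print(Xt.shape)
```

Each step exists, but choosing strategies and wiring them together is still manual work, which is exactly the gap argued above.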

Getting Started

Download the data folders with prepared training and testing data files (.csv) from here and replace the name-wise matching folders in this repository. Alternatively, download only (0)data/ (which you can also get from here) and run the Jupyter notebooks to generate the rest of the data yourself.

Results & Comparisons

All of the models below are trained and validated by h2o's H2OAutoML module, but the operations applied to the data before the modelling process differ for each row. For fairness of comparison, all models are trained under a time limit of 10,000 seconds and with similar parameters.

Data Directory | Data & Operations Description | Num Features | Best Model | Maximum Prediction Accuracy (%)
(0)data | Untouched files extracted from Kaggle | 13 | Stacked Ensemble | 56.19
(1)data_manual_ops | Applied data imputation, removed nonsensical (outlier-like) values from the 'age' column, and added a feature-engineered column by linking train_users.csv and age_gender_bkts.csv | 14 | Stacked Ensemble | 62.54
(2)data_automated_ops | Applied automated feature engineering via featuretools, linking train_users.csv with sessions.csv and age_gender_bkts.csv | 137 | XGBoost | 68.85
(3)data_trimmed/raw (^) | Applied manual feature scaling based on the normal distribution to numerical variables, plus comprehensive feature selection | 39 | XGBoost | 71.58
(3)data_trimmed/raw | Same operations and data as (^), but applied undersampling to majority classes via h2o | 39 | XGBoost | 71.44
(3)data_trimmed/label_encoded | Same operations and data as (^), but applied label encoding to all categorical variables, so all variables end up numeric | 39 | Stacked Ensemble | 72.10
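The label-encoding step from the last row can be sketched with sklearn. Note an assumption here: sklearn's LabelEncoder is intended for targets, so OrdinalEncoder stands in for encoding feature columns, and the toy frame with made-up column names merely illustrates the idea.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame standing in for (3)data_trimmed; column names are illustrative
df = pd.DataFrame({
    "gender": ["m", "f", "f", "m"],
    "language": ["en", "fr", "en", "de"],
    "age": [25.0, 40.0, 31.0, 28.0],
})

cat_cols = df.select_dtypes(include="object").columns
# Map each category to an integer code so every column ends up numeric
df[cat_cols] = OrdinalEncoder().fit_transform(df[cat_cols])
print(df.dtypes)
```

With all variables numeric, every model family in the comparison can consume the data directly, which is presumably why this variant edges out the raw one in the table.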

Future Work

  • Check Driverless AI Platform.
  • Look into more parameters of the H2OAutoML module; in particular, try increasing the max_runtime_secs parameter for a longer training duration and, hopefully, better prediction scores.
  • Produce more self-encoded data.

Check to see if these increase prediction scores in any way.
