
uzaymacar / exemplary-ml-pipeline

License: MIT
Exemplary, annotated machine learning pipeline for any tabular data problem.

Programming Languages

Jupyter Notebook
Python

Projects that are alternatives of or similar to exemplary-ml-pipeline

50-days-of-Statistics-for-Data-Science
This repository consists of a 50-day program covering all the statistics required for a complete understanding of data science.
Stars: ✭ 19 (-17.39%)
Mutual labels:  feature-selection, feature-engineering, feature-scaling
FIFA-2019-Analysis
A project based on the FIFA World Cup 2019 that analyzes the performance and efficiency of teams, players, countries, and other related factors using data analysis and data visualization.
Stars: ✭ 28 (+21.74%)
Mutual labels:  feature-selection, feature-engineering, data-cleaning
featurewiz
Use advanced feature engineering strategies and select the best features from your data set with a single line of code.
Stars: ✭ 229 (+895.65%)
Mutual labels:  feature-selection, feature-engineering, featuretools
Market-Mix-Modeling
Market Mix Modelling for an eCommerce firm to estimate the impact of various marketing levers on sales
Stars: ✭ 31 (+34.78%)
Mutual labels:  feature-selection, feature-engineering
featuretoolsOnSpark
A simplified version of featuretools for Spark
Stars: ✭ 24 (+4.35%)
Mutual labels:  feature-engineering, featuretools
Machine Learning Workflow With Python
A comprehensive walkthrough of ML techniques with Python: define the problem, specify inputs & outputs, data collection, exploratory data analysis, data preprocessing, model design, training, and evaluation.
Stars: ✭ 157 (+582.61%)
Mutual labels:  feature-engineering, data-cleaning
Auto ml
[UNMAINTAINED] Automated machine learning for analytics & production
Stars: ✭ 1,559 (+6678.26%)
Mutual labels:  feature-engineering, machine-learning-pipelines
skutil
NOTE: skutil is now deprecated. See its sister project: https://github.com/tgsmith61591/skoot. Original description: A set of scikit-learn and h2o extension classes (as well as caret classes for python). See more here: https://tgsmith61591.github.io/skutil
Stars: ✭ 29 (+26.09%)
Mutual labels:  sklearn, h2o
feature engine
Feature engineering package with sklearn like functionality
Stars: ✭ 758 (+3195.65%)
Mutual labels:  feature-selection, feature-engineering
sklearn-audio-classification
An in-depth analysis of audio classification on the RAVDESS dataset. Feature engineering, hyperparameter optimization, model evaluation, and cross-validation with a variety of ML techniques and MLP
Stars: ✭ 31 (+34.78%)
Mutual labels:  sklearn, feature-engineering
Hyperparameter hunter
Easy hyperparameter optimization and automatic result saving across machine learning algorithms and libraries
Stars: ✭ 648 (+2717.39%)
Mutual labels:  sklearn, feature-engineering
dominance-analysis
This package can be used for dominance analysis or Shapley Value Regression to find the relative importance of predictors on a given dataset. This library can be used for key driver analysis or marginal resource allocation models.
Stars: ✭ 111 (+382.61%)
Mutual labels:  feature-selection, feature-engineering
NVTabular
NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Stars: ✭ 797 (+3365.22%)
Mutual labels:  feature-selection, feature-engineering
Credit-Risk-Analysis
No description or website provided.
Stars: ✭ 29 (+26.09%)
Mutual labels:  sklearn, feature-engineering
msda
Library for multi-dimensional, multi-sensor, uni/multivariate time series data analysis, unsupervised feature selection, unsupervised deep anomaly detection, and prototype of explainable AI for anomaly detector
Stars: ✭ 80 (+247.83%)
Mutual labels:  feature-selection, feature-engineering
Remixautoml
R package for automation of machine learning, forecasting, feature engineering, model evaluation, model interpretation, data generation, and recommenders.
Stars: ✭ 159 (+591.3%)
Mutual labels:  h2o, feature-engineering
Drugs Recommendation Using Reviews
Analyzing the Drugs Descriptions, conditions, reviews and then recommending it using Deep Learning Models, for each Health Condition of a Patient.
Stars: ✭ 35 (+52.17%)
Mutual labels:  feature-engineering, data-cleaning
sklearn-feature-engineering
Feature engineering with sklearn.
Stars: ✭ 114 (+395.65%)
Mutual labels:  sklearn, feature-engineering
My Journey In The Data Science World
📢 Ready to learn or review your knowledge!
Stars: ✭ 1,175 (+5008.7%)
Mutual labels:  sklearn, data-cleaning
skrobot
skrobot is a Python module for designing, running and tracking Machine Learning experiments / tasks. It is built on top of scikit-learn framework.
Stars: ✭ 22 (-4.35%)
Mutual labels:  feature-selection, feature-engineering

Exemplary Machine Learning Pipeline

Introduction

This repository aims to serve as an exemplary data science & machine learning pipeline for any tabular data problem. Moreover, the notebooks explore two Python packages for machine learning automation: featuretools and h2o. Whereas featuretools specializes in feature engineering, h2o specializes in modelling.

Follow the notebooks in the order indicated. Broadly, here is what we cover:

  • Data Insights & Visualizations
  • Data Cleaning
  • Data Imputation
  • Manual Feature Engineering
  • Automated Feature Engineering via featuretools
  • Feature Scaling
  • Feature Selection
  • Feature Encoding
  • Modelling (Model Selection & Analysis) via h2o

There are two main arguments we can make:

  1. Currently, there is a huge gap between what we call automated machine learning and the actual machine learning workflow required to solve a real data problem. This is a recurring theme across the notebooks: we still had to impute missing values, apply feature selection, and much more in order to increase our prediction score.
  2. The existing gap is a matter of implementation rather than theory. In other words, an extensive literature (papers, workshops, experiments, examples, notebooks, etc.) has evolved around the missing pieces in this gap, and the notebooks make the appropriate references. Essentially, the hard parts are covered by packages such as h2o and featuretools, but the easier parts are not addressed in terms of automation. Note the word automation: sklearn already has fairly complete, though non-automated, implementations of the missing pieces mentioned in this repository.
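Those "easier parts" sklearn already covers can be strung together in one pipeline. The sketch below is illustrative only: the column names are hypothetical stand-ins, and the imputation/selection choices are examples, not the exact configuration used in the notebooks.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; the real notebooks operate on the Kaggle columns
numeric_features = ["age", "signup_flow"]
categorical_features = ["gender", "language"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),        # Data Imputation
        ("scale", StandardScaler()),                         # Feature Scaling
    ]), numeric_features),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),  # Feature Encoding
    ]), categorical_features),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectKBest(f_classif, k=5)),                 # Feature Selection
])

# Toy data with missing values to exercise the imputers
X = pd.DataFrame({
    "age": [25, None, 31, 40, 22, 35],
    "signup_flow": [0, 3, 0, 12, 3, 0],
    "gender": ["m", "f", None, "f", "m", "f"],
    "language": ["en", "fr", "en", "de", "fr", "en"],
})
y = [0, 1, 0, 1, 0, 1]
Xt = pipe.fit_transform(X, y)
print(Xt.shape)
```

Each step exists, but choosing strategies and wiring them together is still manual work, which is exactly the gap argued above.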

Getting Started

Download the data folders with prepared training and testing data files (.csv) from here and replace the name-wise matching folders in this repository. Alternatively, download only (0)data/ (which you can also get from here) and run the Jupyter notebooks to generate the rest of the data yourself.

Results & Comparisons

All of the models below are trained and validated by h2o's H2OAutoML module, but the operations applied to the data before the modelling process differ for each row. For fairness of comparison, all models are trained under a time limit of 10,000 seconds and with similar parameters.

Data Directory | Data & Operations Description | Num Features | Best Model | Maximum Prediction Accuracy (%)
(0)data | Untouched files extracted from Kaggle | 13 | Stacked Ensemble | 56.19
(1)data_manual_ops | Applied data imputation, removed nonsensical (outlier-like) values from the 'age' column, and added a feature-engineered column by linking train_users.csv and age_gender_bkts.csv | 14 | Stacked Ensemble | 62.54
(2)data_automated_ops | Applied automated feature engineering via featuretools, linking train_users.csv with sessions.csv and age_gender_bkts.csv | 137 | XGBoost | 68.85
(3)data_trimmed/raw (^) | Applied manual feature scaling based on the normal distribution to numerical variables, plus comprehensive feature selection | 39 | XGBoost | 71.58
(3)data_trimmed/raw | Same operations and data as (^), but applied undersampling to majority classes via h2o | 39 | XGBoost | 71.44
(3)data_trimmed/label_encoded | Same operations and data as (^), but applied label encoding to all categorical variables, so all variables end up numeric | 39 | Stacked Ensemble | 72.10
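The label-encoding step from the last row can be sketched with sklearn. Note an assumption here: sklearn's LabelEncoder is intended for targets, so OrdinalEncoder stands in for encoding feature columns, and the toy frame with made-up column names merely illustrates the idea.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame standing in for (3)data_trimmed; column names are illustrative
df = pd.DataFrame({
    "gender": ["m", "f", "f", "m"],
    "language": ["en", "fr", "en", "de"],
    "age": [25.0, 40.0, 31.0, 28.0],
})

cat_cols = df.select_dtypes(include="object").columns
# Map each category to an integer code so every column ends up numeric
df[cat_cols] = OrdinalEncoder().fit_transform(df[cat_cols])
print(df.dtypes)
```

With all variables numeric, every model family in the comparison can consume the data directly, which is presumably why this variant edges out the raw one in the table.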

Future Work

  • Check Driverless AI Platform.
  • Look into more parameters of the H2OAutoML module; in particular, try increasing the max_runtime_secs parameter for a longer training duration and, hopefully, better prediction scores.
  • Produce more self-encoded data.

Check to see if these increase prediction scores in any way.
