petersontylerd / mlmachine

Licence: MIT License
mlmachine accelerates machine learning experimentation

mlmachine

"mlmachine is a Python library that organizes and accelerates notebook-based machine learning experiments."

Novel Functionality

Easy, Elegant EDA

mlmachine creates beautiful and informative EDA panels with ease:

# create EDA panel for all "category" features
for feature in mlmachine_titanic.data.mlm_dtypes["category"]:
    mlmachine_titanic.eda_cat_target_cat_feat(
        feature=feature,
        legend_labels=["Died", "Survived"],
    )

Pandas-in / Pandas-out Pipelines

mlmachine makes Scikit-learn transformers Pandas-friendly.

For example, wrapping the mlmachine utility PandasTransformer() around OneHotEncoder() keeps the output as a DataFrame instead of a bare array.
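As a rough illustration of the idea (not mlmachine's actual implementation), a pandas-in / pandas-out wrapper can be sketched in a few lines. The class name DataFrameWrapper and the sample data are hypothetical; only OneHotEncoder and its scikit-learn methods are real library calls.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


class DataFrameWrapper:
    """Hypothetical stand-in that returns a DataFrame from a scikit-learn transformer."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        self.transformer.fit(X, y)
        return self

    def transform(self, X):
        values = self.transformer.transform(X)
        # Some transformers return a SciPy sparse matrix; convert it to a dense array.
        if hasattr(values, "toarray"):
            values = values.toarray()
        # Use the transformer's output names when available, else generic names.
        if hasattr(self.transformer, "get_feature_names_out"):
            columns = list(self.transformer.get_feature_names_out())
        else:
            columns = [f"x{i}" for i in range(values.shape[1])]
        # Reuse the input index so rows stay aligned with the source DataFrame.
        return pd.DataFrame(values, columns=columns, index=X.index)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# Made-up sample data with a non-default index.
df = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"]}, index=[10, 11, 12, 13])
encoded = DataFrameWrapper(OneHotEncoder()).fit_transform(df)
print(encoded)
```

The wrapper only relabels the transformer's output; the encoding itself is still done by scikit-learn.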

KFold Target Encoding

mlmachine includes a utility called KFoldEncoder, which applies target encoding to categorical features and uses out-of-fold encoding to guard against target leakage:

# perform 5-fold target encoding with TargetEncoder from the category_encoders library
# (KFoldEncoder itself is provided by mlmachine)
from category_encoders import TargetEncoder
from sklearn.model_selection import KFold

encoder = KFoldEncoder(
    target=mlmachine_titanic_machine.training_target,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    encoder=TargetEncoder,
)
encoder.fit_transform(mlmachine_titanic_machine.training_features[["Pclass"]])

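The out-of-fold idea can be sketched without mlmachine. The helper out_of_fold_target_encode and the toy data below are hypothetical and only illustrate the technique; KFoldEncoder's internals may differ.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def out_of_fold_target_encode(feature, target, cv):
    """Encode each row with category means learned from the other folds only."""
    encoded = pd.Series(np.nan, index=feature.index, dtype=float)
    for fit_idx, encode_idx in cv.split(feature):
        fit_target = target.iloc[fit_idx]
        # Category means are computed without the rows being encoded.
        fold_means = fit_target.groupby(feature.iloc[fit_idx]).mean()
        mapped = feature.iloc[encode_idx].map(fold_means)
        # Categories unseen in the fitting folds fall back to the fitting-fold mean.
        encoded.iloc[encode_idx] = mapped.fillna(fit_target.mean()).to_numpy()
    return encoded


# Made-up data: one categorical feature and a binary target.
feature = pd.Series(["a", "b", "a", "b", "a", "b", "a", "b"])
target = pd.Series([1, 0, 1, 0, 0, 1, 1, 1])
encoded = out_of_fold_target_encode(feature, target, KFold(n_splits=2))
print(encoded.tolist())
```

Because a row's own target value never contributes to its encoding, the encoded feature cannot simply memorize the label.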

Crowd-sourced Feature Importance & Exhaustive Feature Selection

mlmachine employs a robust approach to estimating feature importance by using a variety of techniques:

  • Tree-based Feature Importance
  • Recursive Feature Elimination
  • Sequential Forward Selection
  • Sequential Backward Selection
  • F-value / p-value
  • Variance 
  • Target Correlation

A single call runs all of these techniques across multiple estimators and one or more scoring metrics:

# import the estimator classes used below
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# instantiate custom models
rf2 = RandomForestClassifier(max_depth=2)
rf4 = RandomForestClassifier(max_depth=4)
rf6 = RandomForestClassifier(max_depth=6)

# estimator list - default XGBClassifier, default
# RandomForestClassifier and three custom models
estimators = [
    XGBClassifier,
    RandomForestClassifier,
    rf2,
    rf4,
    rf6,
]

# instantiate FeatureSelector object
fs = mlmachine_titanic_machine.FeatureSelector(
    data=mlmachine_titanic_machine.training_features,
    target=mlmachine_titanic_machine.training_target,
    estimators=estimators,
)

# run feature importance techniques, use ROC AUC and
# accuracy score metrics and 0 CV folds (where applicable)
feature_selector_summary = fs.feature_selector_suite(
    sequential_scoring=["roc_auc", "accuracy_score"],
    sequential_n_folds=0,
    save_to_csv=True,
)
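
One plausible way to combine the results of several importance techniques is to rank the features within each technique and average the ranks. The sketch below assumes that approach; the function aggregate_feature_ranks, the feature names, and the scores are made up for illustration and are not taken from mlmachine.

```python
import pandas as pd


def aggregate_feature_ranks(scores):
    """Rank features within each technique, then average the ranks per feature."""
    # Assumes every column is oriented so that a higher score means more important.
    ranks = scores.rank(ascending=False, method="average")
    summary = ranks.assign(mean_rank=ranks.mean(axis=1))
    return summary.sort_values("mean_rank")


# Illustrative scores only: rows are features, columns are techniques.
scores = pd.DataFrame(
    {
        "tree_importance": [0.40, 0.35, 0.25],
        "f_value": [30.0, 60.0, 5.0],
        "target_correlation": [0.34, 0.54, 0.08],
    },
    index=["feature_a", "feature_b", "feature_c"],
)
summary = aggregate_feature_ranks(scores)
print(summary)
```

Techniques where a lower value is better, such as p-values, would need to be inverted before ranking.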

The features are then winnowed away, from least important to most important, through an exhaustive cross-validation procedure in search of an optimal feature subset.
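The winnowing step can be approximated with plain scikit-learn. This is a minimal sketch under stated assumptions: winnow_features is a hypothetical helper, the data is synthetic, and tree-based importance stands in for whatever ranking is actually used.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def winnow_features(estimator, X, y, ranked_features, cv=5, scoring="accuracy"):
    """Cross-validate shrinking feature subsets, dropping the least important first."""
    results = []
    remaining = list(ranked_features)  # ordered from most to least important
    while remaining:
        score = cross_val_score(estimator, X[remaining], y, cv=cv, scoring=scoring).mean()
        results.append((len(remaining), score))
        remaining.pop()  # remove the least important remaining feature
    return results


# Synthetic data stands in for a real training set.
X_arr, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Any ranking works here; tree-based importance is used as a stand-in.
model = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
ranked = X.columns[model.feature_importances_.argsort()[::-1]].tolist()

results = winnow_features(RandomForestClassifier(n_estimators=25, random_state=0), X, y, ranked)
best_n, best_score = max(results, key=lambda item: item[1])
print(best_n, round(best_score, 3))
```

The subset size with the best mean cross-validation score is a natural candidate for the final feature set.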



Hyperparameter Tuning with Bayesian Optimization

mlmachine can perform Bayesian optimization on multiple estimators in one shot, and includes functionality for visualizing model performance and parameter selections:

# generate parameter selection panels for each parameter
mlmachine_titanic_machine.model_param_plot(
    bayes_optim_summary=bayes_optim_summary,
    estimator_class="KNeighborsClassifier",
    estimator_parameter_space=estimator_parameter_space,
    n_iter=100,
)

Example Notebooks

All examples can be viewed here

Example Notebook 1 - Learn the basics of mlmachine, how to create EDA panels, and how to execute Pandas-friendly Scikit-learn transformations and pipelines.

Example Notebook 2 - Learn how to use mlmachine to assess a dataset's preprocessing needs. See examples of how to use novel functionality, such as GroupbyImputer(), KFoldEncoder() and DualTransformer().

Example Notebook 3 - Learn how to perform thorough feature importance estimation, followed by an exhaustive, cross-validation-driven feature selection process.

Example Notebook 4 - Learn how to execute hyperparameter tuning with Bayesian optimization for multiple models and multiple parameter spaces in one simple execution.

Articles on Medium

mlmachine - Clean ML Experiments, Elegant EDA & Pandas Pipelines - Published 4/3/2020

mlmachine - GroupbyImputer, KFoldEncoder, and Skew Correction - Published 4/13/2020

Installation

Python Requirements: 3.6, 3.7

mlmachine relies on the latest, or nearly latest, versions of all its dependencies, so installing it in a virtual environment is highly recommended.

pyenv

Create a new virtual environment:

$ pyenv virtualenv 3.7.5 mlmachine-env

Activate your new virtual environment:

$ pyenv activate mlmachine-env

Use pip to install mlmachine and all of its dependencies:

$ pip install mlmachine

anaconda

Create a new virtual environment:

$ conda create --name mlmachine-env python=3.7

Activate your new virtual environment:

$ conda activate mlmachine-env

Use pip to install mlmachine and all of its dependencies:

$ pip install mlmachine

Feedback

Any and all feedback is welcome. Please send me an email at [email protected]

Acknowledgments

mlmachine stands on the shoulders of many great Python packages:

catboost | category_encoders | eif | hyperopt | imbalanced-learn | jupyter | lightgbm | matplotlib | numpy | pandas | prettierplot | scikit-learn | scipy | seaborn | shap | statsmodels | xgboost