parrt / Stratx

License: MIT
stratx is a library implementing "A Stratification Approach to Partial Dependence for Codependent Variables".

Projects that are alternatives of or similar to Stratx

Machine Learning
Stars: ✭ 34 (-2.86%)
Mutual labels:  jupyter-notebook
Pandas basics
basic pandas tutorials
Stars: ✭ 34 (-2.86%)
Mutual labels:  jupyter-notebook
Satellite imagery analysis
Implementation of different techniques to find insights from the satellite data using Python.
Stars: ✭ 31 (-11.43%)
Mutual labels:  jupyter-notebook
Neural Networks
All about Neural Networks!
Stars: ✭ 34 (-2.86%)
Mutual labels:  jupyter-notebook
Teacher Student Training
This repository stores the files used for my summer internship's work on "teacher-student learning", an experimental method for training deep neural networks using a trained teacher model.
Stars: ✭ 34 (-2.86%)
Mutual labels:  jupyter-notebook
Welcome tutorials
Various tutorials given for welcoming new students at MILA.
Stars: ✭ 975 (+2685.71%)
Mutual labels:  jupyter-notebook
Corrosiondetector
Corrosion Detection from Images
Stars: ✭ 34 (-2.86%)
Mutual labels:  jupyter-notebook
Mlwpy code
Code from the Pearson Addison-Wesley book Machine Learning with Python for Everyone
Stars: ✭ 34 (-2.86%)
Mutual labels:  jupyter-notebook
Ds502 Ai Engineer
Projects for the DS502 AI Engineer fast-track course, including learning and practice materials for each session; students can post their questions in the issues and discuss with each other.
Stars: ✭ 34 (-2.86%)
Mutual labels:  jupyter-notebook
Misc
Machine Learning / Randomized Algorithm and more
Stars: ✭ 34 (-2.86%)
Mutual labels:  jupyter-notebook
Trainee
Stars: ✭ 34 (-2.86%)
Mutual labels:  jupyter-notebook
Restaurant success model
A model that predicts if a restaurant is going to fail within the next 4 years.
Stars: ✭ 34 (-2.86%)
Mutual labels:  jupyter-notebook
Kepler.gl
Kepler.gl is a powerful open source geospatial analysis tool for large-scale data sets.
Stars: ✭ 8,231 (+23417.14%)
Mutual labels:  jupyter-notebook
Minecraft Reinforcement Learning
Deep Recurrent Q-Learning vs Deep Q Learning on a simple Partially Observable Markov Decision Process with Minecraft
Stars: ✭ 33 (-5.71%)
Mutual labels:  jupyter-notebook
Healthcheck
Health Check ✔ is a Machine Learning web application made using Flask that can predict three diseases: Diabetes, Heart Disease, and Cancer.
Stars: ✭ 35 (+0%)
Mutual labels:  jupyter-notebook
Machinelearning portugues
Files from the Machine Learning classes in Portuguese (python + scikit-learn - https://www.youtube.com/playlist?list=PL4OAe-tL47sb3xdFBVXs2w1BA2LRN5JU2_)
Stars: ✭ 34 (-2.86%)
Mutual labels:  jupyter-notebook
Cs231n
Convolutional Neural Networks for Visual Recognition (kNN, softmax, etc)
Stars: ✭ 34 (-2.86%)
Mutual labels:  jupyter-notebook
Graph Isomorphism Networks
A Tensorflow 2.0 implementation of Graph Isomorphism Networks.
Stars: ✭ 35 (+0%)
Mutual labels:  jupyter-notebook
Presentations
Stars: ✭ 35 (+0%)
Mutual labels:  jupyter-notebook
Pyannote Core
Advanced data structures for handling temporal segments with attached labels.
Stars: ✭ 34 (-2.86%)
Mutual labels:  jupyter-notebook

Partial Dependence through Stratification

This repo provides the pip package stratx, implementing the StratPD and CatStratPD algorithms for computing model-independent partial dependence. With the 0.4 release, we added feature impact and importance.

Source is Python 3 and license is MIT.

Abstracts

Partial dependence

Partial dependence curves (FPD), introduced by Friedman (2000), are an important model interpretation tool, but they are often not accessible to business analysts and scientists who typically lack the skills to choose, tune, and assess machine learning models. It is also common for the same partial dependence algorithm on the same data to give meaningfully different curves for different models, which calls into question their precision. Expertise is required to distinguish between model artifacts and true relationships in the data.

In this paper, we contribute methods for computing partial dependence curves, for both numerical (StratPD) and categorical explanatory variables (CatStratPD), that work directly from training data rather than predictions of a model. Our methods provide a direct estimate of partial dependence, and rely on approximating the partial derivative of an unknown regression function without first fitting a model and then approximating its partial derivative. We investigate settings where contemporary partial dependence methods—including FPD, ALE, and SHAP methods—give biased results. Furthermore, we demonstrate that our approach works correctly on synthetic and plausibly on real data sets. Our goal is not to argue that model-based techniques are not useful. Rather, we hope to open a new line of inquiry into nonparametric partial dependence.
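
To make the stratification idea concrete, here is a minimal conceptual sketch in plain numpy/scikit-learn. It is not the stratx implementation (StratPD differs in important details), but it illustrates the core move: group observations that are similar in every feature except the one of interest, estimate the local slope of y against that feature within each group, then average and integrate the slopes.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def stratpd_sketch(X, y, colname, min_samples_leaf=10):
    """Toy illustration only: stratify on all other features with a decision
    tree, fit a local slope of y vs X[colname] inside each leaf, then average
    the slopes at each unique x value and integrate to get a PD-like curve."""
    x = X[colname].values
    y = np.asarray(y)
    X_other = X.drop(columns=[colname]).values
    # Each leaf groups observations that are similar in the *other* features
    tree = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf).fit(X_other, y)
    leaves = tree.apply(X_other)                 # leaf id per observation = stratum id
    uniq_x = np.unique(x)
    slope_sum = np.zeros(len(uniq_x))
    slope_cnt = np.zeros(len(uniq_x))
    for leaf in np.unique(leaves):
        idx = leaves == leaf
        if np.unique(x[idx]).size < 2:
            continue
        slope, _ = np.polyfit(x[idx], y[idx], 1)   # local slope in this stratum
        in_range = (uniq_x >= x[idx].min()) & (uniq_x <= x[idx].max())
        slope_sum[in_range] += slope
        slope_cnt[in_range] += 1
    avg_slope = np.divide(slope_sum, slope_cnt,
                          out=np.zeros_like(slope_sum), where=slope_cnt > 0)
    # Integrate the averaged slopes to recover a partial-dependence-like curve
    pd_curve = np.concatenate(([0.0], np.cumsum(avg_slope[:-1] * np.diff(uniq_x))))
    return uniq_x, pd_curve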

See the academic paper (written with James D. Wilson) for details. The basic problem is that the same algorithm, operating on the same data, can give meaningfully different partial dependences depending on the model chosen by the user. Consider the following comparison of Friedman's partial dependence plots and ICE plots for three different models to the StratPD version.

(Plots of bathrooms versus rent price using New York City apartment rent data. The sample size is 10,000 observations out of ~50k. The PD/ICE plots are radically different for the same data set, depending on the model chosen by the user. Hyperparameters were tuned using a 5-fold cross-validation grid search. The Keras model was trained by experimentation: a single hidden layer of 100 neurons, 500 epochs, a batch size of 1000, batch normalization, and 30% dropout. StratPD gives a plausible result: rent goes up roughly linearly with the number of bathrooms. R^2 values were computed on 20% validation sets.)
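
To reproduce this kind of model-dependence comparison yourself, scikit-learn (1.0+) can generate the PD/ICE side. The rent data set and the exact models from the figure are not included here, so the sketch below uses the diabetes data and two arbitrary models purely for illustration:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target

models = {"RF": RandomForestRegressor(n_estimators=100, random_state=0),
          "GBM": GradientBoostingRegressor(random_state=0)}
fig, axes = plt.subplots(1, len(models), figsize=(8, 3.5), sharey=True)
for ax, (name, model) in zip(axes, models.items()):
    model.fit(X, y)
    # kind="both" overlays the individual ICE curves on the averaged PD curve
    PartialDependenceDisplay.from_estimator(model, X, ['bmi'],
                                            kind="both", ax=ax)
    ax.set_title(name)
plt.tight_layout()
plt.show()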

Feature impact and importance

See the academic paper, written with James D. Wilson and Jeff Hamrick.

Practitioners use feature importance to rank and eliminate weak predictors during model development in an effort to simplify models and improve generality. Unfortunately, they also routinely conflate such feature importance measures with feature impact, the isolated effect of an explanatory variable on the response variable. This can lead to real-world consequences when importance is inappropriately interpreted as impact for business or medical insight purposes. The dominant approach for computing importances is through interrogation of a fitted model, which works well for feature selection, but gives distorted measures of feature impact. The same method applied to the same data set can yield different feature importances, depending on the model, leading us to conclude that impact should be computed directly from the data. While there are nonparametric feature selection algorithms, they typically provide feature rankings, rather than measures of impact or importance. They also typically focus on single-variable associations with the response. In this paper, we give mathematical definitions of feature impact and importance, derived from partial dependence curves, that operate directly on the data. To assess quality, we show that features ranked by these definitions are competitive with existing feature selection techniques using three real data sets for predictive tasks.

Sorry, I don't have examples set up yet for feature impact except for those images needed for the paper. Here are the top-k features from a NYC rent data set, training an RF model and comparing the mean absolute error:
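
The evaluation loop behind that comparison is easy to reproduce once you have a feature ranking (from stratx or any other method). The sketch below is mine, not part of the library: topk_mae and ranked_features are hypothetical names, and it uses a generic 80/20 split rather than the rent data.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def topk_mae(X, y, ranked_features, max_k=8):
    """For k = 1..max_k, train an RF on the top-k features of the given
    ranking and report validation MAE (lower is better)."""
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.2, random_state=1)
    maes = []
    for k in range(1, max_k + 1):
        cols = list(ranked_features[:k])
        rf = RandomForestRegressor(n_estimators=100, random_state=1, n_jobs=-1)
        rf.fit(X_train[cols], y_train)
        maes.append(mean_absolute_error(y_valid, rf.predict(X_valid[cols])))
    return maes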

Installation

Latest is 0.4.

pip install -U stratx

Examples

See notebooks/examples.ipynb for lots more stuff.

Boston

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from stratx.partdep import *

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target

X = df.drop('MEDV', axis=1)
y = df['MEDV']

# WORKS ONLY WITH DATAFRAMES AT MOMENT
plt.figure(figsize=(3.5,3.5))
plot_stratpd(X, y, 'LSTAT', 'MEDV', yrange=(-20, 5), n_trials=10)
plt.tight_layout()
plt.show()

Diabetes

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from stratx.partdep import *

diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df['y'] = diabetes.target
X = df.drop('y', axis=1)
y = df['y']
plt.figure(figsize=(3.5,3.5))
plot_stratpd(X, y, 'bmi', 'y', n_trials=10)
plt.tight_layout()
plt.show()

diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df['sex'] = np.where(df['sex']<0, 0, 1)
df['y'] = diabetes.target
X = df.drop('y', axis=1)
y = df['y']

plt.figure(figsize=(3.5,3.5))
plot_catstratpd(X, y, 'sex', 'y',
                show_x_counts=False,
                n_trials=10,
                min_y_shifted_to_zero=True,
                catnames=['female','male'])  # not sure which is male/female actually!
plt.tight_layout()
plt.show()     

Comparing to PDP/ICE, SHAP, ALE, and StratPD plots

Plots of height versus weight using synthetic body weight data (see the academic paper). The PD/ICE plot on the left is biased by codependent features, since pregnant women, who are typically shorter than men, have a jump in weight.
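
To get a self-contained feel for this kind of codependence, here is a small synthetic generator in the same spirit; the data-generating process and coefficients below are made up for illustration and do not match the paper's generator:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from stratx.partdep import *

rng = np.random.default_rng(1)
n = 2000
sex = rng.integers(0, 2, n)                              # 0 = male, 1 = female
# Women are shorter on average; only women can be pregnant
height = np.where(sex == 1, rng.normal(162, 6, n), rng.normal(175, 7, n))
pregnant = ((sex == 1) & (rng.random(n) < 0.3)).astype(int)
# Weight rises with height, with a jump for pregnancy (illustrative coefficients)
weight = 0.9 * height - 80 + 12 * pregnant + rng.normal(0, 3, n)

X = pd.DataFrame({'height': height, 'sex': sex, 'pregnant': pregnant})
y = pd.Series(weight, name='weight')

plt.figure(figsize=(3.5, 3.5))
plot_stratpd(X, y, 'height', 'weight', n_trials=10)
plt.tight_layout()
plt.show()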
