
aerdem4 / Lofo Importance

License: MIT
Leave One Feature Out Importance

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Lofo Importance

Data visualization
A collection of my data visualizations, mostly in Python.
Stars: ✭ 294 (-5.16%)
Mutual labels:  data-science
Datascience Anthology Pydata
PyData, The Complete Works of
Stars: ✭ 301 (-2.9%)
Mutual labels:  data-science
Ai Learn
An AI learning roadmap compiling nearly 200 hands-on cases and projects, with free companion teaching materials, a zero-prerequisite introduction, and job-oriented practice. Covers Python, mathematics, machine learning, data analysis, deep learning, computer vision, natural language processing, and other popular areas. Topics: PyTorch tensorflow machine-learning deep-learning data-analysis data-mining mathematics data-science artificial-intelligence python tensorflow tensorflow2 caffe keras pytorch algorithm numpy pandas matplotlib seaborn nlp cv
Stars: ✭ 4,387 (+1315.16%)
Mutual labels:  data-science
Preql
An interpreted relational query language that compiles to SQL.
Stars: ✭ 257 (-17.1%)
Mutual labels:  data-science
Learning
Becoming better at data science every day
Stars: ✭ 4,659 (+1402.9%)
Mutual labels:  data-science
Python Seminar
Python for Data Science (Seminar Course at UC Berkeley; AY 250)
Stars: ✭ 302 (-2.58%)
Mutual labels:  data-science
Dagster
An orchestration platform for the development, production, and observation of data assets.
Stars: ✭ 4,099 (+1222.26%)
Mutual labels:  data-science
Apricot
apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly. See the documentation page: https://apricot-select.readthedocs.io/en/latest/index.html
Stars: ✭ 306 (-1.29%)
Mutual labels:  data-science
Scikit Learn Videos
Jupyter notebooks from the scikit-learn video series
Stars: ✭ 3,254 (+949.68%)
Mutual labels:  data-science
Cartola
Data extraction from the CartolaFC API, exploratory data analysis, and predictive models in R and Python - 2014-20. [EN] Data munging, analysis and modeling of CartolaFC - the most popular fantasy football game in Brazil and maybe in the world. Data cover years 2014-19.
Stars: ✭ 304 (-1.94%)
Mutual labels:  data-science
Tensorwatch
Debugging, monitoring and visualization for Python Machine Learning and Data Science
Stars: ✭ 3,191 (+929.35%)
Mutual labels:  data-science
Targets
Function-oriented Make-like declarative workflows for R
Stars: ✭ 293 (-5.48%)
Mutual labels:  data-science
Datasets
A repository of pretty cool datasets that I collected for network science and machine learning research.
Stars: ✭ 302 (-2.58%)
Mutual labels:  data-science
Sklearn Evaluation
Machine learning model evaluation made easy: plots, tables, HTML reports, experiment tracking and Jupyter notebook analysis.
Stars: ✭ 294 (-5.16%)
Mutual labels:  data-science
Xam
🎯 Personal data science and machine learning toolbox
Stars: ✭ 306 (-1.29%)
Mutual labels:  data-science
Autogluon
AutoGluon: AutoML for Text, Image, and Tabular Data
Stars: ✭ 3,920 (+1164.52%)
Mutual labels:  data-science
Pydataroad
open source for wechat-official-account (ID: PyDataLab)
Stars: ✭ 302 (-2.58%)
Mutual labels:  data-science
Erlemar.github.io
Data science portfolio
Stars: ✭ 309 (-0.32%)
Mutual labels:  data-science
Elixir Scrape
Scrape any website, article or RSS/Atom Feed with ease!
Stars: ✭ 306 (-1.29%)
Mutual labels:  data-science
120 Ds Interview Questions
My Answer to 120 Data Science Interview Questions
Stars: ✭ 304 (-1.94%)
Mutual labels:  data-science


LOFO (Leave One Feature Out) Importance calculates the importance of a set of features for a model, metric, and validation scheme of your choice, by iteratively removing each feature from the set and evaluating the model's performance without it.

LOFO first evaluates the performance of the model with all input features included, then removes one feature at a time, retrains the model, and evaluates its performance on a validation set. The mean and standard deviation (across the folds) of each feature's importance are then reported.

If no model is passed to LOFO Importance, it runs LightGBM as the default model.
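
For intuition, the whole procedure fits in a few lines. Below is a minimal, hand-rolled sketch of the leave-one-feature-out loop (illustrative only, not the package's internal code; model, X, y and cv are hypothetical stand-ins for an estimator, a feature DataFrame, a target vector and a CV splitter):

import numpy as np
from sklearn.model_selection import cross_val_score

def lofo_sketch(model, X, y, cv, scoring="roc_auc"):
    # baseline: cross-validated scores with all features included
    base_scores = cross_val_score(model, X, y, cv=cv, scoring=scoring)
    importances = {}
    for feature in X.columns:
        # drop one feature, then refit and re-evaluate on every fold
        scores = cross_val_score(model, X.drop(columns=[feature]), y, cv=cv, scoring=scoring)
        diffs = base_scores - scores  # positive => the feature helped
        importances[feature] = (diffs.mean(), diffs.std())
    return importances

The package parallelises this work and adds features such as feature groups, but the statistics it reports are the same: the mean and standard deviation of the score difference across folds.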

Install

LOFO Importance can be installed using

pip install lofo-importance

Advantages of LOFO Importance

LOFO has several advantages compared to other importance types:

  • It does not favor granular features
  • It generalises well to unseen test sets
  • It is model agnostic
  • It gives negative importance to features that hurt performance upon inclusion
  • It can group features, which is especially useful for high dimensional feature sets such as TF-IDF or one-hot-encoded (OHE) features.

Example on Kaggle's Microsoft Malware Prediction Competition

In this Kaggle competition, Microsoft provides a malware dataset, and the task is to predict whether or not a machine will soon be hit with malware. One of the features, Census_OSVersion, is very predictive on the training set, probably because some OS versions are more prone to bugs and failures than others. However, when the data is split out of time, the validation sets contain OS versions that never occur in the training set, so the model cannot have learned the relationship between this seasonal feature and the target. Importance types evaluated on the training set alone still rate Census_OSVersion highly. LOFO Importance, by contrast, depends on a validation scheme, so it gives this feature not just low importance but negative importance.

import pandas as pd
from sklearn.model_selection import KFold
from lofo import LOFOImportance, Dataset, plot_importance
%matplotlib inline

# import data (dtypes is a column-name -> dtype mapping defined elsewhere)
train_df = pd.read_csv("../input/train.csv", dtype=dtypes)

# extract a 1% sample of the data and sort it in time
sample_df = train_df.sample(frac=0.01, random_state=0)
sample_df.sort_values("AvSigVersion", inplace=True)

# define the validation scheme: unshuffled folds preserve the time order
cv = KFold(n_splits=4, shuffle=False)

# define the binary target and the features
target = "HasDetections"
dataset = Dataset(df=sample_df, target=target,
                  features=[col for col in train_df.columns if col != target])

# define the validation scheme and scorer. The default model is LightGBM
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="roc_auc")

# get the mean and standard deviation of the importances in pandas format
importance_df = lofo_imp.get_importance()

# plot the means and standard deviations of the importances
plot_importance(importance_df, figsize=(12, 20))

[Plot: mean and standard deviation of the LOFO importance of each feature]

Another Example: Kaggle's TReNDS Competition

In this Kaggle competition, participants are asked to predict several cognitive properties of patients. Independent component (IC) features from sMRI and very high dimensional correlation features (FNC) from 3D fMRI scans are provided. LOFO can group all the fMRI correlation features into a single feature group.

from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from lofo import LOFOImportance, Dataset, plot_importance

def get_lofo_importance(target):
    cv = KFold(n_splits=7, shuffle=True, random_state=17)

    # df, loading_features and fnc_features come from the competition data
    sub_df = df[df[target].notnull()]
    dataset = Dataset(df=sub_df, target=target, features=loading_features,
                      feature_groups={"fnc": sub_df[fnc_features].values})

    model = Ridge(alpha=0.01)
    lofo_imp = LOFOImportance(dataset, cv=cv, scoring="neg_mean_absolute_error", model=model)

    return lofo_imp.get_importance()

plot_importance(get_lofo_importance(target="domain1_var1"), figsize=(8, 8), kind="box")

[Plot: box plot of the LOFO importances for domain1_var1]

FLOFO Importance

If running LOFO Importance is too time-consuming for you, you can use Fast LOFO (FLOFO). FLOFO takes an already trained model and a validation set as inputs, applies a pseudo-random permutation to the values of each feature, one feature at a time, and then uses the trained model to make predictions on the validation set. The FLOFO importance of a feature is the resulting drop in validation performance, averaged over several randomised permutations. FLOFO differs from plain permutation importance in that the permutations of a feature's values are done within groups, where the groups are obtained by partitioning the validation set on k=2 other features. These k features are chosen at random n=10 times, and the mean and standard deviation of the FLOFO importance are calculated over these n runs. This grouping improves the importance measure because permuting a feature's values is no longer completely random: the permutations happen within groups of similar samples, so they amount to noising the samples (a minimal sketch of this grouped permutation follows the list below). This ensures that:

  • A feature's values are very unlikely to be replaced by unrealistic values during permutation.
  • A feature that is predictable from the chosen n*k grouping features will be replaced by very similar values during permutation, so it will only slightly affect model performance and will receive a small FLOFO importance. This solves the correlated-feature overestimation problem.
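
As a rough illustration of the within-group permutation idea, here is a hand-rolled sketch (not the package's implementation; model, valid_df, features and target are hypothetical stand-ins for a fitted binary classifier and a validation frame):

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def grouped_permutation_importance(model, valid_df, features, target, n_runs=10, k=2, seed=0):
    rng = np.random.default_rng(seed)
    # baseline score of the already trained model on the validation set
    base = roc_auc_score(valid_df[target], model.predict_proba(valid_df[features])[:, 1])
    drops = {f: [] for f in features}

    for _ in range(n_runs):
        # pick k features at random and bin them to define groups of similar samples
        group_cols = rng.choice(features, size=k, replace=False)
        keys = [pd.qcut(valid_df[c].rank(method="first"), q=5, labels=False) for c in group_cols]
        for f in features:
            permuted = valid_df.copy()
            # shuffle the feature's values only within each group
            permuted[f] = permuted.groupby(keys)[f].transform(lambda s: rng.permutation(s.values))
            score = roc_auc_score(permuted[target], model.predict_proba(permuted[features])[:, 1])
            drops[f].append(base - score)  # positive => the feature mattered

    return pd.DataFrame({"feature": features,
                         "importance_mean": [np.mean(drops[f]) for f in features],
                         "importance_std": [np.std(drops[f]) for f in features]})

Because the grouping features are redrawn on every run, the reported standard deviation also reflects how sensitive a feature's importance is to the choice of grouping.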