BCG-Gamma / sklearndf

License: Apache-2.0
DataFrame support for scikit-learn.

sklearndf is an open source library designed to address a common need with scikit-learn: the outputs of transformers are numpy arrays, even when the input is a data frame. However, to inspect a model it is essential to keep track of the feature names.

To this end, sklearndf enhances scikit-learn's estimators as follows:

  • Preserve data frame structure: Return data frames as results of transformations, preserving feature names as the column index.
  • Feature name tracing: Add estimator properties that trace a feature name back to its original input feature; this is especially useful for transformers that create new features (e.g., one-hot encoding), and for pipelines that include such transformers.
  • Easy use: Simply append DF at the end of your usual scikit-learn class names to get enhanced data frame support!
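
The motivating problem can be reproduced with plain scikit-learn and pandas. The snippet below is a minimal sketch that assumes scikit-learn's default output configuration and uses a made-up toy frame; it does not use sklearndf itself, and re-attaching the column names by hand is only feasible here because the imputer keeps the columns unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data frame (illustrative values) with named columns and missing entries
df = pd.DataFrame(
    {"age": [29.0, np.nan, 2.0], "fare": [211.34, 151.55, np.nan]}
)

# With default settings, a native scikit-learn transformer returns a numpy
# array, so the column names "age" and "fare" are no longer attached
imputed = SimpleImputer(strategy="median").fit_transform(df)
print(type(imputed).__name__)

# Manual workaround: rebuild the frame by re-attaching names and index
restored = pd.DataFrame(imputed, columns=df.columns, index=df.index)
print(restored)
```

For transformers that add or drop columns, such as one-hot encoders, this manual bookkeeping becomes error-prone, which is the gap the DF classes are meant to close.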

The following quickstart guide provides a minimal example workflow to get up and running with sklearndf. For additional tutorials and the API reference, see the sklearndf documentation. Changes and additions in new versions are summarized in the release notes.

Installation

sklearndf can be installed from both PyPI and Anaconda. We recommend installing sklearndf into a dedicated environment.

Anaconda

conda create -n sklearndf
conda activate sklearndf
conda install -c bcg_gamma -c conda-forge sklearndf

Pip

macOS and Linux:

python -m venv sklearndf
source sklearndf/bin/activate
pip install sklearndf

Windows:

python -m venv sklearndf
sklearndf\Scripts\activate.bat
pip install sklearndf
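
After installing by either route, a quick check from within the activated environment can confirm that the package is visible to Python. The following sketch uses only the standard library; the helper name installed_version is ours, chosen for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if sklearndf is installed in this environment,
# otherwise prints None
print(f"sklearndf version: {installed_version('sklearndf')}")
```

If this prints None, the most likely cause is that the environment created above is not the one currently activated.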

Quickstart

Creating a DataFrame-friendly scikit-learn preprocessing pipeline

The titanic data set includes categorical features such as class and sex, and also has missing values for numeric features (e.g., age) and categorical features (e.g., embarked). The aim is to predict whether or not a passenger survived. A standard scikit-learn example for this dataset is available in the scikit-learn documentation.

We will build a preprocessing pipeline which:

  • for categorical variables fills missing values with the string 'Unknown' and then one-hot encodes
  • for numerical values fills missing values using median values
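
To make these two steps concrete, here is a small pandas-only sketch on a made-up toy frame. It is an approximation of what the steps do to the data, not the sklearndf pipeline itself: pd.get_dummies stands in for the one-hot encoder and fillna for the imputers.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values) with one categorical and one numeric column
toy = pd.DataFrame(
    {
        "embarked": ["S", None, "C"],
        "age": [29.0, np.nan, 2.0],
    }
)

# Categorical: fill missing values with 'Unknown', then one-hot encode
categorical = pd.get_dummies(toy[["embarked"]].fillna("Unknown"))

# Numeric: fill missing values with the column median
numeric = toy[["age"]].fillna(toy[["age"]].median())

# Combine both parts; column names stay attached throughout
result = pd.concat([categorical, numeric], axis=1)
print(result)
```

Unlike this one-off sketch, a fitted pipeline remembers the categories and medians learned from the training data and applies them consistently to new data.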

The strength of sklearndf is that it maintains the scikit-learn conventions and expressiveness while also preserving data frames, and hence feature names. We can see this after using fit_transform on our preprocessing pipeline.

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# relevant sklearndf imports
from sklearndf.transformation import (
    ColumnTransformerDF,
    OneHotEncoderDF,
    SimpleImputerDF,
)
from sklearndf.pipeline import (
    PipelineDF,
    ClassifierPipelineDF,
)
from sklearndf.classification import RandomForestClassifierDF

# load titanic data
titanic_X, titanic_y = fetch_openml(
    "titanic", version=1, as_frame=True, return_X_y=True
)

# select features
numerical_features = ['age', 'fare']
categorical_features = ['embarked', 'sex', 'pclass']

# create a preprocessing pipeline
preprocessing_numeric_df = SimpleImputerDF(strategy="median")

preprocessing_categorical_df = PipelineDF(
    steps=[
        ('imputer', SimpleImputerDF(strategy='constant', fill_value='Unknown')),
        ('one-hot', OneHotEncoderDF(sparse=False, handle_unknown="ignore")),
    ]
)

preprocessing_df = ColumnTransformerDF(
    transformers=[
        ('categorical', preprocessing_categorical_df, categorical_features),
        ('numeric', preprocessing_numeric_df, numerical_features),
    ]
)

# run preprocessing
transformed_df = preprocessing_df.fit_transform(X=titanic_X, y=titanic_y)
transformed_df.head()
feature_out  embarked_C  embarked_Q  pclass_3.0     age    fare
0                     0           0           0      29  211.34
1                     0           0           0  0.9167  151.55
2                     0           0           0       2  151.55
3                     0           0           0      30  151.55
4                     0           0           0      25  151.55

Tracing features from post-transform to original

The sklearndf pipeline has a feature_names_original_ attribute which returns a pandas Series, mapping the output column names (the series' index) to the input column names (the series' values). We can therefore easily select all output features generated from a given input feature, in this case embarked.

embarked_type_derivatives = preprocessing_df.feature_names_original_ == "embarked"
transformed_df.loc[:, embarked_type_derivatives].head()
feature_out  embarked_C  embarked_Q  embarked_S  embarked_Unknown
0                   0.0         0.0         1.0               0.0
1                   0.0         0.0         1.0               0.0
2                   0.0         0.0         1.0               0.0
3                   0.0         0.0         1.0               0.0
4                   0.0         0.0         1.0               0.0
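
The mechanics of this selection can be tried without fitting anything. The sketch below builds a stand-in for such a mapping by hand as a pandas Series; the column names and the index and series names are illustrative assumptions, not output produced by sklearndf.

```python
import pandas as pd

# Hand-built stand-in for an output-to-input feature mapping:
# index = output column names, values = original input column names
feature_names_original = pd.Series(
    data=["embarked", "embarked", "sex", "sex", "age"],
    index=pd.Index(
        ["embarked_C", "embarked_S", "sex_female", "sex_male", "age"],
        name="feature_out",
    ),
    name="feature_in",
)

# Boolean mask over output columns, as in the example above
is_embarked = feature_names_original == "embarked"
embarked_outputs = feature_names_original[is_embarked].index.tolist()
print(embarked_outputs)

# Inverting the mapping: group the output columns by their input feature
outputs_by_input = {
    feature_in: group.index.tolist()
    for feature_in, group in feature_names_original.groupby(
        feature_names_original, sort=False
    )
}
print(outputs_by_input)
```

Because the mask is aligned with the output columns, it can be passed directly to .loc on the transformed data frame, which is what the example above does.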

Completing the pipeline with a classifier

Scikit-learn regressors and classifiers have a sklearndf sibling obtained by appending DF to the class name; the API of the native estimators is preserved. The results of predict and decision functions are returned as a pandas Series (single output) or DataFrame (class probabilities or multi-output).

We can combine the preprocessing pipeline above with a classifier to create a full predictive pipeline. sklearndf provides two useful, specialised pipeline objects for this, RegressorPipelineDF and ClassifierPipelineDF. Both implement a special two-step pipeline with one preprocessing step and one prediction step, while staying compatible with the general sklearn pipeline idiom.

Using ClassifierPipelineDF we can combine the preprocessing pipeline with RandomForestClassifierDF to fit a model to a selected training set and then score on a test set.

# create full pipeline
pipeline_df = ClassifierPipelineDF(
    preprocessing=preprocessing_df,
    classifier=RandomForestClassifierDF(
        n_estimators=1000,
        max_features=2/3,
        max_depth=7,
        random_state=42,
        n_jobs=-3,
    )
)

# split data and then fit and score random forest classifier
df_train, df_test, y_train, y_test = train_test_split(
    titanic_X, titanic_y, random_state=42
)
pipeline_df.fit(df_train, y_train)
print(f"model score: {pipeline_df.score(df_test, y_test):.2f}")

model score: 0.79

Contributing

sklearndf is stable and is being supported long-term.

Contributions to sklearndf are welcome and appreciated. For bug reports and feature requests or enhancements, please use the appropriate GitHub form; if you wish, you can also open a PR addressing the issue.

For any major changes, we ask that you discuss them with us first via an issue.

For further information on contributing please see our contribution guide.

License

sklearndf is licensed under Apache 2.0 as described in the LICENSE file.

Acknowledgements

sklearndf builds on the learners and pipelines of the popular machine learning package scikit-learn, which provide the basis for the corresponding sklearndf implementations.

BCG GAMMA

We are always on the lookout for passionate and talented data scientists to join the BCG GAMMA team. If you would like to know more you can find out about BCG GAMMA, or have a look at career opportunities.
