data-science-lab-amsterdam / skippa

Licence: other
SciKIt-learn Pipeline in PAndas

Programming Languages

python
139335 projects - #7 most used programming language
Makefile
30231 projects

Projects that are alternatives of or similar to skippa

Pdpipe
Easy pipelines for pandas DataFrames.
Stars: ✭ 590 (+1687.88%)
Mutual labels:  pipeline, pandas-dataframe, pandas
Data Science Projects With Python
A Case Study Approach to Successful Data Science Projects Using Python, Pandas, and Scikit-Learn
Stars: ✭ 198 (+500%)
Mutual labels:  pandas-dataframe, scikit-learn, pandas
Igel
a delightful machine learning tool that allows you to train, test, and use models without writing code
Stars: ✭ 2,956 (+8857.58%)
Mutual labels:  scikit-learn, sklearn, preprocessing
Data Science Complete Tutorial
For extensive instructor-led learning
Stars: ✭ 1,027 (+3012.12%)
Mutual labels:  pipeline, scikit-learn, pandas
ml-workflow-automation
Python Machine Learning (ML) project that demonstrates the archetypal ML workflow within a Jupyter notebook, with automated model deployment as a RESTful service on Kubernetes.
Stars: ✭ 44 (+33.33%)
Mutual labels:  sklearn, pandas
SeqTools
A python library to manipulate and transform indexable data (lists, arrays, ...)
Stars: ✭ 42 (+27.27%)
Mutual labels:  pipeline, preprocessing
DS-Cookbook101
A Jupyter notebook with the most frequently used code snippets for daily data science operations
Stars: ✭ 59 (+78.79%)
Mutual labels:  scikit-learn, pandas
Kaio-machine-learning-human-face-detection
A machine learning case study focused on interaction with digital characters. It uses a character called "Kaio" that, based on automatic detection of facial expressions and classification of emotions, interacts with humans by classifying their emotions and imitating their expressions
Stars: ✭ 18 (-45.45%)
Mutual labels:  scikit-learn, sklearn
datascienv
datascienv is a package that helps you set up your environment in a single line of code with all dependencies. It also includes pyforest, which imports all required ML libraries in a single line
Stars: ✭ 53 (+60.61%)
Mutual labels:  scikit-learn, pandas
five-minute-midas
Predicting Profitable Day Trading Positions using Decision Tree Classifiers. scikit-learn | Flask | SQLite3 | pandas | MLflow | Heroku | Streamlit
Stars: ✭ 41 (+24.24%)
Mutual labels:  scikit-learn, pandas
google classroom
Google Classroom Data Pipeline
Stars: ✭ 17 (-48.48%)
Mutual labels:  pipeline, pandas
xpandas
Universal 1d/2d data containers with Transformers functionality for data analysis.
Stars: ✭ 25 (-24.24%)
Mutual labels:  sklearn, pandas
cracking-the-pandas-cheat-sheet
Inflearn - Mastering data analysis and visualization with just two pages of documentation
Stars: ✭ 62 (+87.88%)
Mutual labels:  pandas-dataframe, pandas
dataquest-guided-projects-solutions
My dataquest project solutions
Stars: ✭ 35 (+6.06%)
Mutual labels:  scikit-learn, pandas
introduction to ml with python
Jupyter notebooks and code for the book "[Revised Edition] Machine Learning with Python Libraries".
Stars: ✭ 211 (+539.39%)
Mutual labels:  scikit-learn, pandas
machine-learning-capstone-project
This is the final project for the Udacity Machine Learning Nanodegree: Predicting article retweets and likes based on the title using Machine Learning
Stars: ✭ 28 (-15.15%)
Mutual labels:  scikit-learn, pandas
sparklanes
A lightweight data processing framework for Apache Spark
Stars: ✭ 17 (-48.48%)
Mutual labels:  pipeline, preprocessing
A-Detector
⭐ An anomaly-based intrusion detection system.
Stars: ✭ 69 (+109.09%)
Mutual labels:  scikit-learn, pandas
sklearn-audio-classification
An in-depth analysis of audio classification on the RAVDESS dataset. Feature engineering, hyperparameter optimization, model evaluation, and cross-validation with a variety of ML techniques and MLP
Stars: ✭ 31 (-6.06%)
Mutual labels:  scikit-learn, sklearn
Algorithmic-Trading
I have been deeply interested in algorithmic trading and systematic trading algorithms. This repository contains the code for what I have learnt along the way. It starts from some basic statistics and leads up to complex machine learning algorithms.
Stars: ✭ 47 (+42.42%)
Mutual labels:  pandas-dataframe, pandas


Skippa

SciKIt-learn Pre-processing Pipeline in PAndas

Read more in the introductory blog post on Towards Data Science.

Want to create a machine learning model using pandas & scikit-learn? This should make your life easier.

Skippa helps you to easily create a pre-processing and modeling pipeline, based on scikit-learn transformers but preserving pandas dataframe format throughout all pre-processing. This makes it a lot easier to define a series of subsequent transformation steps, while referring to columns in your intermediate dataframe.

So basically the same idea as scikit-pandas, but a different (and hopefully better) way to achieve it.
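
To make the motivation concrete, here is a minimal sketch of the underlying problem and of the general idea of a DataFrame-preserving step. It uses only pandas and scikit-learn; the `FrameScaler` class is a hypothetical illustration and is not part of Skippa, whose internals may work differently.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'y': [1.0, 16.0, 1000.0],
    'z': [0.4, 2.0, 8.7],
})

# With default settings a scikit-learn transformer returns a NumPy array,
# so column names are lost after the first step.
raw = StandardScaler().fit_transform(df[['y', 'z']])


class FrameScaler:
    """Hypothetical wrapper (not Skippa code): scale selected columns,
    return a DataFrame with the same columns as the input."""

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        self.scaler_ = StandardScaler().fit(X[self.cols])
        return self

    def transform(self, X):
        out = X.copy()  # leave the caller's frame untouched
        out[self.cols] = self.scaler_.transform(X[self.cols])
        return out


scaled = FrameScaler(['y', 'z']).fit(df).transform(df)
print(type(raw).__name__)     # ndarray
print(list(scaled.columns))   # ['x', 'y', 'z']
```

Because the output is still a DataFrame, a later step can keep referring to columns by name or dtype, which is the property the paragraph above describes.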

Installation

pip install skippa

Optional, if you want to use the gradio app functionality:

pip install skippa[gradio]

Basic usage

Import Skippa class and columns helper function

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

from skippa import Skippa, columns

Get some data

df = pd.DataFrame({
    'q': [0, 0, 0],
    'date': ['2021-11-29', '2021-12-01', '2021-12-03'],
    'x': ['a', 'b', 'c'],
    'x2': ['m', 'n', 'm'],
    'y': [1, 16, 1000],
    'z': [0.4, None, 8.7]
})
y = np.array([0, 0, 1])

Define your pipeline:

pipe = (
    Skippa()
        .select(columns(['x', 'x2', 'y', 'z']))
        .cast(columns(['x', 'x2']), 'category')
        .impute(columns(dtype_include='number'), strategy='median')
        .impute(columns(dtype_include='category'), strategy='most_frequent')
        .scale(columns(dtype_include='number'), type='standard')
        .onehot(columns(['x', 'x2']))
        .model(LogisticRegression())
)

and use it for fitting / predicting like this:

pipe.fit(X=df, y=y)

predictions = pipe.predict_proba(df)
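
For comparison, the sketch below shows one way to build a roughly comparable pipeline in plain scikit-learn, using `ColumnTransformer` and nested `Pipeline` objects. It is an approximation for illustration only: the step names are arbitrary, and it is not claimed to match what Skippa does internally or to produce identical numbers.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    'q': [0, 0, 0],
    'date': ['2021-11-29', '2021-12-01', '2021-12-03'],
    'x': ['a', 'b', 'c'],
    'x2': ['m', 'n', 'm'],
    'y': [1, 16, 1000],
    'z': [0.4, None, 8.7],
})
y = np.array([0, 0, 1])

# Numeric columns: median imputation, then standard scaling.
numeric = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Categorical columns: most-frequent imputation, then one-hot encoding.
categorical = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Columns not listed here ('q', 'date') are dropped by default,
# which plays the role of the select step.
preprocess = ColumnTransformer([
    ('num', numeric, ['y', 'z']),
    ('cat', categorical, ['x', 'x2']),
])

sk_pipe = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression()),
])

sk_pipe.fit(df, y)
proba = sk_pipe.predict_proba(df)
print(proba.shape)  # (3, 2)
```

Note that the column lists have to be repeated per branch here, and the intermediate data is an array rather than a DataFrame.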

If you want details on your model, use:

model = pipe.get_model()
print(model.coef_)
print(model.intercept_)

(De)serialization

You can also save and load your model pipelines, for example for deployment. N.B. dill is used for serialization and deserialization because joblib and pickle don't provide enough support.

pipe.save('./models/my_skippa_model_pipeline.dill')

...

my_pipeline = Skippa.load_pipeline('./models/my_skippa_model_pipeline.dill')
predictions = my_pipeline.predict(df_new_data)
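
The README does not say which limitation of pickle motivated the choice of dill. One well-known limitation is that the standard library `pickle` cannot serialize lambda functions, while dill can. The sketch below illustrates that difference under this assumption; the `step` dictionary is a hypothetical stand-in for a pipeline step and is not a Skippa object.

```python
import pickle

# Hypothetical stand-in for a pipeline step that holds a user-supplied lambda.
step = {'name': 'double', 'func': lambda value: value * 2}

try:
    pickle.dumps(step)
    pickled_ok = True
except (pickle.PicklingError, AttributeError) as exc:
    # Standard pickle stores functions by name and cannot look up a lambda.
    pickled_ok = False
    print(f'pickle failed: {exc}')

try:
    import dill  # third-party; installed as a Skippa dependency
except ImportError:
    dill = None

if dill is not None:
    # dill serializes the function body itself, so the round trip works.
    restored = dill.loads(dill.dumps(step))
    print(restored['func'](21))  # 42
```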

See the ./examples directory for more examples.

To Do

  • Support pandas assign for creating new columns based on existing columns
  • Support cast / astype transformer
  • Support for .apply transformer: wrapper around pandas.DataFrame.apply
  • Check how GridSearch (or other param search) works with Skippa
  • Add a method to inspect a fitted pipeline/model by creating a Gradio app defining raw features input and model output
  • Support PCA transformer
  • Facilitate random seed in Skippa object that is dispatched to all downstream operations
  • fit-transform is evaluated lazily, so casting to category and then selecting category columns doesn't work; each fit/transform should operate on the expected output state of the previous transformer, rather than on the original dataframe
  • Investigate if Skippa can directly extend sklearn's Pipeline
  • Validation of pipeline steps
  • Input validation in transformers
  • OneHotEncoder: limit to maximum nr. of different values (n most frequent ones)
  • Transformer for replacing values (pandas .replace)
  • Support arbitrary transformer (if column-preserving)
  • Eliminate the need to call columns explicitly
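
The lazy-evaluation item in the list above can be illustrated in plain pandas. This is only a sketch of the general issue (a dtype-based column selection resolved against the original frame versus the intermediate one); it does not reproduce Skippa's internals, and the variable names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'y': [1, 16, 1000],
})

# Selection resolved too early, against the original dataframe:
# 'x' is not yet a category column, so nothing matches.
early = df.select_dtypes(include='category').columns.tolist()

# Selection resolved after the cast step has actually been applied.
casted = df.astype({'x': 'category'})
late = casted.select_dtypes(include='category').columns.tolist()

print(early)  # []
print(late)   # ['x']
```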
