All Projects → neuro-ml → reskit

neuro-ml / reskit

Licence: BSD-3-Clause license
A library for creating and curating reproducible pipelines for scientific and industrial machine learning

Programming Languages

Jupyter Notebook
11667 projects
python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to reskit

open-solution-googleai-object-detection
Open solution to the Google AI Object Detection Challenge 🍁
Stars: ✭ 46 (+70.37%)
Mutual labels:  pipeline, reproducible-research, reproducible-experiments
Neuraxle
A Sklearn-like Framework for Hyperparameter Tuning and AutoML in Deep Learning projects. Finally have the right abstractions and design patterns to properly do AutoML. Let your pipeline steps have hyperparameter spaces. Enable checkpoints to cut duplicate calculations. Go from research to production environment easily.
Stars: ✭ 377 (+1296.3%)
Mutual labels:  pipeline, scikit-learn
Targets
Function-oriented Make-like declarative workflows for R
Stars: ✭ 293 (+985.19%)
Mutual labels:  pipeline, reproducible-research
Drake Examples
Example workflows for the drake R package
Stars: ✭ 57 (+111.11%)
Mutual labels:  pipeline, reproducible-research
targets-minimal
A minimal example data analysis project with the targets R package
Stars: ✭ 50 (+85.19%)
Mutual labels:  pipeline, reproducible-research
bacannot
Generic but comprehensive pipeline for prokaryotic genome annotation and interrogation with interactive reports and shiny app.
Stars: ✭ 51 (+88.89%)
Mutual labels:  pipeline, reproducible-research
Data Science Complete Tutorial
For extensive instructor led learning
Stars: ✭ 1,027 (+3703.7%)
Mutual labels:  pipeline, scikit-learn
Steppy Toolkit
Curated set of transformers that make your work with steppy faster and more effective 🔭
Stars: ✭ 21 (-22.22%)
Mutual labels:  pipeline, reproducible-research
Nextflow
A DSL for data-driven computational pipelines
Stars: ✭ 1,337 (+4851.85%)
Mutual labels:  pipeline, reproducible-research
Steppy
Lightweight, Python library for fast and reproducible experimentation 🔬
Stars: ✭ 119 (+340.74%)
Mutual labels:  pipeline, reproducible-research
Sarek
Detect germline or somatic variants from normal or tumour/normal whole-genome or targeted sequencing
Stars: ✭ 124 (+359.26%)
Mutual labels:  pipeline, reproducible-research
Machine-learning-toolkits-with-python
Machine learning toolkits with Python
Stars: ✭ 31 (+14.81%)
Mutual labels:  scikit-learn, grid-search
openscience
Empirical Software Engineering journal (EMSE) open science and reproducible research initiative
Stars: ✭ 28 (+3.7%)
Mutual labels:  reproducible-research, reproducible-experiments
skippa
SciKIt-learn Pipeline in PAndas
Stars: ✭ 33 (+22.22%)
Mutual labels:  pipeline, scikit-learn
bumblebee
🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)
Stars: ✭ 120 (+344.44%)
Mutual labels:  data-preparation, prepare-data
Drake
An R-focused pipeline toolkit for reproducibility and high-performance computing
Stars: ✭ 1,301 (+4718.52%)
Mutual labels:  pipeline, reproducible-research
targets-tutorial
Short course on the targets R package
Stars: ✭ 87 (+222.22%)
Mutual labels:  pipeline, reproducible-research
ngs-preprocess
A pipeline for preprocessing NGS data from Illumina, Nanopore and PacBio technologies
Stars: ✭ 22 (-18.52%)
Mutual labels:  pipeline, reproducible-research
scikit-learn-mooc
Machine learning in Python with scikit-learn MOOC
Stars: ✭ 783 (+2800%)
Mutual labels:  scikit-learn
Cubist
A Python package for fitting Quinlan's Cubist regression model
Stars: ✭ 22 (-18.52%)
Mutual labels:  scikit-learn

Reskit

Documentation Status Join the chat at https://gitter.im/ResearcherKit/Lobby Build Status codecov

Reskit (researcher’s kit) is a library for creating and curating reproducible pipelines for scientific and industrial machine learning. The natural extension of the scikit-learn Pipelines to general classes of pipelines, Reskit allows for the efficient and transparent optimization of each pipeline step. Main features include data caching, compatibility with most of the scikit-learn objects, optimization constraints (e.g. forbidden combinations), and table generation for quality metrics. Reskit also allows for the injection of custom metrics into the underlying scikit frameworks. Reskit is intended for use by researchers who need pipelines amenable to versioning and reproducibility, yet who also have a large volume of experiments to run.

Features

  • Ability to combine pipelines with an equal number of steps in list of experiments, running them and returning results in a convenient format for human consumption (Pandas dataframe).

  • Step caching. Standard SciKit-learn pipelines cannot cache temporary steps. Reskit includes the option to save fixed steps, so in next pipeline specified steps won’t be recalculated.

  • Forbidden combination constraints. Not all possible combinations of pipelines are viable or meaningfully different. For example, in a classification task comparing the performance of logistic regression and decision trees the former requires feature scaling while the latter may not. In this case you can block the unnecessary pair. Reskit supports general tuple blocking as well.

  • Full compatibility with scikit-learn objects. Reskit can use any scikit-learn data transforming object and/or predictive model, and assumably many other libraries that uses the scikit template.

  • Evaluation of multiple performance metrics simultaneously. Evaluation is simply another step in the pipeline, so we can specify a number of possible evaluation metrics and Reskit will expand out the computations for each metric for each pipeline.

  • The DataTransformer class, which is Reskit’s simplfied interface for specifying fit/transform methods in pipeline steps. A DataTransformer subclass need only specify one function.

  • Tools for learning on graphs. Due to our original motivations Reskit includes a number of operations for network data. In particular, it allows a variety of normalization choices foradjacency matrices , as well as built in local graph metric calculations. These were implemented using DataTransformer and in some cases the BCTpy (the Brain Connectivity Toolbox python version)

Documentation

The documentation includes detailed tutorial, but for a quick view, we provide for you an example.

Example

Let's say we want to prepare data and try some scalers and classifiers for prediction in a classification problem. We will tune paramaters of classifiers by grid search technique.

Data preparing:

from sklearn.datasets import make_classification


X, y = make_classification()

Setting steps for our pipelines and parameters for grid search:

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

scalers = [
    ('minmax', MinMaxScaler()),
    ('standard', StandardScaler())
]

classifiers = [
    ('LR', LogisticRegression()),
    ('SVC', SVC())
]

steps = [
    ('Scaler', scalers),
    ('Classifier', classifiers)
]

param_grid = {
    'LR' : {
        'penalty' : ['l1', 'l2']},
    'SVC' : {
        'kernel' : ['linear', 'poly', 'rbf', 'sigmoid']}}

Setting a cross-validation for grid searching of hyperparameters and for evaluation of models with obtained hyperparameters.

from sklearn.model_selection import StratifiedKFold


grid_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
eval_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

Creating a plan of our research:

pipeliner = Pipeliner(steps=steps, grid_cv=grid_cv, eval_cv=eval_cv, param_grid=param_grid)
pipeliner.plan_table
scaler classifier
0 standard LR
1 standard SVC
2 minmax LR
3 minmax SVC

To tune parameters of models and evaluate this models, run:

pipeliner.get_results(X, y, scoring='roc_auc')
Line: 1/4
Line: 2/4
Line: 3/4
Line: 4/4
scaler classifier grid_roc_auc_mean grid_roc_auc_std grid_roc_auc_best_params eval_roc_auc_mean eval_roc_auc_std eval_roc_auc_scores
0 standard LR 0.956 0.0338230690506 {'penalty': 'l1'} 0.968 0.0324961536185 [ 0.92 1. 1. 0.94 0.98]
1 standard SVC 0.962 0.0278567765544 {'kernel': 'poly'} 0.976 0.0300665927567 [ 0.95 1. 1. 0.93 1. ]
2 minmax LR 0.964 0.0412795348811 {'penalty': 'l1'} 0.966 0.0377359245282 [ 0.92 1. 1. 0.92 0.99]
3 minmax SVC 0.958 0.0411825205639 {'kernel': 'rbf'} 0.962 0.0401995024845 [ 0.93 1. 1. 0.9 0.98]

Installation

Reskit currently requires Python 3.4 or later to run. Please install Python and pip via the package manager of your operating system if it is not included already.

Reskit depends on:

  • numpy
  • scikit-learn
  • pandas
  • scipy
  • python-igraph
  • networkx

If you don't want to use graph metrics, you should to comment scipy, python-igraph and networkx in requirements.txt.

To install dependencies run next command:

pip install -r https://raw.githubusercontent.com/neuro-ml/reskit/master/requirements.txt

Last stable version is 0.1.0. You can install it via:

pip3 install -U https://github.com/neuro-ml/reskit/archive/0.1.0.zip

To install the latest development version of Reskit, run the following commands:

pip install -U https://github.com/neuro-ml/reskit/archive/master.zip

Docker

If you just want to try Reskit or don’t want to install Python, you can build docker image and make all reskit’s stuff there. Also, in this case, you can provide the simple way to reproduce your experiment. To run Reskit in docker you can use next commands.

  1. Clone repository:

    git clone https://github.com/neuro-ml/reskit.git
    cd reskit
  2. Build docker image:

    docker build -t docker-reskit -f Dockerfile .
  3. Run docker image.

  • If you want to run bash in container:

    docker run -it docker-reskit bash
  • If you want to run bash in container with shared directory:

    docker run -v $PWD/scripts:/reskit/scripts -it docker-reskit bash
  • If you want to start Jupyter Notebook server at http://localhost:8809 in container:

    docker run -v $PWD/scripts:/reskit/scripts -it -p 8809:8809 docker-reskit jupyter notebook --no-browser --ip="*" --allow-root --port 8809

    You will see message:

      Copy/paste this URL into your browser when you connect for the first time,
      to login with a token:
        http://localhost:8809/?token=some_token
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].