
alegonz / Baikal

Licence: bsd-3-clause
A graph-based functional API for building complex scikit-learn pipelines.



baikal

A graph-based functional API for building complex scikit-learn pipelines


baikal is written in pure Python. It supports Python 3.5 and above.

Note: baikal is still a young project and there might be backward-incompatible changes. The next development steps and backward-incompatible changes are announced and discussed in this issue. Please subscribe to it if you use baikal.

What is baikal?

baikal is a graph-based, functional API for building complex machine learning pipelines of objects that implement the scikit-learn API. It is mostly inspired by the excellent Keras API for Deep Learning, and borrows a few concepts from the TensorFlow framework and the (perhaps lesser known) graphkit package.

baikal aims to provide an API that lets you build complex, non-linear machine learning pipelines that look like this:

[Figure: diagram of a multiple-input, non-linear pipeline example]

with code that looks like this:

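# Imports are not shown in the original; this setup assumes baikal's make_step helper.
from sklearn import decomposition, ensemble, linear_model, preprocessing, svm
from baikal import Input, Model, make_step
from baikal.steps import Stack

# Wrap each scikit-learn estimator so it can be called on pipeline inputs.
ExtraTreesClassifier = make_step(ensemble.ExtraTreesClassifier)
RandomForestClassifier = make_step(ensemble.RandomForestClassifier)
PowerTransformer = make_step(preprocessing.PowerTransformer)
PCA = make_step(decomposition.PCA)
LogisticRegression = make_step(linear_model.LogisticRegression)
SVC = make_step(svm.SVC)

x1 = Input()   # first feature input
x2 = Input()   # second feature input
y_t = Input()  # training targets

y1 = ExtraTreesClassifier()(x1, y_t)
y2 = RandomForestClassifier()(x2, y_t)
z = PowerTransformer()(x2)
z = PCA()(z)
y3 = LogisticRegression()(z, y_t)

ensemble_features = Stack()([y1, y2, y3])  # predictions become features
y = SVC()(ensemble_features, y_t)

model = Model([x1, x2], y, y_t)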
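
To make the "graph-based, functional" idea more concrete, here is a toy sketch in plain Python. It is not baikal's implementation. The Node, Step and Model classes below are hypothetical stand-ins written only for illustration, and they skip fitting entirely. The point is that calling a step on a placeholder computes nothing. It only records which inputs the step depends on, and the model later walks that recorded graph.

```python
# Toy illustration only: these classes are hypothetical and are NOT baikal's API.

class Node:
    """Placeholder for a piece of data flowing through the pipeline."""

    def __init__(self, name, producer=None, parents=()):
        self.name = name
        self.producer = producer       # function that computes this node (None for inputs)
        self.parents = tuple(parents)  # nodes this node depends on


def Input(name):
    """Create a graph input, i.e. a node with no producer."""
    return Node(name)


class Step:
    """Wraps a plain function; calling it on nodes returns a new node."""

    def __init__(self, func, name):
        self.func = func
        self.name = name

    def __call__(self, *inputs):
        # Nothing is computed here; only the dependency is recorded.
        return Node(self.name + "/out", producer=self.func, parents=inputs)


class Model:
    """Evaluates the recorded graph from the given inputs to one output."""

    def __init__(self, inputs, output):
        self.inputs = list(inputs)
        self.output = output

    def predict(self, *values):
        cache = dict(zip(self.inputs, values))

        def evaluate(node):
            if node not in cache:
                if node.producer is None:
                    raise ValueError("missing input: " + node.name)
                args = [evaluate(parent) for parent in node.parents]
                cache[node] = node.producer(*args)
            return cache[node]

        return evaluate(self.output)


x1 = Input("x1")
x2 = Input("x2")
doubled = Step(lambda a: [2 * v for v in a], "double")(x1)
shifted = Step(lambda a: [v + 1 for v in a], "shift")(x2)
merged = Step(lambda a, b: [p + q for p, q in zip(a, b)], "add")(doubled, shifted)

model = Model([x1, x2], merged)
result = model.predict([1, 2, 3], [10, 20, 30])
print(result)  # [13, 25, 37]
```

Because intermediate nodes such as doubled and shifted are ordinary objects, a design like this makes it natural to branch, merge, and inspect intermediate results, which a strictly linear pipeline does not offer.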

What can I do with it?

With baikal you can

  • build non-linear pipelines effortlessly
  • handle multiple inputs and outputs
  • add steps that operate on targets as part of the pipeline
  • nest pipelines
  • use prediction probabilities (or any other kind of output) as inputs to other steps in the pipeline
  • query intermediate outputs, easing debugging
  • freeze steps that do not require fitting
  • define and add custom steps easily
  • plot pipelines

All with boilerplate-free, readable code.

Why baikal?

The pipeline above (to the best of the author's knowledge) cannot be easily built using scikit-learn's composite estimators API, because you run into several limitations:

  1. It is aimed at linear pipelines.
    • You could add some step parallelism with the ColumnTransformer API, but this is limited to transformer objects.
  2. Classifiers/Regressors can only be used at the end of the pipeline.
    • This means we cannot use the predicted labels (or their probabilities) as features to other classifiers/regressors.
    • You could leverage mlxtend's StackingClassifier and come up with some clever combination of the above composite estimators (Pipelines, ColumnTransformers, StackingClassifiers, etc.), but you might end up with code that is verbose and hard to follow.
  3. It cannot handle multiple-input/multiple-output models.

Perhaps you could instead define a big, composite estimator class that integrates each of the pipeline steps through composition. This, however, would most likely require

  • writing big __init__ methods to control each of the internal steps' knobs;
  • being careful with get_params and set_params if you want to use, say, GridSearchCV;
  • and adding some boilerplate code if you want to access the outputs of intermediate steps for debugging.
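
The kind of boilerplate listed above can be sketched with a deliberately tiny, dependency-free example. Scaler, ThresholdClassifier and CompositeEstimator are hypothetical toy classes that only mimic the scikit-learn fit/transform/predict and get_params/set_params conventions. They are not taken from scikit-learn or baikal.

```python
# Hypothetical toy classes that mimic scikit-learn conventions, for illustration only.

class Scaler:
    """Toy transformer: multiplies every value by a fixed factor."""

    def __init__(self, factor=1.0):
        self.factor = factor

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [x * self.factor for x in X]


class ThresholdClassifier:
    """Toy classifier: predicts 1 when a value exceeds the threshold."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return [int(x > self.threshold) for x in X]


class CompositeEstimator:
    """Hand-written composite: every internal knob must be re-exposed."""

    def __init__(self, scaler_factor=1.0, clf_threshold=0.0):
        # One constructor argument per internal parameter; this grows quickly.
        self.scaler_factor = scaler_factor
        self.clf_threshold = clf_threshold

    def get_params(self, deep=True):
        # Must stay in sync with __init__ for parameter searches to work.
        return {
            "scaler_factor": self.scaler_factor,
            "clf_threshold": self.clf_threshold,
        }

    def set_params(self, **params):
        valid = self.get_params()
        for key, value in params.items():
            if key not in valid:
                raise ValueError("invalid parameter: " + key)
            setattr(self, key, value)
        return self

    def fit(self, X, y=None):
        self.scaler_ = Scaler(self.scaler_factor).fit(X)
        # Extra attribute kept only so the intermediate output can be inspected.
        self.scaled_ = self.scaler_.transform(X)
        self.clf_ = ThresholdClassifier(self.clf_threshold).fit(self.scaled_, y)
        return self

    def predict(self, X):
        return self.clf_.predict(self.scaler_.transform(X))


est = CompositeEstimator(scaler_factor=2.0, clf_threshold=3.0).fit([1, 2, 3])
preds = est.predict([1, 2, 3])
est.set_params(clf_threshold=5.0).fit([1, 2, 3])
preds_after = est.predict([1, 2, 3])
print(preds, preds_after)  # [0, 1, 1] [0, 0, 1]
```

Even with only two internal steps, the constructor, get_params and set_params all have to be kept in sync by hand, and the intermediate output is only available because fit stores it explicitly.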

By using baikal as shown in the example above, code can be more readable, less verbose and closer to our mental representation of the pipeline. baikal also provides an API to fit, predict with, and query the entire pipeline with single commands.
