
pyensemble v0.41

---> ARCHIVED March 2021 <---

An implementation of [Caruana et al.'s Ensemble Selection algorithm](http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml04.icdm06long.pdf) [1][2] in Python, based on scikit-learn.
From the abstract:

We present a method for constructing ensembles from libraries of thousands of models. Model libraries are generated using different learning algorithms and parameter settings. Forward stepwise selection is used to add to the ensemble the models that maximize its performance. Ensemble selection allows ensembles to be optimized to performance metrics such as accuracy, cross entropy, mean precision or ROC Area. Experiments with seven test problems and ten metrics demonstrate the benefit of ensemble selection.
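The core of the method described above is greedy forward stepwise selection: repeatedly add (with replacement) whichever model most improves the ensemble's score on a hillclimb set. A minimal, self-contained sketch of that loop for binary classification with accuracy as the metric — simplified relative to pyensemble's full implementation (no bagging, pruning, or cross-validation), with illustrative names:

```python
import numpy as np

def ensemble_select(preds, y, max_models=10, epsilon=1e-4):
    """Greedy forward stepwise ensemble selection (with replacement).

    preds : (n_models, n_samples) array of probability predictions
    y     : (n_samples,) array of 0/1 labels for the hillclimb set
    Returns the list of selected model indices; averaging their
    predictions gives the ensemble's output.
    """
    # score each candidate model alone, and init with the single best
    accs = [np.mean((p > 0.5) == y) for p in preds]
    selected = [int(np.argmax(accs))]
    best_acc = accs[selected[0]]
    while len(selected) < max_models:
        # try appending every candidate (models may be added repeatedly)
        gains = []
        for i in range(len(preds)):
            avg = preds[selected + [i]].mean(axis=0)
            gains.append(np.mean((avg > 0.5) == y))
        i = int(np.argmax(gains))
        if gains[i] < best_acc + epsilon:   # stop when nothing helps enough
            break
        selected.append(i)
        best_acc = gains[i]
    return selected
```

Selecting with replacement lets a strong model accumulate weight in the averaged ensemble, which is one of the tricks the papers use to avoid overfitting the hillclimb set.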

It's a work in progress, so things can/might/will change.

David C. Lambert
dcl [at] panix [dot] com

Copyright © 2013
License: Simple BSD

Files

ensemble.py

Contains the EnsembleSelectionClassifier object.

The EnsembleSelectionClassifier object tries to implement all of the techniques from the combined papers, including internal cross-validation, bagged ensembling, initialization with the best models, pruning of the worst models prior to selection, and sampling of the model candidates with replacement.

It uses an sqlite database as the backing store, holding pickled unfitted models, fitted model 'siblings' for each internal cross-validation fold, scores and predictions for each model, and the list of model ids and weightings for the final ensemble.
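The mechanism is ordinary Python pickling into sqlite BLOB columns. A minimal sketch of the idea — the table and column names here are illustrative, not pyensemble's actual schema:

```python
import pickle
import sqlite3

from sklearn.tree import DecisionTreeClassifier

# Illustrative schema only -- pyensemble's actual tables and columns differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (model_id INTEGER PRIMARY KEY, pickled BLOB)")

# Fit a model and store its pickle as a BLOB.
clf = DecisionTreeClassifier().fit([[0], [1], [2], [3]], [0, 0, 1, 1])
conn.execute("INSERT INTO models (pickled) VALUES (?)", (pickle.dumps(clf),))

# Later: pull the BLOB back out and unpickle a working model.
(blob,) = conn.execute("SELECT pickled FROM models").fetchone()
restored = pickle.loads(blob)
```

Because everything lives in one db file, a trained ensemble can be reloaded and used for prediction without retraining.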

Hillclimbing can be performed using auc, accuracy, rmse, cross entropy or F1 score.

If the object is initialized with its model parameter set to None, it tries to load a fitted ensemble from the specified database.

(NOTE: Expects class labels to be sequential integers starting at zero [for now].)
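If your labels don't already meet that requirement, they can be remapped beforehand. A small sketch (not part of pyensemble) using numpy:

```python
import numpy as np

# Arbitrary class labels, remapped to sequential integers starting at zero.
y_raw = np.array([7, 3, 7, 9, 3])
classes, y = np.unique(y_raw, return_inverse=True)
# y is the encoded labels; classes[y] recovers the originals.
```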

model_library.py

Example model library building code.

ensemble_train.py

Training utility to run ensemble selection on svm data files.
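If you don't already have data in svm (svmlight/libsvm) format, scikit-learn can write it. A quick sketch with a made-up two-sample dataset:

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import dump_svmlight_file

# Tiny example dataset written in svmlight/libsvm format --
# the sparse "label index:value ..." layout ensemble_train.py reads.
X = np.array([[1.0, 0.0], [0.0, 2.0]])
y = np.array([0, 1])
path = os.path.join(tempfile.mkdtemp(), "some_data.svm")
dump_svmlight_file(X, y, path, zero_based=True)
```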

The user can choose from the following candidate models:

  • sgd : Stochastic Gradient Descent
  • svc : Support Vector Machines
  • gbc : Gradient Boosting Classifiers
  • dtree : Decision Trees
  • forest : Random Forests
  • extra : Extra Trees
  • kmp : KMeans->LogisticRegression Pipelines
  • kernp : Nystroem Approx->Logistic Regression Pipelines

Some model choices are very slow. The default is to use decision trees, which are reasonably fast.

The simplest command line is:

unix> ./ensemble_train.py some_dbfile.db some_data.svm

(NOTE: Expects 'some_dbfile.db' not to exist, and will quit if it does [so you don't accidentally blow away your model].)

Full usage is:

usage: ensemble_train.py [-h]
                         [-M {svc,sgd,gbc,dtree,forest,extra,kmp,kernp}
                            [{svc,sgd,gbc,dtree,forest,extra,kmp,kernp} ...]]
                         [-S {f1,auc,rmse,accuracy,xentropy}] [-b N_BAGS]
                         [-f BAG_FRACTION] [-B N_BEST] [-m MAX_MODELS]
                         [-F N_FOLDS] [-p PRUNE_FRACTION] [-u] [-U]
                         [-e EPSILON] [-t TEST_SIZE] [-s SEED] [-v]
                         db_file data_file

EnsembleSelectionClassifier training harness

positional arguments:
  db_file               sqlite db file for backing store
  data_file             training data in svm format

optional arguments:
  -h, --help            show this help message and exit
  -M {svc,sgd,gbc,dtree,forest,extra,kmp,kernp}
    [{svc,sgd,gbc,dtree,forest,extra,kmp,kernp} ...]
                        model types to include as ensemble candidates
                        (default: ['dtree'])
  -S {f1,auc,rmse,accuracy,xentropy}
                        scoring metric used for hillclimbing (default:
                        accuracy)
  -b N_BAGS             bags to create (default: 20)
  -f BAG_FRACTION       fraction of models in each bag (after pruning)
                        (default: 0.25)
  -B N_BEST             number of best models in initial ensemble (default: 5)
  -m MAX_MODELS         maximum number of models per bagged ensemble (default:
                        25)
  -F N_FOLDS            internal cross-validation folds (default: 3)
  -p PRUNE_FRACTION     fraction of worst models pruned pre-selection
                        (default: 0.75)
  -u                    use epsilon to stop adding models (default: False)
  -U                    use bootstrap sample to generate training/hillclimbing
                        folds (default: False)
  -e EPSILON            score improvement threshold to include new model
                        (default: 0.0001)
  -t TEST_SIZE          fraction of data to use for testing (default: 0.75)
  -s SEED               random seed
  -v                    show progress messages

ensemble_predict.py

Gets predictions from a trained EnsembleSelectionClassifier given an svm-format data file.

Can output predicted classes or probabilities from the full ensemble or just the best model.

Expects to find a trained ensemble in the sqlite db specified.

usage: ensemble_predict.py [-h] [-s {best,ens}] [-p] db_file data_file

Get EnsembleSelectionClassifier predictions

positional arguments:
  db_file        sqlite db file containing model
  data_file      testing data in svm format

optional arguments:
  -h, --help     show this help message and exit
  -s {best,ens}  choose source of prediction ["best", "ens"]
  -p             predict probabilities

Requirements

Written using Python 2.7.3, numpy 1.6.1, scipy 0.10.1, scikit-learn 0.14.1 and sqlite 3.7.14

References

[1] Caruana et al., "Ensemble Selection from Libraries of Models", Proceedings of the 21st International Conference on Machine Learning (ICML '04).

[2] Caruana et al., "Getting the Most Out of Ensemble Selection", Proceedings of the 6th International Conference on Data Mining (ICDM '06).
