EpistasisLab / Rebate

License: MIT
Relief Based Algorithms of ReBATE implemented in Python with Cython optimization. This repository is no longer being updated. Please see scikit-rebate.


Package information: Python 2.7, Python 3.5

ReBATE (Relief-based Algorithm Training Environment)

This package includes stand-alone Python code to run any of the included Relief-Based Algorithms (RBAs), which are designed for feature weighting/selection as part of a supervised machine learning pipeline. Presently this includes the following core RBAs: ReliefF, SURF, SURF*, MultiSURF*, and MultiSURF. An implementation of the iterative TuRF mechanism is also included. The package is still under active development and we encourage you to check back on this repository regularly for updates.

These algorithms offer a computationally efficient way to perform feature selection that is sensitive to feature interactions as well as simple univariate associations, unlike most currently available filter-based feature selection methods. The main benefit of Relief algorithms is that they identify feature interactions without having to exhaustively check every pairwise interaction, thus taking significantly less time than exhaustive pairwise search.

Each core algorithm outputs an ordered set of feature names along with their respective feature scores (i.e., weights). Certain algorithms require user-specified run parameters (e.g., ReliefF requires the user to specify the number of nearest neighbors, 'k').
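To make the idea concrete, here is a minimal ReliefF-style sketch. It is not ReBATE's implementation: the function name `relieff_scores`, the Hamming distance, and the toy data set are assumptions chosen for illustration, and the sketch only handles discrete features with a binary endpoint and no missing values. The class is the XOR of features A and B, so neither feature shows any univariate association with the class, yet both should outrank the two noise features.

```python
import itertools

import numpy as np


def relieff_scores(X, y, k=2):
    """Toy ReliefF for discrete features and a binary endpoint (illustrative only)."""
    X = np.asarray(X)
    y = np.asarray(y)
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    for i in range(n_samples):
        # Per-feature mismatch (0 or 1) between instance i and every instance.
        diffs = (X != X[i]).astype(float)
        dist = diffs.sum(axis=1)  # Hamming distance to instance i
        dist[i] = np.inf          # never select the instance itself
        same_class = y == y[i]
        hit_idx = np.where(same_class)[0]
        miss_idx = np.where(~same_class)[0]
        # k nearest neighbors of the same class (hits) and the other class (misses).
        hits = hit_idx[np.argsort(dist[hit_idx], kind="stable")[:k]]
        misses = miss_idx[np.argsort(dist[miss_idx], kind="stable")[:k]]
        # Penalize features that differ among hits, reward those that differ among misses.
        weights -= diffs[hits].mean(axis=0) / n_samples
        weights += diffs[misses].mean(axis=0) / n_samples
    return weights


# Hypothetical data: all 16 combinations of four binary features.
feature_names = ["A", "B", "noise1", "noise2"]
X = np.array(list(itertools.product([0, 1], repeat=4)))
y = X[:, 0] ^ X[:, 1]  # class depends only on the A-B interaction

weights = relieff_scores(X, y, k=2)
# Ordered feature names with their scores, highest first.
ranked = sorted(zip(feature_names, weights), key=lambda pair: -pair[1])
for name, score in ranked:
    print(f"{name}\t{score:+.3f}")
```

On this data set A and B each score +0.5 and the noise features each score -0.5, although no pair of features was ever tested explicitly. The choice of `k` matters: here each instance has exactly two nearest hits and two nearest misses, so `k=2` avoids arbitrary tie-breaking.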

Relief algorithms are commonly applied to genetic analyses, where epistasis (i.e., feature interactions) is common. However, the algorithms implemented in this package can be applied to almost any supervised learning data set and support:

  • Feature sets that are discrete/categorical, continuous-valued or a mix of both

  • Data with missing values

  • Binary endpoints (i.e., classification)

  • Multi-class endpoints (i.e., classification)

  • Continuous endpoints (i.e., regression)

Built into this code is a strategy to automatically detect these characteristics from the loaded data.
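One way such detection could work is sketched below. This is a guess at the general approach, not ReBATE's actual logic: the function name `describe_dataset`, the `discrete_limit` threshold of 10 unique values, and the use of NaN to mark missing values are all assumptions made for the example.

```python
import numpy as np


def describe_dataset(X, y, discrete_limit=10):
    """Guess feature types, missingness, and endpoint type (hypothetical heuristic)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    feature_types = []
    for column in X.T:
        observed = column[~np.isnan(column)]  # ignore missing entries
        n_unique = np.unique(observed).size
        # Assumption: few distinct values means the feature is discrete.
        feature_types.append("discrete" if n_unique <= discrete_limit else "continuous")

    n_classes = np.unique(y).size
    if n_classes == 2:
        endpoint = "binary"
    elif n_classes <= discrete_limit:
        endpoint = "multiclass"
    else:
        endpoint = "continuous"

    return {
        "feature_types": feature_types,
        "has_missing": bool(np.isnan(X).any()),
        "endpoint": endpoint,
    }


# Hypothetical data: one three-level feature with a missing value,
# one feature with 12 distinct real values, and a binary endpoint.
X = np.column_stack([
    [0, 1, 2, np.nan, 1, 0, 2, 1, 0, 2, 1, 0],
    np.linspace(0.0, 1.1, 12),
])
y = [0, 1] * 6

summary = describe_dataset(X, y)
print(summary)
```

A unique-value threshold is a blunt rule: an integer-coded feature with many levels would be treated as continuous, so a real tool would likely let the user override the guess.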

Of our two initial ReBATE software releases, this stand-alone version primarily focuses on improving run-time with the use of Cython. This code is most appropriate for more experienced users or those primarily interested in reducing analysis run time.

We recommend that scikit-learn users, Windows users, beginners, and those looking for the most recent ReBATE developments instead use our alternate scikit-rebate implementation. ReBATE can be run on Windows, but this requires some additional installation steps and possibly some troubleshooting, as outlined below.

License

Please see the repository license for the licensing and usage information for ReBATE. Generally, we have licensed ReBATE to make it as widely usable as possible.

Cython (Important Notice)

NOTICE: As is, this code will not run on your local platform! Portions of this code have been optimized with Cython routines for speed. As a result, before ReBATE can be used on a given operating system (e.g., Linux, Mac, or Windows), critical binary files must be compiled. This is a one-time step, but it must be repeated any time the underlying source code is modified or an updated version of ReBATE is downloaded to your system. Compiling the necessary binary files is very easy on Mac or Linux systems (because they include a C compiler). However, Windows users will unfortunately have to go through a few extra hurdles to complete this step. If you wish to avoid this hassle, please see our alternate scikit-rebate implementation.

Installation

For detailed information on installing ReBATE, including necessary prerequisites, special instructions for Windows users, and instructions for compiling the Cython code, please refer to our installation documentation.

Running ReBATE

From the '/rebate/' directory, run the following to view all available options:

./rebate.py -h

For detailed information and examples of how to run the different Relief algorithms available in this package, please refer to our usage documentation.

Contributing to ReBATE

We welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to ReBATE, please file a new issue so we can discuss it.

If you wish to contribute to ReBATE, we strongly recommend following the steps detailed in our contributing documentation.

Citing ReBATE

If you use ReBATE or the MultiSURF algorithm in a scientific publication, please consider citing the following paper:

Ryan J. Urbanowicz, Randal S. Olson, Peter Schmitt, Melissa Meeker, Jason H. Moore (2017). Benchmarking Relief-Based Feature Selection Methods. arXiv preprint, under review.

BibTeX entry:

@misc{Urbanowicz2017Benchmarking,
    author = {Urbanowicz, Ryan J. and Olson, Randal S. and Schmitt, Peter and Meeker, Melissa and Moore, Jason H.},
    title = {Benchmarking Relief-Based Feature Selection Methods},
    year = {2017},
    howpublished = {arXiv e-print. https://arxiv.org/abs/1711.08477},
}

If you wish to directly cite the original paper for one of the other algorithms implemented in ReBATE please refer to our citing documentation.

History

This code is largely based on Python implementations of ReliefF, SURF, SURF*, MultiSURF*, and TuRF within the ExSTraCS algorithm software. That Python code was in turn based on Java implementations of these algorithms within the Multifactor Dimensionality Reduction (MDR) software. In contrast with the MDR implementations, the ExSTraCS, scikit-rebate, and present ReBATE versions of this code have been expanded to accommodate the following data considerations: continuous features, a mix of discrete and continuous features, a continuous endpoint/outcome, and missing data values.

Possible future updates

  1. Make this an installable package
  2. Convert to Classes
  3. Create a GUI