riken-aip / pyHSICLasso

License: MIT
Versatile Nonlinear Feature Selection Algorithm for High-dimensional Data

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to pyHSICLasso

zoofs
zoofs is a Python library for performing feature selection using a variety of nature-inspired wrapper algorithms. The algorithms range from swarm-intelligence to physics-based to evolutionary. It is an easy-to-use, flexible, and powerful tool for reducing your feature set.
Stars: ✭ 142 (+13.6%)
Mutual labels:  machine-learning-algorithms, feature-selection
featurewiz
Use advanced feature engineering strategies and select best features from your data set with a single line of code.
Stars: ✭ 229 (+83.2%)
Mutual labels:  feature-selection, feature-extraction
feature engine
Feature engineering package with sklearn like functionality
Stars: ✭ 758 (+506.4%)
Mutual labels:  feature-selection, feature-extraction
PyImpetus
PyImpetus is a Markov-blanket-based feature subset selection algorithm that considers features both separately and together as a group in order to provide not just the best set of features but also the best combination of features.
Stars: ✭ 83 (-33.6%)
Mutual labels:  machine-learning-algorithms, feature-selection
Php Ml
PHP-ML - Machine Learning library for PHP
Stars: ✭ 7,900 (+6220%)
Mutual labels:  machine-learning-algorithms, feature-extraction
Machine Learning Workflow With Python
A comprehensive tour of ML techniques with Python: define the problem, specify inputs & outputs, data collection, exploratory data analysis, data preprocessing, model design, training, and evaluation.
Stars: ✭ 157 (+25.6%)
Mutual labels:  machine-learning-algorithms, feature-extraction
Nni
An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression, and hyper-parameter tuning.
Stars: ✭ 10,698 (+8458.4%)
Mutual labels:  machine-learning-algorithms, feature-extraction
50-days-of-Statistics-for-Data-Science
This repository consists of a 50-day program. All the statistics required for a complete understanding of data science will be uploaded to this repository.
Stars: ✭ 19 (-84.8%)
Mutual labels:  feature-selection, feature-extraction
2019-feature-selection
Research project
Stars: ✭ 26 (-79.2%)
Mutual labels:  feature-selection
Machine-Learning-Explained
Learn the theory, math and code behind different machine learning algorithms and techniques.
Stars: ✭ 30 (-76%)
Mutual labels:  machine-learning-algorithms
generalized-additive-models-workshop-2019
A workshop on using generalized additive models and the mgcv package.
Stars: ✭ 23 (-81.6%)
Mutual labels:  nonlinear
vlainic.github.io
My GitHub blog: things you might be interested in, and probably not...
Stars: ✭ 26 (-79.2%)
Mutual labels:  machine-learning-algorithms
PyLDA
A Latent Dirichlet Allocation implementation in Python.
Stars: ✭ 51 (-59.2%)
Mutual labels:  machine-learning-algorithms
face-authentication
Face-authentication system written entirely in C++ with OpenCV and the Qt third-party library. A face-antispoofing procedure is included.
Stars: ✭ 49 (-60.8%)
Mutual labels:  feature-extraction
mildnet
Visual Similarity research at Fynd. Contains code to reproduce 2 of our research papers.
Stars: ✭ 76 (-39.2%)
Mutual labels:  feature-extraction
machine-learning-implemetation-python
Basic Machine Learning implementation with python
Stars: ✭ 51 (-59.2%)
Mutual labels:  machine-learning-algorithms
Cerbo
Perform Efficient ML/DL Modelling easily
Stars: ✭ 12 (-90.4%)
Mutual labels:  machine-learning-algorithms
Tf-Rec
Tf-Rec is a python💻 package for building⚒ Recommender Systems. It is built on top of Keras and Tensorflow 2 to utilize GPU Acceleration during training.
Stars: ✭ 18 (-85.6%)
Mutual labels:  machine-learning-algorithms
Stock-Selection-a-Framework
This project demonstrates how to apply machine learning algorithms to distinguish "good" stocks from the "bad" stocks.
Stars: ✭ 239 (+91.2%)
Mutual labels:  feature-selection
AgePredictor
Age classification from text using PAN16, blogs, Fisher Callhome, and Cancer Forum
Stars: ✭ 13 (-89.6%)
Mutual labels:  machine-learning-algorithms

pyHSICLasso

pypi MIT License Build Status

pyHSICLasso is a package for the Hilbert-Schmidt Independence Criterion Lasso (HSIC Lasso), a black-box (nonlinear) feature selection method that considers nonlinear input-output relationships. HSIC Lasso can be regarded as a convex variant of the widely used minimum redundancy maximum relevance (mRMR) feature selection algorithm.

Advantages of HSIC Lasso

  • Can find nonlinearly related features efficiently.
  • Can find non-redundant features.
  • Can obtain a globally optimal solution.
  • Can deal with both regression and classification problems through kernels.
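
Concretely, HSIC Lasso solves a convex optimization problem of the following form (a sketch of the formulation in Yamada et al. (2014), cited below; see the paper for the exact normalization details):

\min_{\alpha \in \mathbb{R}^d,\ \alpha \ge 0} \ \frac{1}{2} \Big\| \bar{L} - \sum_{k=1}^{d} \alpha_k \bar{K}^{(k)} \Big\|_F^2 + \lambda \|\alpha\|_1

where \bar{K}^{(k)} is the centered Gram matrix computed from the k-th input feature, \bar{L} is the centered Gram matrix of the output, and \|\cdot\|_F is the Frobenius norm. Features with nonzero coefficients \alpha_k are selected; the non-negativity constraint together with the L1 penalty keeps the problem convex, which is why a globally optimal solution can be obtained.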

Feature Selection

The goal of supervised feature selection is to find a subset of input features that are responsible for predicting the output values. Finding such a subset in high-dimensional supervised learning is an important problem with many real-world applications, such as gene selection from microarray data, document categorization, and prosthesis control. HSIC Lasso captures nonlinear dependencies between inputs and outputs and computes the globally optimal solution efficiently even for high-dimensional problems. Its effectiveness has been demonstrated through feature selection experiments for classification and regression with thousands of features.

Install

$ pip install -r requirements.txt
$ python setup.py install

or

$ pip install pyHSICLasso

Usage

pyHSICLasso provides a single entry point, the HSICLasso() class.

This class has the following methods:

  • input
  • regression
  • classification
  • dump
  • plot_path
  • plot_dendrogram
  • plot_heatmap
  • get_features
  • get_features_neighbors
  • get_index
  • get_index_score
  • get_index_neighbors
  • get_index_neighbors_score
  • save_param

The following input formats are supported:

  • MATLAB file (.mat)
  • .csv
  • .tsv
  • numpy's ndarray

Input file

When using a .mat, .csv, or .tsv file, we support pandas dataframes. Each row of the dataframe corresponds to a sample. The output variable should be labeled class; if you wish to use your own label, specify the output variables as a list (output_list=['tag']). The remaining columns are the values of each feature. The following is sample data in CSV format.

class,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10
-1,2,0,0,0,-2,0,-2,0,2,0
1,2,2,0,0,-2,0,0,0,2,0
...

For multi-variate output cases, you can specify the outputs with the output_list argument, as sketched below. See the sample code for details.
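
A minimal sketch (the column names output1 and output2 are hypothetical; output_list is the argument described above):

>>> hsic_lasso.input("data.csv", output_list=['output1', 'output2'])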

Save results to a CSV file

If you want to save the feature selection results to a CSV file, call the following function:

>>> hsic_lasso.save_param()

To remove the effect of specific covariates

In biological applications, we may want to remove the effect of covariates such as gender and/or age. In such cases, we can pre-specify the covariates X in the classification or regression functions as

>>> hsic_lasso.regression(5,covars=X)

>>> hsic_lasso.classification(10,covars=X)

Please see example/sample_covars.py for details.
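
For instance, a minimal sketch (covars.csv is a hypothetical file with one row of covariate values per sample, e.g. age and gender):

>>> import numpy as np
>>> X = np.loadtxt("covars.csv", delimiter=",")
>>> hsic_lasso.regression(5, covars=X)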

To handle a large number of samples

HSIC Lasso scales well with respect to the number of features d. However, the vanilla HSIC Lasso requires O(dn^2) memory and may run out of memory if the number of samples n is more than 1000. In such cases, we can use the block HSIC Lasso, which requires only O(dnBM) space, where B << n is the block parameter and M is the permutation parameter used to stabilize the final result. This can be done by specifying the B and M parameters in the regression or classification function, as sketched below. Currently, the default parameters are B=20 and M=3, respectively. If you wish to use the vanilla HSIC Lasso, use B=0 and M=1.
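
For example (a sketch assuming B and M are accepted as keyword arguments of regression and classification, matching the defaults described above):

>>> hsic_lasso.regression(5, B=50, M=10)  # block HSIC Lasso with a larger block size and more permutations
>>> hsic_lasso.regression(5, B=0, M=1)    # vanilla HSIC Lasso (O(dn^2) memory)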

Example

>>> from pyHSICLasso import HSICLasso
>>> hsic_lasso = HSICLasso()

>>> hsic_lasso.input("data.mat")

>>> hsic_lasso.input("data.csv")

>>> hsic_lasso.input("data.tsv")

>>> hsic_lasso.input(np.array([[1, 1, 1], [2, 2, 2]]), np.array([0, 1]))

You can specify the number of features to select as the first argument of regression and classification.

>>> hsic_lasso.regression(5)

>>> hsic_lasso.classification(10)

As for output, you can plot the graph, dump the details of the analysis result, or retrieve the indices of the selected features. Note that the dump() function requires at least 5 features in the dataset.

>>> hsic_lasso.plot()
# plot the graph
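
The other plotting helpers from the method list above can be called the same way (a sketch; the comments describe what the method names suggest):

>>> hsic_lasso.plot_path()        # plot the regularization path
>>> hsic_lasso.plot_dendrogram()  # plot a dendrogram of the selected features
>>> hsic_lasso.plot_heatmap()     # plot a heatmap of the selected features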

>>> hsic_lasso.dump()
============================================== HSICLasso : Result ==================================================
| Order | Feature      | Score | Top-5 Related Feature (Relatedness Score)                                          |
| 1     | 1100         | 1.000 | 100          (0.979), 385          (0.104), 1762         (0.098), 762          (0.098), 1385         (0.097)|
| 2     | 100          | 0.537 | 1100         (0.979), 385          (0.100), 1762         (0.095), 762          (0.094), 1385         (0.092)|
| 3     | 200          | 0.336 | 1200         (0.979), 264          (0.094), 1482         (0.094), 1264         (0.093), 482          (0.091)|
| 4     | 1300         | 0.140 | 300          (0.984), 1041         (0.107), 1450         (0.104), 1869         (0.102), 41           (0.101)|
| 5     | 300          | 0.033 | 1300         (0.984), 1041         (0.110), 41           (0.106), 1450         (0.100), 1869         (0.099)|
>>> hsic_lasso.get_index()
[1099, 99, 199, 1299, 299]

>>> hsic_lasso.get_index_score()
array([0.09723658, 0.05218047, 0.03264885, 0.01360242, 0.00319763])

>>> hsic_lasso.get_features()
['1100', '100', '200', '1300', '300']

>>> hsic_lasso.get_index_neighbors(feat_index=0,num_neighbors=5)
[99, 384, 1761, 761, 1384]

>>> hsic_lasso.get_features_neighbors(feat_index=0,num_neighbors=5)
['100', '385', '1762', '762', '1385']

>>> hsic_lasso.get_index_neighbors_score(feat_index=0,num_neighbors=5)
array([0.9789888 , 0.10350618, 0.09757666, 0.09751763, 0.09678892])

>>> hsic_lasso.save_param() # Save the selected features and their neighbors

Citation

If you use this software for your research, please cite the following two papers: the original HSIC Lasso paper and its block counterpart.

@article{yamada2014high,
  title={High-dimensional feature selection by feature-wise kernelized lasso},
  author={Yamada, Makoto and Jitkrittum, Wittawat and Sigal, Leonid and Xing, Eric P and Sugiyama, Masashi},
  journal={Neural computation},
  volume={26},
  number={1},
  pages={185--207},
  year={2014},
  publisher={MIT Press}
}

@article{climente2019block,
  title={Block HSIC Lasso: model-free biomarker detection for ultra-high dimensional data},
  author={Climente-Gonz{\'a}lez, H{\'e}ctor and Azencott, Chlo{\'e}-Agathe and Kaski, Samuel and Yamada, Makoto},
  journal={Bioinformatics},
  volume={35},
  number={14},
  pages={i427--i435},
  year={2019},
  publisher={Oxford University Press}
}

References

Applications of HSIC Lasso

  • Takahashi, Y., Ueki, M., Yamada, M., Tamiya, G., Motoike, I., Saigusa, D., Sakurai, M., Nagami, F., Ogishima, S., Koshiba, S., Kinoshita, K., Yamamoto, M., Tomita, H. Improved metabolomic data-based prediction of depressive symptoms using nonlinear machine learning with feature selection. Translational Psychiatry volume 10, Article number: 157 (2020).

Contributors

Developers

Name : Makoto Yamada (Kyoto University/RIKEN AIP), Héctor Climente-González (RIKEN AIP)

E-mail : [email protected]

Distributor

Name : Hirotaka Suetake (RIKEN AIP)

E-mail : [email protected]
