Random Fourier Features
This repository provides the Python module rfflearn,
a library of random Fourier features [1, 2] for kernel methods
such as support vector machines and Gaussian process models.
Features of this module are:
- the interfaces of the module are quite close to those of scikit-learn,
- the support vector classifier and the Gaussian process regressor/classifier support both CPU and GPU training and inference,
- an interface to optuna is provided for easier hyperparameter tuning,
- this repository provides example code showing that RFF is useful for practical machine learning tasks.
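For readers new to the technique, the following is a minimal NumPy sketch of the random Fourier features idea itself [1], independent of the rfflearn API. For an RBF kernel k(x, y) = exp(-gamma * ||x - y||^2), the randomized feature map z(x) = sqrt(2/D) * cos(Wx + b), with W drawn from N(0, 2*gamma) and b from Uniform(0, 2*pi), satisfies z(x)^T z(y) ≈ k(x, y):

```python
import numpy as np

rng = np.random.default_rng(42)
gamma, dim_in, dim_rff = 1.0, 2, 4096

# Random projection and phase for the RFF map z(x) = sqrt(2/D) cos(Wx + b).
W = rng.normal(scale=np.sqrt(2 * gamma), size=(dim_in, dim_rff))
b = rng.uniform(0, 2 * np.pi, size=dim_rff)

def z(x):
    """Random Fourier feature map: R^dim_in -> R^dim_rff."""
    return np.sqrt(2 / dim_rff) * np.cos(x @ W + b)

x = np.array([0.3, -0.5])
y = np.array([0.1, 0.2])
exact = np.exp(-gamma * np.sum((x - y) ** 2))  # exact RBF kernel value
approx = z(x) @ z(y)                           # RFF approximation
print(exact, approx)  # the two values should be close for large dim_rff
```

Because z(x) is an explicit finite-dimensional feature vector, any linear model trained on z(x) approximates the corresponding kernel machine at a fraction of the cost, which is what the classes below exploit.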
Currently, this module supports the following methods:
Method | CPU support | GPU support |
---|---|---|
canonical correlation analysis | rfflearn.cpu.RFFCCA | - |
Gaussian process regression | rfflearn.cpu.RFFGPR | rfflearn.gpu.RFFGPR |
Gaussian process classification | rfflearn.cpu.RFFGPC | rfflearn.gpu.RFFGPC |
principal component analysis | rfflearn.cpu.RFFPCA | rfflearn.gpu.RFFPCA |
regression | rfflearn.cpu.RFFRegression | - |
support vector classification | rfflearn.cpu.RFFSVC | rfflearn.gpu.RFFSVC |
support vector regression | rfflearn.cpu.RFFSVR | - |
RFF is applicable to many other machine learning algorithms; support for more of them will be added in the future.
Minimal example
The interfaces provided by this module are quite close to those of scikit-learn.
For example, the following Python code is a sample usage of the RFFSVC
(support vector classifier with random Fourier features) class.
>>> import numpy as np
>>> import rfflearn.cpu as rfflearn # Import module
>>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]]) # Define input data
>>> y = np.array([1, 1, 2, 2]) # Define label data
>>> svc = rfflearn.RFFSVC().fit(X, y) # Training (on CPU)
>>> svc.score(X, y) # Inference (on CPU)
1.0
>>> svc.predict(np.array([[-0.8, -1]]))
array([1])
This module also supports training and inference on GPU.
For example, the following Python code is a sample usage of the RFFGPC
(Gaussian process classifier with random Fourier features) class on GPU.
It requires PyTorch (>= 1.7.0).
>>> import numpy as np
>>> import rfflearn.gpu as rfflearn # Import module
>>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]]) # Define input data
>>> y = np.array([1, 1, 2, 2]) # Define label data
>>> gpc = rfflearn.RFFGPC().fit(X, y) # Training on GPU
>>> gpc.score(X, y) # Inference on GPU
1.0
>>> gpc.predict(np.array([[-0.8, -1]]))
array([1])
See the examples directory for more detailed examples.
Example1: MNIST using random Fourier features
I applied SVC (support vector classifier) and GPC (Gaussian process classifier) with RFF to the MNIST dataset, one of the most famous benchmark datasets for image classification, and obtained better accuracy and much faster inference than kernel SVM. The following table gives a brief comparison of kernel SVM, SVC with RFF, and GPC with RFF. See the examples of the RFF SVC module and the RFF GP module for more details.
Method | RFF dimension | Inference time (us) | Score (%) |
---|---|---|---|
Kernel SVM | - | 4644.9 | 96.3 |
RFF SVC | 512 | 39.0 | 96.5 |
RFF SVC | 1024 | 96.1 | 97.5 |
RFF SVC (GPU) | 1024 | 2.38 | 97.5 |
RFF GPC | 5120 | 342.1 | 98.2 |
RFF GPC (GPU) | 5120 | 115.0 | 98.2 |
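As a rough illustration of how a per-sample inference time like the ones above can be measured, here is a self-contained sketch with a stand-in RFF classifier; the matrices, sizes, and timing loop are made up for illustration and are not the actual MNIST benchmark code:

```python
import time
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(784, 1024))      # stand-in RFF projection (MNIST-sized input)
coef = rng.normal(size=(1024, 10))    # stand-in linear classifier weights
X_test = rng.normal(size=(1000, 784)) # stand-in test batch

# Time batched inference, then report the average time per sample.
start = time.perf_counter()
pred = np.argmax(np.cos(X_test @ W) @ coef, axis=1)
elapsed = time.perf_counter() - start
print(f"{elapsed / len(X_test) * 1e6:.1f} us per sample")
```

Batched timing like this amortizes the matrix-multiplication overhead across samples, which is how per-sample figures in the microsecond range become achievable.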
Example2: Visualization of feature importance
This module also has interfaces to some feature importance methods, such as SHAP [3] and permutation
importance [4]. I applied SHAP and permutation importance to an RFFGPR model trained on the
Boston house-price dataset, and the following are the visualization results obtained by
rfflearn.shap_feature_importance and rfflearn.permutation_feature_importance.
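To make the underlying idea concrete, here is a hand-rolled sketch of permutation importance [4] on a toy least-squares regression, using NumPy only; the rfflearn helper functions' actual signatures are not reproduced here. A feature's importance is the drop in the R^2 score after randomly shuffling that feature's column:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
# Feature 0 matters a lot, feature 1 a little, feature 2 not at all.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Fit ordinary least squares: w = argmin ||Xw - y||^2.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def r2(X_eval, y_eval):
    """Coefficient of determination of the fitted model on given data."""
    resid = y_eval - X_eval @ w
    return 1 - resid.var() / y_eval.var()

base = r2(X, y)
importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # destroy feature j's information
    importance.append(base - r2(Xp, y))   # score drop = importance

print(importance)  # feature 0 dominates, feature 2 is near zero
```

The same recipe applies to any fitted model with a score function, which is why permutation importance is model-agnostic.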
Requirements and installation
The author recommends using the Docker image to build the environment; however, you can of course install the necessary packages in your own environment instead. See SETUP.md for more details.
Notes
- The name of this module was changed from `pyrff` to `rfflearn` in Oct 2020, because the package `pyrff` already exists on PyPI.
- If the amount of training data is huge, an error like `RuntimeError: The task could not be sent to the workers as it is too large for 'send_bytes'` may be raised from the joblib library. The reason for this error is that `sklearn.svm.LinearSVC` uses `joblib` as a multiprocessing backend, but joblib cannot handle arrays too large to be managed in a 32-bit address space. In this case, please try the `n_jobs = 1` option for the `RFFSVC` or `ORFSVC` class. The default setting is `n_jobs = -1`, which means automatically detecting the available CPUs and using all of them. (This bug was reported by Mr. Katsuya Terahata @ Toyota Research Institute Advanced Development. Thank you so much for the report!)
- Application of RFF to the Gaussian process is not straightforward. See this document for mathematical details.
Licence
Reference
[1] A. Rahimi and B. Recht, "Random Features for Large-Scale Kernel Machines", NIPS, 2007. PDF
[2] F. X. Yu, A. T. Suresh, K. Choromanski, D. Holtmann-Rice and S. Kumar, "Orthogonal Random Features", NIPS, 2016. PDF
[3] S. M. Lundberg and S. Lee, "A Unified Approach to Interpreting Model Predictions", NIPS, 2017. PDF
[4] L. Breiman, "Random Forests", Machine Learning, vol. 45, pp. 5-32, Springer, 2001. Springer website.