jasonlaska / Spherecluster

Licence: MIT
Clustering routines for the unit sphere

Clustering on the unit hypersphere in scikit-learn

[Figure: Mixture of von Mises Fisher]

Algorithms

This package implements the three algorithms outlined in "Clustering on the Unit Hypersphere using von Mises-Fisher Distributions", Banerjee et al., JMLR 2005, for scikit-learn.

  1. Spherical K-means (spkmeans)

    Spherical K-means differs from conventional K-means in that it projects the estimated cluster centroids onto the unit sphere at the end of each maximization step (i.e., normalizes the centroids).

  2. Mixture of von Mises Fisher distributions (movMF)

    Much like the Gaussian distribution is parameterized by mean and variance, the von Mises Fisher distribution has a mean direction $\mu$ and a concentration parameter $\kappa$. Each point $x_i$ drawn from the vMF distribution lives on the surface of the unit hypersphere $S^{N-1}$ (i.e., $\|x_i\|_2 = 1$), as does the mean direction ($\|\mu\|_2 = 1$). Larger $\kappa$ leads to a more concentrated cluster of points.

    If we model our data as a mixture of von Mises Fisher distributions, we have an additional weight parameter $\alpha$ for each distribution in the mixture. The movMF algorithms estimate the mixture parameters via expectation-maximization (EM), which lets us cluster the data accordingly.

    • soft-movMF

      Estimates the real-valued posterior on each example for each class. This enables a soft clustering in the sense that we have a probability of cluster membership for each data point.

    • hard-movMF

      Sets the posterior on each example to be 1 for a single class and 0 for all others by selecting the location of the max value in the estimated soft posterior.

    Beyond estimating cluster centroids, these algorithms also jointly estimate the weights of each cluster and the concentration parameters. We provide an option to pass in (and override) weight estimates if they are known in advance.

    Label assignment is achieved by computing the argmax of the posterior for each example.
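To make the soft and hard posteriors concrete, here is a minimal NumPy sketch. It is not the package's implementation: it assumes 3-D data so that the vMF normalizer has the closed form $\kappa / (4\pi \sinh\kappa)$, and the function names `log_vmf_3d` and `posteriors` are made up for illustration.

```python
import numpy as np

def log_vmf_3d(X, mu, kappa):
    """Log-density of a vMF distribution on the ordinary sphere (3-D only).

    Uses log c(kappa) = log(kappa) - log(4*pi*sinh(kappa)), rewritten so
    that large kappa does not overflow.
    """
    log_norm = (np.log(kappa) - np.log(2 * np.pi) - kappa
                - np.log1p(-np.exp(-2 * kappa)))
    return log_norm + kappa * (X @ mu)

def posteriors(X, mus, kappas, alphas):
    # log(alpha_k * f(x_i | mu_k, kappa_k)), shape (n_examples, n_clusters)
    log_joint = np.stack(
        [np.log(a) + log_vmf_3d(X, m, k)
         for m, k, a in zip(mus, kappas, alphas)],
        axis=1,
    )
    # soft posterior: normalize each row (shift by the row max for stability)
    log_joint -= log_joint.max(axis=1, keepdims=True)
    soft = np.exp(log_joint)
    soft /= soft.sum(axis=1, keepdims=True)
    # label assignment: argmax of the posterior for each example
    labels = soft.argmax(axis=1)
    # hard posterior: 1 at the argmax, 0 everywhere else
    hard = np.zeros_like(soft)
    hard[np.arange(len(X)), labels] = 1.0
    return soft, hard, labels

# Three unit vectors and two clusters with made-up parameters.
X = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.6, 0.0, 0.8]])
mus = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
soft, hard, labels = posteriors(X, mus, kappas=[10.0, 10.0], alphas=[0.5, 0.5])
```

Each row of `soft` sums to 1 and gives the membership probabilities; `hard` keeps only the winning cluster.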

Relationship between spkmeans and movMF

Spherical k-means is a special case of both movMF algorithms.

  • If for each cluster we enforce all of the weights to be equal $\alpha_i = 1/\text{n\_clusters}$ and all concentrations to be equal and infinite $\kappa_i \rightarrow \infty$, then soft-movMF behaves as spkmeans.

  • Similarly, if for each cluster we enforce all of the weights to be equal and all concentrations to be equal (with any value), then hard-movMF behaves as spkmeans.
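The hard-movMF case can be checked directly. With equal weights and equal concentrations, the terms $\log \alpha$ and $\log c(\kappa)$ are the same for every cluster, so the argmax depends only on the cosine similarity $\mu_k^T x$, which is the spkmeans assignment step. The sketch below is a bare-bones spherical k-means in NumPy, not the package's code; `spherical_kmeans` and `shared_offset` are illustrative names.

```python
import numpy as np

def spherical_kmeans(X, init_centers, n_iter=20):
    """Minimal spherical k-means: cosine assignment, normalized centroids."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centers = init_centers / np.linalg.norm(init_centers, axis=1, keepdims=True)
    for _ in range(n_iter):
        # assignment step: nearest center by cosine similarity
        labels = (X @ centers.T).argmax(axis=1)
        # update step: sum the members, then project back onto the sphere
        for k in range(len(centers)):
            total = X[labels == k].sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:  # leave empty or degenerate clusters unchanged
                centers[k] = total / norm
    labels = (X @ centers.T).argmax(axis=1)
    return centers, labels

# Two well-separated groups around the x- and y-axes (synthetic data).
rng = np.random.default_rng(0)
group_a = np.array([1.0, 0.0, 0.0]) + 0.05 * rng.normal(size=(20, 3))
group_b = np.array([0.0, 1.0, 0.0]) + 0.05 * rng.normal(size=(20, 3))
X = np.vstack([group_a, group_b])
centers, labels = spherical_kmeans(X, init_centers=X[[0, 20]])

# Hard-movMF scores with equal weights and equal kappa: a shared offset
# plus kappa times the cosine similarity. The offset value is arbitrary
# here; it stands in for log(alpha) + log c(kappa).
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
kappa, shared_offset = 7.0, -1.234
hard_labels = (shared_offset + kappa * (X_unit @ centers.T)).argmax(axis=1)
```

`hard_labels` matches `labels` for any positive `kappa`, because adding a constant and scaling by a positive factor do not change an argmax.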

Other goodies

  • A utility for sampling from a multivariate von Mises Fisher distribution in spherecluster/util.py.
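For intuition about what such samples look like, here is a small stand-alone sampler. It is not the utility in spherecluster/util.py and does not reproduce its interface: it only handles the 3-D case, where the component along the mean direction can be drawn by inverting its CDF in closed form. The name `sample_vmf_3d` is hypothetical.

```python
import numpy as np

def sample_vmf_3d(mu, kappa, n, rng):
    """Draw n points from a vMF distribution on the ordinary sphere (3-D only)."""
    mu = np.asarray(mu, dtype=float)
    mu = mu / np.linalg.norm(mu)
    # w = mu . x has density proportional to exp(kappa * w) on [-1, 1];
    # invert its CDF, written to avoid overflow for large kappa.
    u = rng.uniform(size=n)
    w = 1.0 + np.log(u + (1.0 - u) * np.exp(-2.0 * kappa)) / kappa
    # The component orthogonal to the mean direction is uniform on a circle.
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
    r = np.sqrt(np.clip(1.0 - w**2, 0.0, None))
    samples = np.column_stack([r * np.cos(angle), r * np.sin(angle), w])
    # Householder reflection that maps the z-axis onto mu.
    e3 = np.array([0.0, 0.0, 1.0])
    v = e3 - mu
    if np.linalg.norm(v) > 1e-12:
        H = np.eye(3) - 2.0 * np.outer(v, v) / (v @ v)
        samples = samples @ H  # H is symmetric, so this applies H to each row
    return samples

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0, 0.0])
samples = sample_vmf_3d(mu, kappa=50.0, n=2000, rng=rng)
mean_cosine = float((samples @ mu).mean())
```

With a concentration as high as 50, `mean_cosine` comes out close to 1, reflecting a tight cluster around `mu`; lowering `kappa` spreads the points over the sphere.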

Installation

Clone this repo and run

python setup.py install

or via PyPI

pip install spherecluster

The package requires that numpy and scipy are installed independently first.

Usage

Both SphericalKMeans and VonMisesFisherMixture are standard sklearn estimators and mirror the parameter names for sklearn.cluster.KMeans.

# Find K clusters from data matrix X (n_examples x n_features)

# spherical k-means
from spherecluster import SphericalKMeans
skm = SphericalKMeans(n_clusters=K)
skm.fit(X)

# skm.cluster_centers_
# skm.labels_
# skm.inertia_

# movMF-soft
from spherecluster import VonMisesFisherMixture
vmf_soft = VonMisesFisherMixture(n_clusters=K, posterior_type='soft')
vmf_soft.fit(X)

# vmf_soft.cluster_centers_
# vmf_soft.labels_
# vmf_soft.weights_
# vmf_soft.concentrations_
# vmf_soft.inertia_

# movMF-hard
from spherecluster import VonMisesFisherMixture
vmf_hard = VonMisesFisherMixture(n_clusters=K, posterior_type='hard')
vmf_hard.fit(X)

# vmf_hard.cluster_centers_
# vmf_hard.labels_
# vmf_hard.weights_
# vmf_hard.concentrations_
# vmf_hard.inertia_

The full set of parameters for the VonMisesFisherMixture class is documented in the class docstring; see help(VonMisesFisherMixture).

Notes:

  • X can be a dense numpy.array or a sparse scipy.sparse.csr_matrix

  • VonMisesFisherMixture has been tested successfully with sparse documents of dimension n_features = 43256. When n_features is very large, the algorithm may encounter numerical instability, most likely due to the scaling factor of the log-vMF distribution.

  • cluster_centers_ in VonMisesFisherMixture are dense vectors in the current implementation.

  • Mixture weights can be manually controlled (overridden) instead of learned.
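Since both models describe points on the unit sphere, it can help to L2-normalize the rows of a dense X before fitting. Whether the estimators already do this internally is not stated here, so treat the following as a precaution rather than a requirement; `l2_normalize_rows` is a hypothetical helper, not part of the package. For sparse input, sklearn.preprocessing.normalize does the same job.

```python
import numpy as np

def l2_normalize_rows(X, eps=1e-12):
    """Scale each row of a dense array to unit Euclidean length."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    if np.any(norms < eps):
        # An all-zero row has no direction, so it cannot be placed on the sphere.
        raise ValueError("zero rows cannot be projected onto the unit sphere")
    return X / norms

X_raw = np.array([[3.0, 4.0], [0.0, 2.0], [1.0, 1.0]])
X_unit = l2_normalize_rows(X_raw)
```

`X_unit` can then be passed to `fit` in place of the raw matrix.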

Testing

From the base directory, run:

python -m pytest spherecluster/tests/

Examples

Small mix

We reproduce the "small mix" example from Section 6.3 of the paper in examples/small_mix.py. We've adjusted the parameters so that one distribution in the mixture has much lower concentration than the other. This highlights the difference between the movMF algorithms and (spherical) k-means, which does not estimate weight or concentration parameters. We also provide a 3D version of this example in examples/small_mix_3d.py for fun.

Running these scripts will spit out some additional performance metrics for each algorithm.

[Figures: Small mix 2d, Small mix 3d]

It is clear from the figures that the movMF algorithms do a better job by taking advantage of the concentration estimate.

Document clustering

We also reproduce the scikit-learn TF-IDF (with optional LSA) + k-means demo in examples/document_clustering.py. Results differ from run to run; here is a chart comparing the algorithms' performance for a sample run:

[Figure: Document clustering]

Spherical k-means, which is a simple, low-cost modification of the standard k-means algorithm, performs quite well on this example.
