Genie: Fast and Robust Hierarchical Clustering with Noise Point Detection

Genie outputs meaningful clusters and is fast even on large data sets.

Documentation, tutorials, and benchmarks are available at https://genieclust.gagolewski.com/.

About

genieclust is a faster and more powerful version of Genie, a robust and outlier-resistant clustering algorithm (see Gagolewski, Bartoszuk, and Cena, 2016) that was originally included in the R package genie.

The idea behind Genie is beautifully simple. First, make each individual point the only member of its own cluster. Then, keep merging pairs of the closest clusters, one after another. However, to prevent the formation of clusters of highly imbalanced sizes, a cluster of the smallest size will sometimes be merged with its nearest neighbour instead.
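The merging rule described above can be sketched in a few lines of pure Python. This is only a naive illustration, not the package's implementation: it assumes that imbalance is measured by the normalised Gini index of the cluster sizes and that the special rule applies whenever this index exceeds a threshold (0.3 here, chosen for illustration). The names gini and naive_genie are hypothetical.

```python
import math

def gini(sizes):
    """Normalised Gini index of cluster sizes (0 means perfectly balanced)."""
    x = sorted(sizes)
    n = len(x)
    if n < 2:
        return 0.0
    # For ascending x, sum_{i<j} (x_j - x_i) equals sum_i (2i - n - 1) * x_i.
    total = sum((2 * i - n - 1) * v for i, v in enumerate(x, start=1))
    return total / ((n - 1) * sum(x))

def naive_genie(points, n_clusters, gini_threshold=0.3):
    """Cubic-time sketch of the idea; suitable for tiny inputs only."""
    n = len(points)
    labels = list(range(n))  # every point starts in its own cluster
    while len(set(labels)) > n_clusters:
        sizes = {}
        for lab in labels:
            sizes[lab] = sizes.get(lab, 0) + 1
        # Assumption: sizes count as "too imbalanced" above the threshold.
        restrict = gini(list(sizes.values())) > gini_threshold
        min_size = min(sizes.values())
        best = None  # (distance, i, j) of the closest admissible pair
        for i in range(n):
            if restrict and sizes[labels[i]] != min_size:
                continue  # point i must belong to a smallest cluster
            for j in range(n):
                if labels[i] == labels[j]:
                    continue  # only pairs from different clusters
                d = math.dist(points[i], points[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        old, new = labels[j], labels[i]
        labels = [new if lab == old else lab for lab in labels]
    # Relabel the clusters as 0, 1, ..., n_clusters - 1.
    remap = {lab: k for k, lab in enumerate(sorted(set(labels)))}
    return [remap[lab] for lab in labels]

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(naive_genie(pts, 2))  # prints [0, 0, 0, 1, 1, 1]
```

With the restriction switched off (for example, gini_threshold=1.0), the sketch reduces to plain single linkage, which is the behaviour the size-balancing rule is meant to correct.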

Genie's appealing simplicity goes hand in hand with its usability; it often outperforms other clustering approaches such as K-means, BIRCH, or average, Ward, and complete linkage on benchmark data.

Genie is also very fast: determining the whole cluster hierarchy for datasets of millions of points can be completed within a coffee break. It is therefore well suited to extreme clustering tasks (large datasets with any number of clusters to detect) on data that fit into memory. Thanks to the use of nmslib, sparse or string inputs are also supported.

It also allows clustering with respect to mutual reachability distances, so it can act as a noise point detector or as a robustified version of HDBSCAN* (see Campello et al., 2015) that is able to detect a predefined number of clusters. Hence, it does not depend on DBSCAN's somewhat difficult-to-set eps parameter.
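As a rough illustration of the mutual reachability distance mentioned above, the sketch below follows the common definition from the HDBSCAN* literature: the distance between two points is raised to at least the core distance of each, where the core distance is the distance to a point's M-th nearest neighbour. Conventions differ on whether a point counts as its own neighbour; this sketch does not count it, and the package may use a different convention. The function name mutual_reachability is hypothetical.

```python
import math

def mutual_reachability(points, M):
    """Return (matrix of mutual reachability distances, core distances)."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    # Core distance: distance to the M-th nearest neighbour. Index 0 of each
    # sorted row is the point itself at distance 0, so index M skips it.
    core = [sorted(row)[M] for row in d]
    d_m = [
        [0.0 if i == j else max(d[i][j], core[i], core[j]) for j in range(n)]
        for i in range(n)
    ]
    return d_m, core

# Three nearby points and one isolated point on a line.
d_m, core = mutual_reachability([(0,), (1,), (2,), (10,)], M=2)
print(core)  # the isolated point has by far the largest core distance
```

Points in sparse regions have large core distances, so they are pushed away from everything else; this is what makes it possible to flag them as noise.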

Author and Contributors

Author and maintainer: Marek Gagolewski

Contributors of the code from the original R package genie: Anna Cena, Maciej Bartoszuk

The computation of some partition similarity scores (namely, the normalised accuracy and pair sets index) is based on an implementation of the shortest augmenting path algorithm for the rectangular assignment problem contributed by Peter M. Larsen.

Python and R Package Features

The implemented algorithms include:

  • Genie++ - a reimplementation of the original Genie algorithm with a scikit-learn-compatible interface (Gagolewski et al., 2016); much faster than the original one; supports approximate disconnected MSTs;
  • Genie+HDBSCAN* - our robustified (Geniefied) retake on the HDBSCAN* (Campello et al., 2015) method that detects noise points in data and outputs clusters of predefined sizes;
  • (Python only, experimental preview) Genie+Ic (GIc) - Cena's (2018) algorithm to minimise the information theoretic criterion discussed by Mueller et al. (2012).

See classes genieclust.Genie and genieclust.GIc (Python) or functions gclust() and genie() (R).

Other goodies:

  • Inequity measures (the normalised Gini, Bonferroni, and De Vergottini indices);
  • Functions to compare partitions (adjusted and unadjusted Rand, adjusted and unadjusted Fowlkes-Mallows (FM), adjusted, normalised, and unadjusted mutual information (MI) scores, normalised accuracy and pair sets index (PSI));
  • (Python only) Union-find (disjoint sets) data structures (with extensions);
  • (Python only) Useful R-like plotting functions.
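To give a sense of what a partition comparison score measures, here is a minimal pure-Python sketch of the unadjusted Rand index, computed directly from its textbook definition: the fraction of point pairs on which two partitions agree (both put the pair together, or both keep it apart). The function name rand_index is hypothetical and is not the package's API, which should be looked up in the documentation.

```python
from itertools import combinations

def rand_index(a, b):
    """Unadjusted Rand index of two label vectors of equal length (>= 2)."""
    if len(a) != len(b):
        raise ValueError("label vectors must have the same length")
    pairs = list(combinations(range(len(a)), 2))
    # A pair "agrees" if it is co-clustered in both partitions or in neither.
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # prints 1.0
```

The score ignores how clusters are numbered, which is why the two label vectors above are treated as identical partitions.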

Examples, Tutorials, and Documentation

The Python language version of genieclust has a familiar scikit-learn-like look-and-feel:

import genieclust
X = ... # some data
g = genieclust.Genie(n_clusters=2)
labels = g.fit_predict(X)

R's interface is compatible with hclust(), but there is more.

X <- ... # some data
h <- gclust(X)
plot(h) # plot cluster dendrogram
cutree(h, k=2)
# or genie(X, k=2)

Check out the tutorials and the package documentation at https://genieclust.gagolewski.com/.

How to Install

Python Version

PyPI

To install via pip (see PyPI):

pip3 install genieclust

The package requires Python 3.7+ together with cython as well as numpy, scipy, matplotlib, nmslib, and scikit-learn. Optional dependency: mlpack.

R Version

CRAN

To install the most recent release, call:

install.packages("genieclust")

See the package entry on CRAN.

Other

Note that the core functionality is implemented in the form of a header-only C++ library, so it can be adapted relatively easily for use in other environments.

Any contributions are welcome (e.g., Julia, Matlab, ...).

License

Copyright (C) 2018-2022 Marek Gagolewski (https://www.gagolewski.com)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License Version 3, 19 November 2007, published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License Version 3 for more details. You should have received a copy of the License along with this program. If not, see (https://www.gnu.org/licenses/).


The file src/c_scipy_rectangular_lsap.h is adapted from the scipy project (https://scipy.org/scipylib/), source: /scipy/optimize/rectangular_lsap/rectangular_lsap.cpp. Author: Peter M. Larsen. Distributed under the BSD-3-Clause license.

References

When using genieclust in research publications, please cite (Gagolewski, 2021) and (Gagolewski, Bartoszuk, Cena, 2016) as specified below. Thank you.

Gagolewski M., genieclust: Fast and robust hierarchical clustering, SoftwareX 15, 2021, 100722. doi:10.1016/j.softx.2021.100722.

Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, 8-23. doi:10.1016/j.ins.2016.05.003.

Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 2021, 620-636. doi:10.1016/j.ins.2021.10.004.

Cena A., Gagolewski M., Genie+OWA: Robustifying Hierarchical Clustering with OWA-based Linkages, Information Sciences 520, 2020, 324-336. doi:10.1016/j.ins.2020.02.025.

Cena A., Adaptive hierarchical clustering algorithms based on data aggregation methods, PhD Thesis, Systems Research Institute, Polish Academy of Sciences, 2018.

Campello R., Moulavi D., Zimek A., Sander J., Hierarchical density estimates for data clustering, visualization, and outlier detection, ACM Transactions on Knowledge Discovery from Data 10(1), 2015, 5:1-5:51. doi:10.1145/2733381.

Mueller A., Nowozin S., Lampert C.H., Information Theoretic Clustering using Minimum Spanning Trees, DAGM-OAGM, 2012.

See https://genieclust.gagolewski.com/ for more.
