nmonath / graphgrove

License: Apache-2.0
A framework for building (and incrementally growing) graph-based data structures used in hierarchical or DAG-structured clustering and nearest neighbor search

Programming Languages

C++
C

Projects that are alternatives of or similar to graphgrove

Clustering-in-Python
Clustering methods in Machine Learning includes both theory and python code of each algorithm. Algorithms include K Mean, K Mode, Hierarchical, DB Scan and Gaussian Mixture Model GMM. Interview questions on clustering are also added in the end.
Stars: ✭ 27 (-6.9%)
Mutual labels:  clustering, hierarchical-clustering
genieclust
Genie++ Fast and Robust Hierarchical Clustering with Noise Point Detection - for Python and R
Stars: ✭ 34 (+17.24%)
Mutual labels:  clustering, hierarchical-clustering
adventures-with-ann
All the code for a series of Medium articles on Approximate Nearest Neighbors
Stars: ✭ 40 (+37.93%)
Mutual labels:  nearest-neighbor-search, nearest-neighbors
Lopq
Training of Locally Optimized Product Quantization (LOPQ) models for approximate nearest neighbor search of high dimensional data in Python and Spark.
Stars: ✭ 530 (+1727.59%)
Mutual labels:  clustering, nearest-neighbor-search
Smile
Statistical Machine Intelligence & Learning Engine
Stars: ✭ 5,412 (+18562.07%)
Mutual labels:  clustering, nearest-neighbor-search
pynanoflann
Unofficial python wrapper to the nanoflann k-d tree
Stars: ✭ 24 (-17.24%)
Mutual labels:  nearest-neighbor-search, nearest-neighbors
R-stats-machine-learning
Misc Statistics and Machine Learning codes in R
Stars: ✭ 33 (+13.79%)
Mutual labels:  clustering, nearest-neighbors
hierarchical-clustering
A Python implementation of divisive and hierarchical clustering algorithms. The algorithms were tested on the Human Gene DNA Sequence dataset and dendrograms were plotted.
Stars: ✭ 62 (+113.79%)
Mutual labels:  clustering, hierarchical-clustering
LIUM
Scripts for LIUM SpkDiarization tools
Stars: ✭ 28 (-3.45%)
Mutual labels:  clustering
rabbitmq-peer-discovery-aws
AWS-based peer discovery backend for RabbitMQ 3.7.0+
Stars: ✭ 23 (-20.69%)
Mutual labels:  clustering
dropClust
Version 2.1.0 released
Stars: ✭ 19 (-34.48%)
Mutual labels:  clustering
lannister
A lightweight MQTT broker w/ full spec,Clustering,WebSocket,SSL written in Java
Stars: ✭ 20 (-31.03%)
Mutual labels:  clustering
treecut
Find nodes in hierarchical clustering that are statistically significant
Stars: ✭ 26 (-10.34%)
Mutual labels:  clustering
BPRMeth
Modelling DNA methylation profiles
Stars: ✭ 18 (-37.93%)
Mutual labels:  clustering
dbscan-python
[New Version] Theoretically Efficient and Practical Parallel DBSCAN
Stars: ✭ 18 (-37.93%)
Mutual labels:  clustering
kdtree
A k-d tree implementation in Go.
Stars: ✭ 98 (+237.93%)
Mutual labels:  nearest-neighbor-search
Unsupervised-Learning-in-R
Workshop (6 hours): Clustering (Hdbscan, LCA, Hopach), dimension reduction (UMAP, GLRM), and anomaly detection (isolation forests).
Stars: ✭ 34 (+17.24%)
Mutual labels:  clustering
sentences-similarity-cluster
Calculate similarity of sentences & Cluster the result.
Stars: ✭ 14 (-51.72%)
Mutual labels:  hierarchical-clustering
realtimemap-dotnet
A showcase for Proto.Actor - an ultra-fast distributed actors solution for Go, C#, and Java/Kotlin.
Stars: ✭ 47 (+62.07%)
Mutual labels:  clustering
ex united
Easily spawn Elixir nodes (supervising, Mix configured, easy asserted / refuted) within ExUnit tests
Stars: ✭ 40 (+37.93%)
Mutual labels:  clustering

Install

Linux wheels are available on PyPI (Python >= 3.6):

pip install graphgrove

Building from source:

conda create -n gg python=3.8
conda activate gg
pip install numpy
make

To build your own wheel:

conda create -n gg python=3.8
conda activate gg
pip install numpy
make
pip install build
python -m build --wheel
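# The resulting wheel can then be installed with, for example:
# pip install --force-reinstall dist/graphgrove-0.0.1-cp37-cp37m-linux_x86_64.whl
# (the exact filename depends on the Python version and platform used for the build)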

Examples

Toy examples of clustering, DAG-structured clustering, and nearest neighbor search are available.

At a high level, incremental clustering can be done as:

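import numpy as np
import graphgrove as gg
k = 5
cores = 4  # example value, as in the nearest neighbor example below
num_rounds = 50
# one threshold per round, decreasing geometrically from 1.0 to 0.001
thresholds = np.geomspace(1.0, 0.001, num_rounds).astype(np.float32)
scc = gg.vec_scc.Cosine_SCC(k=k, num_rounds=num_rounds, thresholds=thresholds, index_name='cosine_sgtree', cores=cores, verbosity=0)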
# data_batches - generator of numpy matrices mini-batch-size by dim
for batch in data_batches:
    scc.partial_fit(batch)
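
The data_batches generator in the snippet above is left undefined. A minimal sketch of one way to produce it with NumPy is shown here. The helper names make_data_batches and unit_normalize are hypothetical and not part of graphgrove, and the float32 dtype and unit-length rows are assumptions (chosen because the thresholds above are cast to float32 and the index is cosine-based), not documented requirements.

```python
import numpy as np


def unit_normalize(vectors, eps=1e-12):
    """Scale each row to unit L2 norm (assumed convenient for cosine-based indexes)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)


def make_data_batches(vectors, batch_size):
    """Yield consecutive (batch_size x dim) float32 slices; the last one may be smaller."""
    vectors = np.ascontiguousarray(vectors, dtype=np.float32)
    for start in range(0, vectors.shape[0], batch_size):
        yield vectors[start:start + batch_size]


# Synthetic stand-in data: 1000 points in 16 dimensions.
rng = np.random.default_rng(0)
points = unit_normalize(rng.normal(size=(1000, 16))).astype(np.float32)

# This generator can be passed to the loop above in place of data_batches.
data_batches = make_data_batches(points, batch_size=256)
```

Because a generator is consumed once, a fresh one needs to be created for each pass over the data.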

Incremental nearest neighbor search can be done as:

import graphgrove as gg
k = 5
cores = 4
tree = gg.graph_builder.Cosine_SGTree(k=k, cores=cores)
# data_batches - generator of numpy matrices mini-batch-size by dim
for batch in data_batches:
    tree.insert(batch) # or tree.insert_and_knn(batch) 
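
When experimenting with the tree on small data, a brute-force baseline can be useful for sanity-checking neighbor results. The sketch below is plain NumPy and does not use graphgrove. The function name brute_force_cosine_knn is hypothetical, and it assumes cosine distance is defined as 1 minus cosine similarity and that no row is all zeros. The return format of tree.insert_and_knn is not described here, so any comparison against it would need to be adapted to what that call actually returns.

```python
import numpy as np


def brute_force_cosine_knn(data, queries, k):
    """Return (indices, distances) of the k nearest rows of `data` for each query.

    Distance is 1 - cosine similarity (an assumed convention); rows must be nonzero.
    """
    data_n = data / np.linalg.norm(data, axis=1, keepdims=True)
    queries_n = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    dists = 1.0 - queries_n @ data_n.T            # shape: (num_queries, num_points)
    idx = np.argsort(dists, axis=1)[:, :k]        # k smallest distances per query
    return idx, np.take_along_axis(dists, idx, axis=1)


# Tiny example: three 2-d points and one query close to the first point.
data = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
queries = np.array([[1.0, 0.1]])
neighbor_ids, neighbor_dists = brute_force_cosine_knn(data, queries, k=2)
```

This costs time proportional to the number of queries times the number of points, so it is only practical as a check on small inputs.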

Algorithms Implemented

Clustering:

  • Sub-Cluster Component Algorithm (SCC) and its minibatch variant from the paper: Scalable Hierarchical Agglomerative Clustering. Nicholas Monath, Kumar Avinava Dubey, Guru Guruganesh, Manzil Zaheer, Amr Ahmed, Andrew McCallum, Gokhan Mergen, Marc Najork, Mert Terzihan, Bryon Tjanaka, Yuan Wang, Yuchen Wu. KDD 2021.
  • DAG-structured clustering (LLama) from the paper: DAG-Structured Clustering by Nearest Neighbors. Nicholas Monath, Manzil Zaheer, Kumar Avinava Dubey, Amr Ahmed, Andrew McCallum. AISTATS 2021.

Nearest Neighbor Search:

  • CoverTree: Alina Beygelzimer, Sham Kakade, and John Langford. "Cover trees for nearest neighbor." ICML. 2006.
  • SGTree: SG-Tree is a new data structure for exact nearest neighbor search inspired by Cover Tree and its improvement, which has been used in the TerraPattern project. At a high level, SG-Tree tries to create a hierarchical tree where each node performs a "coarse" clustering. The centers of these "clusters" become the children, and subsequent insertions are recursively performed on these children. When performing the NN query, we prune out solutions based on a subset of the dimensions that are being queried. This is particularly useful when trying to find the nearest neighbor in a highly clustered subset of the data, e.g. when the data comes from a recursive mixture of Gaussians or, more generally, a time-marginalized coalescent process. The effect of these two optimizations is that our data structure is extremely simple, highly parallelizable, and comparable in performance to existing NN implementations on many datasets. Manzil Zaheer, Guru Guruganesh, Golan Levin, Alexander Smola. TerraPattern: A Nearest Neighbor Search Service. 2019.
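
The idea of coarse clusters whose centers guide an exact search can be illustrated with a small, single-level sketch. This is not the SG-Tree or Cover Tree implementation: it uses one level instead of a recursive tree, assumes Euclidean distance, picks centers at random, and prunes whole clusters with the triangle inequality rather than with a subset of dimensions. The names build_coarse_clusters and pruned_nearest are hypothetical.

```python
import numpy as np


def build_coarse_clusters(points, num_centers, seed=0):
    """Pick random points as centers and assign every point to its nearest center."""
    rng = np.random.default_rng(seed)
    center_ids = rng.choice(len(points), size=num_centers, replace=False)
    centers = points[center_ids]
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)
    members = [np.flatnonzero(assignment == c) for c in range(num_centers)]
    # Radius of a cluster = distance from its center to its farthest member.
    radii = np.array([dists[m, c].max() if len(m) else 0.0
                      for c, m in enumerate(members)])
    return centers, members, radii


def pruned_nearest(points, centers, members, radii, query):
    """Exact nearest neighbor; returns (index, distance, number of clusters scanned)."""
    d_centers = np.linalg.norm(centers - query, axis=1)
    best_idx, best_dist, visited = -1, np.inf, 0
    for c in np.argsort(d_centers):               # closest centers first
        # Triangle inequality: every member p of cluster c satisfies
        # d(query, p) >= d(query, center_c) - radius_c, so the cluster
        # can be skipped when that lower bound cannot beat the current best.
        if len(members[c]) == 0 or d_centers[c] - radii[c] >= best_dist:
            continue
        visited += 1
        d = np.linalg.norm(points[members[c]] - query, axis=1)
        j = d.argmin()
        if d[j] < best_dist:
            best_idx, best_dist = members[c][j], d[j]
    return best_idx, best_dist, visited


# Clustered toy data: 20 Gaussian blobs of 100 points each in 8 dimensions.
rng = np.random.default_rng(0)
blob_means = rng.normal(scale=10.0, size=(20, 8))
points = np.concatenate([m + rng.normal(size=(100, 8)) for m in blob_means])

centers, members, radii = build_coarse_clusters(points, num_centers=40)
query = points[0] + 0.05
idx, dist, visited = pruned_nearest(points, centers, members, radii, query)
print(f"nearest index {idx}, distance {dist:.4f}, scanned {visited} of {len(centers)} clusters")
```

On well-separated clustered data, most clusters tend to be skipped, which is the intuition behind why tree-based exact search can work well in that regime.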