All Projects → wangyiqiu → dbscan-python

wangyiqiu / dbscan-python

Licence: MIT License
[New Version] Theoretically Efficient and Practical Parallel DBSCAN

Programming Languages

C++
36643 projects - #6 most used programming language
python
139335 projects - #7 most used programming language
CMake
9771 projects
c
50402 projects - #5 most used programming language

Projects that are alternatives of or similar to dbscan-python

point-cloud-clusters
A catkin workspace in ROS which uses DBSCAN to identify which points in a point cloud belong to the same object.
Stars: ✭ 43 (+138.89%)
Mutual labels:  clustering, dbscan, dbscan-clustering
Clustering-in-Python
Clustering methods in Machine Learning includes both theory and python code of each algorithm. Algorithms include K Mean, K Mode, Hierarchical, DB Scan and Gaussian Mixture Model GMM. Interview questions on clustering are also added in the end.
Stars: ✭ 27 (+50%)
Mutual labels:  clustering, dbscan, dbscan-clustering
ClusterAnalysis.jl
Cluster Algorithms from Scratch with Julia Lang. (K-Means and DBSCAN)
Stars: ✭ 22 (+22.22%)
Mutual labels:  clustering, dbscan, dbscan-clustering
text clustering
文本聚类(Kmeans、DBSCAN、LDA、Single-pass)
Stars: ✭ 230 (+1177.78%)
Mutual labels:  clustering, dbscan
hpdbscan
Highly parallel DBSCAN (HPDBSCAN)
Stars: ✭ 19 (+5.56%)
Mutual labels:  clustering, dbscan
gouda
Golang Utilities for Data Analysis
Stars: ✭ 18 (+0%)
Mutual labels:  clustering, dbscan
st dbscan
ST-DBSCAN: Simple and effective tool for spatial-temporal clustering
Stars: ✭ 82 (+355.56%)
Mutual labels:  clustering, dbscan-clustering
DBSCAN
c++ implementation of clustering by DBSCAN
Stars: ✭ 89 (+394.44%)
Mutual labels:  clustering, dbscan
dbscan
DBSCAN Clustering Algorithm C# Implementation
Stars: ✭ 38 (+111.11%)
Mutual labels:  clustering, dbscan
Spectre
A computational toolkit in R for the integration, exploration, and analysis of high-dimensional single-cell cytometry and imaging data.
Stars: ✭ 31 (+72.22%)
Mutual labels:  clustering
genie
Genie: A Fast and Robust Hierarchical Clustering Algorithm (this R package has now been superseded by genieclust)
Stars: ✭ 21 (+16.67%)
Mutual labels:  clustering
hclust
Hierarchical clustering in JavaScript
Stars: ✭ 39 (+116.67%)
Mutual labels:  clustering
watchman
Watchman: An open-source social-media event-detection system
Stars: ✭ 18 (+0%)
Mutual labels:  clustering
Fred
A fast, scalable and light-weight C++ Fréchet distance library, exposed to python and focused on (k,l)-clustering of polygonal curves.
Stars: ✭ 13 (-27.78%)
Mutual labels:  clustering
DBSCANSD
Java implementation for DBSCANSD, a trajectory clustering algorithm.
Stars: ✭ 35 (+94.44%)
Mutual labels:  clustering
Python-Machine-Learning-Fundamentals
D-Lab's 6 hour introduction to machine learning in Python. Learn how to perform classification, regression, clustering, and do model selection using scikit-learn and TPOT.
Stars: ✭ 46 (+155.56%)
Mutual labels:  clustering
hotspot3d
3D hotspot mutation proximity analysis tool
Stars: ✭ 43 (+138.89%)
Mutual labels:  clustering
ST-DBSCAN
Implementation of ST-DBSCAN algorithm based on Birant 2007
Stars: ✭ 25 (+38.89%)
Mutual labels:  dbscan
kmeans
K-Means clustering
Stars: ✭ 51 (+183.33%)
Mutual labels:  clustering
pytorch kmeans
Implementation of the k-means algorithm in PyTorch that works for large datasets
Stars: ✭ 38 (+111.11%)
Mutual labels:  clustering

Overview

This repository hosts fast parallel DBSCAN clustering code for low dimensional Euclidean space. The code automatically uses the available threads on a parallel shared-memory machine to speedup DBSCAN clustering. It stems from a paper presented in SIGMOD'20: Theoretically Efficient and Practical Parallel DBSCAN.

Our software on 1 thread is on par with all serial state-of-the-art DBSCAN packages, and provides additional speedup via multi-threading. Below, we show a simple benchmark comparing our code with the DBSCAN implementation of Sklearn, tested on a 4-core computer, and a visualization of the clustering result. The time saved will be more significant on a larger data set and a machine with more cores.

Data sets with dimensionality 2 - 20 are supported by default, which can be modified by modifying Caller::computeDBSCAN in the source code.

timing example

Tutorial

Option 1: Use the binary executable

Compile and run the program:

mkdir build
cd build
cmake ..
cd executable
make -j # this will take a while
./dbscan -eps 0.1 -minpts 10 -o clusters.txt <data-file>

The <data-file> can be any CSV-like point data file, where each line contains a data point -- see an example here. The data file can be either with or without header. The cluster output clusters.txt will contain a cluster ID on each line (other than the first-line header), giving a cluster assignment in the same ordering as the input file. A noise point will have a cluster ID of -1.

Option 2: Use the Python binding (experimental)

We are developing a Python wrapper using Cython. Currently it is still experimental, as it is only tested to work on a fresh copy of Ubuntu 20.04 (x86_64 with Python 3.8+). There are two ways to install it:

  • Compile it yourself: First install dependencies pip3 install -r src/requirements.txt and sudo apt install libpython3-dev. Navigate to src/, and run the Make script ./make.sh, The compilation will take a few minutes, and generate a .so library containing the DBSCAN module.
  • OR Install it using PyPI: pip3 install --user dbscan (the latest verion is 0.0.9)

An example for using the Python module is provided in src/example.py. If the dependencies above are installed, simply run python3 example.py from src/ to reproduce the plots above.

We plan to support more Python versions and environments, and we welcome any contributions that achieve these.

Python API

from dbscan import DBSCAN
labels, core_samples_mask = DBSCAN(X, eps=0.3, min_samples=10)
Input
  • X: A 2-D Numpy array (dtype=np.float64) containing the input data points. The first dimension of X is the number of data points n, and the second dimension is the data set dimensionality (the maximum supported dimensionality is 20).
  • eps: The epsilon parameter (default 0.5).
  • min_samples: The minPts parameter (default 5).
Output
  • labels: A length n Numpy array (dtype=np.int32) containing cluster IDs of the data points, in the same ordering as the input data. Noise points are given a pseudo-ID of -1.
  • core_samples_mask: A length n Numpy array (dtype=np.bool) masking the core points, in the same ordering as the input data.

We provide a complete example below that generates a toy data set, computes the DBSCAN clustering, and visualizes the result as shown in the plot above. Note that before running the example, the dependencies in src/requirements.txt need to be installed first.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)
X = StandardScaler().fit_transform(X)

# #############################################################################
# Compute DBSCAN
from dbscan import DBSCAN
labels, core_samples_mask = DBSCAN(X, eps=0.3, min_samples=10)

# #############################################################################
# Plot result
import matplotlib.pyplot as plt

n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]

for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

Citation

If you use our work in a publication, we would appreciate citations:

@inproceedings{wang2020theoretically,
  author = {Wang, Yiqiu and Gu, Yan and Shun, Julian},
  title = {Theoretically-Efficient and Practical Parallel DBSCAN},
  year = {2020},
  isbn = {9781450367356},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3318464.3380582},
  doi = {10.1145/3318464.3380582},
  booktitle = {Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data},
  pages = {2555–2571},
  numpages = {17},
  keywords = {parallel algorithms, spatial clustering, DBScan},
  location = {Portland, OR, USA},
  series = {SIGMOD ’20}
}
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].