Markus-Goetz / hpdbscan

License: MIT License
Highly parallel DBSCAN (HPDBSCAN)

Programming Languages

C++
36643 projects - #6 most used programming language
C
50402 projects - #5 most used programming language
CMake
9771 projects
Python
139335 projects - #7 most used programming language
SWIG
194 projects

Projects that are alternatives of or similar to hpdbscan

point-cloud-clusters
A catkin workspace in ROS which uses DBSCAN to identify which points in a point cloud belong to the same object.
Stars: ✭ 43 (+126.32%)
Mutual labels:  clustering, dbscan
dbscan
DBSCAN Clustering Algorithm C# Implementation
Stars: ✭ 38 (+100%)
Mutual labels:  clustering, dbscan
matrix multiplication
Parallel Matrix Multiplication Using OpenMP, Pthreads, and MPI
Stars: ✭ 41 (+115.79%)
Mutual labels:  openmp, mpi
gouda
Golang Utilities for Data Analysis
Stars: ✭ 18 (-5.26%)
Mutual labels:  clustering, dbscan
EFDCPlus
www.eemodelingsystem.com
Stars: ✭ 9 (-52.63%)
Mutual labels:  openmp, mpi
nbodykit
Analysis kit for large-scale structure datasets, the massively parallel way
Stars: ✭ 93 (+389.47%)
Mutual labels:  clustering, mpi
gpubootcamp
This repository contains GPU bootcamp material for HPC and AI
Stars: ✭ 227 (+1094.74%)
Mutual labels:  openmp, mpi
Clustering-in-Python
Clustering methods in Machine Learning includes both theory and python code of each algorithm. Algorithms include K Mean, K Mode, Hierarchical, DB Scan and Gaussian Mixture Model GMM. Interview questions on clustering are also added in the end.
Stars: ✭ 27 (+42.11%)
Mutual labels:  clustering, dbscan
yask
YASK--Yet Another Stencil Kit: a domain-specific language and framework to create high-performance stencil code for implementing finite-difference methods and similar applications.
Stars: ✭ 81 (+326.32%)
Mutual labels:  openmp, mpi
analisis-numerico-computo-cientifico
Numerical analysis and scientific computing
Stars: ✭ 42 (+121.05%)
Mutual labels:  openmp, mpi
text clustering
Text clustering (K-means, DBSCAN, LDA, Single-pass)
Stars: ✭ 230 (+1110.53%)
Mutual labels:  clustering, dbscan
tbslas
A parallel, fast solver for the scalar advection-diffusion and the incompressible Navier-Stokes equations based on semi-Lagrangian/Volume-Integral method.
Stars: ✭ 21 (+10.53%)
Mutual labels:  openmp, mpi
DBSCAN
C++ implementation of clustering by DBSCAN
Stars: ✭ 89 (+368.42%)
Mutual labels:  clustering, dbscan
libquo
Dynamic execution environments for coupled, thread-heterogeneous MPI+X applications
Stars: ✭ 21 (+10.53%)
Mutual labels:  openmp, mpi
Foundations of HPC 2021
This repository collects the materials from the course "Foundations of HPC", 2021, at the Data Science and Scientific Computing Department, University of Trieste
Stars: ✭ 22 (+15.79%)
Mutual labels:  openmp, mpi
ClusterAnalysis.jl
Cluster Algorithms from Scratch with Julia Lang. (K-Means and DBSCAN)
Stars: ✭ 22 (+15.79%)
Mutual labels:  clustering, dbscan
Primecount
🚀 Fast prime counting function implementations
Stars: ✭ 193 (+915.79%)
Mutual labels:  openmp, mpi
Abyss
🔬 Assemble large genomes using short reads
Stars: ✭ 219 (+1052.63%)
Mutual labels:  openmp, mpi
research-computing-with-cpp
UCL-RITS *C++ for Research* engineering course
Stars: ✭ 16 (-15.79%)
Mutual labels:  openmp, mpi
wxparaver
wxParaver is a trace-based visualization and analysis tool designed to study quantitative detailed metrics and obtain qualitative knowledge of the performance of applications, libraries, processors and whole architectures.
Stars: ✭ 23 (+21.05%)
Mutual labels:  openmp, mpi

HPDBSCAN

Highly parallel DBSCAN (HPDBSCAN) is a shared- and distributed-memory parallel implementation of the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. It is written in C++ and may be used as a shared library or as a command line tool.

Dependencies

HPDBSCAN requires the following dependencies. Please make sure that these are installed before attempting to compile the code; an example installation command for Debian-based systems is sketched after the list.

  • CMake 3.10+
  • C++11 compliant compiler (e.g. g++ 4.9+)
  • OpenMP 4.0+ (e.g. g++ 4.9+)
  • HDF5 1.8+
  • (optional) Message Passing Interface (MPI) 2.0+
  • (optional) SWIG 3.0+ (Python bindings with and without MPI support)
  • (optional) Python 3.5+ (Python bindings with and without MPI support, requires headers)
  • (optional) mpi4py (Python bindings with MPI support)
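
As an illustration only, the dependencies can typically be installed on a Debian/Ubuntu system as shown below; the exact package names are assumptions and may differ between distributions and versions:

sudo apt-get install cmake g++ libhdf5-dev libopenmpi-dev swig python3-dev
pip3 install mpi4py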

Compilation

HPDBSCAN follows the standard CMake project conventions. Create a build directory, change into it, generate the build files, and compile. A convenient one-liner is shown below.

mkdir build && cd build && cmake .. && make

The provided CMake script checks for, but does not install, the necessary dependencies listed above. If no MPI installation is present on the system, an OpenMP-only (i.e. thread-based) version is built.
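
The same steps can also be spelled out individually; the build type and job count below are generic CMake/make switches added purely for illustration, not HPDBSCAN-specific options:

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j 4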

Usage

HPDBSCAN's command line usage flags are shown below. You may obtain the same message by invoking hpdbscan -h:

Highly parallel DBSCAN clustering algorithm
Usage:
  HPDBSCAN [OPTION...]

  -h, --help                this help message
  -m, --minPoints arg       density threshold (default: 2)
  -e, --epsilon arg         spatial search radius (default: 0.1)
  -t, --threads arg         utilized threads (default: 4)
  -i, --input arg           input file (default: data.h5)
  -o, --output arg          output file (default: data.h5)
      --input-dataset arg   input dataset name (default: DATA)
      --output-dataset arg  output dataset name (default: CLUSTERS)

The typical basic usage of HPDBSCAN is shown below. The first line is for single-node execution (e.g. on your laptop or PC). The second line shows a typical high-performance computing setup with multiple distributed nodes and several processing cores per node. The data is passed to the application in the form of an HDF5 file.

./hpdbscan -t <THREADS> <PATH_TO_HDF5_FILE>
mpirun -np <NODES> ./hpdbscan -t <THREADS> <PATH_TO_HDF5_FILE>
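
For example, an invocation setting all options explicitly (the file and dataset names shown are just the documented defaults used as placeholders) could look like this:

./hpdbscan -m 4 -e 0.3 -t 8 -i data.h5 -o data.h5 --input-dataset DATA --output-dataset CLUSTERS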

A second file will be created containing a single vector with a label for each data point at the same index. The labels may be unintuitive at first: the value zero indicates a noise point, while labels unequal to zero indicate points belonging to a cluster. Negative labels mark core points of the cluster with the same absolute value.

For example, a point might have the cluster label -3, indicating that it belongs to the cluster with ID 3 and is a core point. A second point might have the cluster label 3, indicating that it also belongs to the cluster with ID 3 but is not a core point of said cluster. Nevertheless, all points with label -3 or 3 belong to the same cluster with ID 3.
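
A minimal sketch of how this label convention can be decoded in Python, assuming the default output file and dataset names and that h5py and numpy are available:

import h5py
import numpy as np

with h5py.File('data.h5', 'r') as handle:
    labels = handle['CLUSTERS'][...]

noise = labels == 0           # zero marks noise points
cluster_ids = np.abs(labels)  # the cluster ID is the absolute label value
core_points = labels < 0      # negative labels mark core points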

Python Bindings

The CMake script will automatically build HPDBSCAN bindings for the Python programming language if SWIG and a Python installation (with headers) are found on the system. Additionally, if your Python installation also provides the mpi4py pip package, MPI support is enabled.

A small programming example is shown below:

import hpdbscan
import numpy as np

data = np.random.rand(100, 3)                             # 100 samples with 3 features each
clusterer = hpdbscan.HPDBSCAN(epsilon=0.3, min_points=4)
labels = clusterer.cluster(data)                          # one label per sample
# alternatively, cluster a dataset stored in an HDF5 file
clusterer.cluster('file.h5', 'data')

The data passed to the binding is expected to be a two-dimensional numpy array, where the first axis is the sample dimension and the second axis the feature dimension. The result is returned as a tuple with the same length as the first data dimension.

Should you want to use the MPI flavor of the binding, please ensure that each MPI rank receives only a disjoint subset of the entire dataset (e.g. equally-sized chunks). After the clustering process, each rank will hold the labels corresponding to the data items it initially passed in.
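
A minimal sketch of such a distributed call, assuming mpi4py is installed and the MPI-enabled binding was built; the chunking scheme below is an illustrative choice, not a requirement of the API:

from mpi4py import MPI
import numpy as np
import hpdbscan

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# each rank works on a disjoint, roughly equally-sized chunk of the dataset
data = np.random.rand(100000, 3)              # in practice, load only this rank's share
chunk = np.array_split(data, size)[rank]

clusterer = hpdbscan.HPDBSCAN(epsilon=0.3, min_points=4)
labels = clusterer.cluster(chunk)             # labels for this rank's chunk

Launched with, e.g., mpirun -np 4 python3 on the script above, each rank afterwards holds the labels for its own chunk of the data.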

Benchmarks

Based on empirical benchmarks, HPDBSCAN outperforms other DBSCAN implementations by a significant margin in terms of execution time. One such benchmarking review was conducted by Helmut Neukirchen [1], for example. Beyond that, the repository contains a small benchmarking suite. If you want to reproduce the benchmarks, please ensure that you have a Python 3.6+ interpreter and the pip packages numpy, pandas, sklearn, h5py and seaborn installed, and execute the scripts in this order: 1. download_datasets.py, 2. benchmark.py, 3. plot.py.
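
Spelled out as commands, and assuming the three scripts are invoked from the directory in the repository that contains them (the sklearn package is installed under its pip name scikit-learn), the sequence reads:

pip3 install numpy pandas scikit-learn h5py seaborn
python3 download_datasets.py
python3 benchmark.py
python3 plot.py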

Below you will find a figure with an execution of the benchmark suite. For each of the three presented datasets, 10 execution time measurements were performed. The plot depicts the average execution time; the black bars show the standard deviation, i.e. the execution time fluctuation. All runs were performed on a single server with an Intel Xeon Gold 6126. The thread count was set to 24, the number of hardware cores, for both tools. The numpy used by sklearn is linked against Intel MKL 2019.1.

[Figure: benchmark results for the three datasets]

The benchmark run for iris is not a mistake; it simply executes in under a second for both tools and is therefore barely visible.

[1] Neukirchen, Helmut. Survey and performance evaluation of DBSCAN spatial clustering implementations for big data and high-performance computing paradigms. Technical report VHI-01-2016, Engineering Research Institute, University of Iceland, 2016.

Citation

If you wish to cite HPDBSCAN in your academic work, please use the following reference:

Plain reference

Götz, M., Bodenstein, C., Riedel, M.,
HPDBSCAN: highly parallel DBSCAN,
Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments, ACM, 2015.

BibTex

@inproceedings{gotz2015hpdbscan,
  title={HPDBSCAN: highly parallel DBSCAN},
  author={G{\"o}tz, Markus and Bodenstein, Christian and Riedel, Morris},
  booktitle={Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments},
  pages={2},
  year={2015},
  organization={ACM}
}

Contact

If you want to let us know about feature requests, bugs, or issues, you are kindly referred to the issue tracker.

For any other discussion, please write an e-mail.
