DwangoMediaVillage / Pqkmeans

Licence: MIT
Fast and memory-efficient clustering

PQk-means

Project | Paper | Tutorial

[Figures: a 2D example using both k-means and PQk-means; large-scale evaluation]

PQk-means [Matsui, Ogaki, Yamasaki, and Aizawa, ACMMM 17] is a Python library for efficient clustering of large-scale data. By first compressing input vectors into short product-quantized (PQ) codes, PQk-means achieves fast and memory-efficient clustering, even for high-dimensional vectors. Similar to k-means, PQk-means repeats the assignment and update steps, both of which can be performed in the PQ-code domain.
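
To make the idea of working "in the PQ-code domain" concrete, here is a minimal pure-Python sketch of product quantization. It is an illustration, not the library's implementation: the helper names (`split`, `pq_encode`, `build_tables`, `symmetric_dist`) are hypothetical, the sizes are toy values, and the codebooks are random rather than trained. The distance shown is the commonly used symmetric distance between two PQ codes, computed from precomputed lookup tables; this README does not state which distance the library uses internally.

```python
import random

def split(vec, num_subdim):
    """Split a vector into num_subdim equal-length sub-vectors."""
    step = len(vec) // num_subdim
    return [vec[i * step:(i + 1) * step] for i in range(num_subdim)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pq_encode(vec, codebooks):
    """Replace each sub-vector with the index of its nearest codeword."""
    code = []
    for sub, codebook in zip(split(vec, len(codebooks)), codebooks):
        nearest = min(range(len(codebook)), key=lambda j: sq_dist(sub, codebook[j]))
        code.append(nearest)
    return code

def build_tables(codebooks):
    """Precompute codeword-to-codeword squared distances for each subspace."""
    return [[[sq_dist(a, b) for b in cb] for a in cb] for cb in codebooks]

def symmetric_dist(code_a, code_b, tables):
    """Approximate distance between two PQ codes using table lookups only."""
    return sum(tables[m][i][j] for m, (i, j) in enumerate(zip(code_a, code_b)))

random.seed(0)
DIM, NUM_SUBDIM, KS = 8, 4, 16  # toy sizes chosen for this sketch

# Assumption: random codebooks for brevity. A real encoder learns them from data.
codebooks = [
    [[random.random() for _ in range(DIM // NUM_SUBDIM)] for _ in range(KS)]
    for _ in range(NUM_SUBDIM)
]
tables = build_tables(codebooks)

x = [random.random() for _ in range(DIM)]
y = [random.random() for _ in range(DIM)]
code_x = pq_encode(x, codebooks)
code_y = pq_encode(y, codebooks)

print("PQ code of x:", code_x)
print("PQ code of y:", code_y)
print("symmetric distance:", symmetric_dist(code_x, code_y, tables))
```

Once the tables are built, comparing two codes touches only `NUM_SUBDIM` table entries and never the original vectors, which is where the speed and memory savings of this family of methods come from.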

For comparison, we also provide ITQ encoding for binary conversion and Binary k-means (Bk-means) [Gong+, CVPR 15] for clustering binary codes.

The main algorithm is written in C++, with wrappers for Python. All encoding and clustering classes are compatible with scikit-learn.

Summary of features

  • Approximation of k-means
  • Tens to hundreds of times faster than k-means
  • Tens to hundreds of times more memory efficient than k-means
  • Compatible with scikit-learn
  • Portable; one-line installation
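
As a rough back-of-the-envelope check of the memory claim, the arithmetic below uses the settings from the Usage example in this README (100,000 vectors, 128 dimensions, 4 sub-vectors, Ks=256). The assumption that raw vectors are stored as 4-byte float32 values is ours, and the figures ignore any overhead of the library's own data structures.

```python
NUM_SAMPLES = 100_000
DIM = 128
BYTES_PER_FLOAT = 4      # assumption: raw vectors stored as float32
NUM_SUBDIM = 4           # sub-vectors per input vector
KS = 256                 # codewords per sub-codebook
BITS_PER_SUBCODE = 8     # log2(256)

# Raw vectors versus PQ codes (4 sub-codes of 8 bits = 4 bytes per vector).
raw_bytes = NUM_SAMPLES * DIM * BYTES_PER_FLOAT
code_bytes = NUM_SAMPLES * NUM_SUBDIM * BITS_PER_SUBCODE // 8

# The codebooks must also be kept: NUM_SUBDIM tables of KS codewords each.
codebook_bytes = NUM_SUBDIM * KS * (DIM // NUM_SUBDIM) * BYTES_PER_FLOAT

ratio = raw_bytes // code_bytes
print(f"raw vectors: {raw_bytes:,} bytes")
print(f"PQ codes:    {code_bytes:,} bytes")
print(f"codebooks:   {codebook_bytes:,} bytes")
print(f"codes are {ratio}x smaller than the raw vectors")
```

The codebook cost is fixed, so it becomes negligible as the number of samples grows.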

Installation

Requisites

  • CMake
    • brew install cmake for OS X
    • sudo apt install cmake for Ubuntu
  • OpenMP (Optional)
    • If OpenMP is installed, it is used automatically to parallelize the algorithm for faster computation.

Build & install

You can install the library from PyPI:

pip install pqkmeans

Or, if you would like to use the current master version, you can build and install the library manually:

git clone --recursive https://github.com/DwangoMediaVillage/pqkmeans.git
cd pqkmeans
python setup.py install

Run samples

# with artificial data
python bin/run_experiment.py --dataset artificial --algorithm bkmeans pqkmeans --k 100
# with texmex dataset (http://corpus-texmex.irisa.fr/)
python bin/run_experiment.py --dataset siftsmall --algorithm bkmeans pqkmeans --k 100

Test

python setup.py test

Usage

For PQk-means

import pqkmeans
import numpy as np
X = np.random.random((100000, 128))  # 100,000 samples, 128 dimensions each

# Train a PQ encoder.
# Each vector is divided into 4 parts, and each part is
# encoded with log2(256) = 8 bits, resulting in a 32-bit PQ code.
encoder = pqkmeans.encoder.PQEncoder(num_subdim=4, Ks=256)
encoder.fit(X[:1000])  # Use a subset of X for training

# Convert input vectors to 32-bit PQ codes, where each PQ code consists of four uint8 values.
# You can train the encoder and convert the input vectors to PQ codes in advance.
X_pqcode = encoder.transform(X)

# Run clustering with k=5 clusters.
kmeans = pqkmeans.clustering.PQKMeans(encoder=encoder, k=5)
clustered = kmeans.fit_predict(X_pqcode)

# Then, clustered[0] is the ID of the center assigned to the first input PQ code (X_pqcode[0]).
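
Since the result holds one cluster ID per input code, a quick way to inspect it is to count cluster sizes and group sample indices by cluster. The sketch below uses a small hand-written list as a stand-in for the output of fit_predict, so it runs without the library; it assumes only that the result is a sequence of integer IDs.

```python
from collections import Counter, defaultdict

# Stand-in for the result of kmeans.fit_predict(X_pqcode) with k=5.
clustered = [2, 0, 2, 1, 4, 2, 0, 3, 1, 2]

# Number of samples assigned to each cluster.
cluster_sizes = Counter(clustered)

# Indices of the samples that belong to each cluster.
members = defaultdict(list)
for index, cluster_id in enumerate(clustered):
    members[cluster_id].append(index)

print("largest cluster:", cluster_sizes.most_common(1)[0])
print("samples in cluster 2:", members[2])
```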

Both the PQ encoder instance (encoder) and the clustering instance (kmeans) can be pickled and reused later.

import pickle

# Save and reload the PQ encoder.
with open('encoder.pkl', 'wb') as f:
    pickle.dump(encoder, f)
with open('encoder.pkl', 'rb') as f:
    encoder_dumped = pickle.load(f)

# Save and reload the clustering instance. It can be reused as a vector quantizer later.
with open('kmeans.pkl', 'wb') as f:
    pickle.dump(kmeans, f)
with open('kmeans.pkl', 'rb') as f:
    kmeans_dumped = pickle.load(f)

For Bk-means

Bk-means is used in almost the same manner as PQk-means:

import pqkmeans
import numpy as np
X = np.random.random((100000, 128))  # 100,000 samples, 128 dimensions each

# Train an ITQ binary encoder
encoder = pqkmeans.encoder.ITQEncoder(num_bit=32)
encoder.fit(X[:1000])  # Use a subset of X for training

# Convert input vectors to binary codes
X_itq = encoder.transform(X)

# Run clustering
kmeans = pqkmeans.clustering.BKMeans(k=5, input_dim=32)
clustered = kmeans.fit_predict(X_itq)
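
For intuition about what clustering binary codes involves, here is a minimal pure-Python sketch of one common formulation of binary k-means: assignment by Hamming distance and a per-bit majority vote for the update. It is not the library's implementation. The function names are hypothetical, the codes are packed into Python integers purely for illustration, and the toy data uses 8-bit codes instead of the 32 bits above.

```python
def hamming(a, b):
    """Number of differing bits between two integer-packed binary codes."""
    return bin(a ^ b).count("1")

def assign(codes, centers):
    """Assignment step: give each code the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda c: hamming(code, centers[c]))
        for code in codes
    ]

def update(codes, labels, centers, num_bit):
    """Update step: each center bit becomes the majority bit of its members."""
    new_centers = []
    for c, old_center in enumerate(centers):
        members = [code for code, label in zip(codes, labels) if label == c]
        if not members:
            new_centers.append(old_center)  # keep an empty cluster's center
            continue
        center = 0
        for bit in range(num_bit):
            ones = sum((m >> bit) & 1 for m in members)
            if 2 * ones > len(members):
                center |= 1 << bit
        new_centers.append(center)
    return new_centers

NUM_BIT = 8
codes = [0b00000001, 0b00000011, 0b00000111,
         0b11110000, 0b11100000, 0b11111000]
centers = [codes[0], codes[3]]  # naive initialization from the data

for _ in range(3):
    labels = assign(codes, centers)
    centers = update(codes, labels, centers, NUM_BIT)

print("labels: ", labels)
print("centers:", [format(c, "08b") for c in centers])
```

Both steps operate on the codes alone, so the original 128-dimensional vectors are not needed once encoding is done.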

Please see the tutorial for more examples.

Note

  • This repository contains a re-implementation of PQk-means with a Python interface. Its behavior may differ from the pure C++ implementation used in the paper.
  • We tested this library with Python 3 on OS X and Ubuntu 16.04.

Authors

  • Keisuke Ogaki designed the whole structure of the library, and implemented most of the Bk-means clustering
  • Yusuke Matsui implemented most of the PQk-means clustering

Reference

@inproceedings{pqkmeans,
    author = {Yusuke Matsui and Keisuke Ogaki and Toshihiko Yamasaki and Kiyoharu Aizawa},
    title = {PQk-means: Billion-scale Clustering for Product-quantized Codes},
    booktitle = {ACM International Conference on Multimedia (ACMMM)},
    year = {2017},
}

Todo

  • Evaluation script for billion-scale data
  • Nearest neighbor search with PQTable
  • Documentation