gibranfp / Sampled-MinHashing

Licence: other
A method to mine beyond-pairwise relationships using Min-Hashing for large-scale pattern discovery

Programming Languages: C, Python, SWIG, CMake

Sampled-MinHashing

Sampled Min-Hashing (SMH) is a simple and scalable method to discover patterns from large-scale dyadic data (e.g. bag of words). SMH relies on Min-Hashing to efficiently mine beyond-pairwise relationships which are clustered to form the final discovered patterns. SMH has been successfully applied to the discovery of objects from image collections and topics from text corpora. This repository includes a C implementation of SMH together with SWIG Python bindings.
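
As a rough illustration of the idea, the sketch below applies Min-Hashing to the occurrence sets of terms (the documents each term appears in) and groups terms whose tuples of min-hash values collide. It is a toy, plain-Python approximation and not the code in this repository: it assumes that hashing operates on per-term document sets, as the inverted file used later in this README suggests, and the function name `cooccurring_sets`, the explicit random permutations, and the parameters `tuple_size` and `num_tuples` are all chosen for illustration. The clustering of the resulting co-occurring sets into final patterns is omitted.

```python
import random
from collections import defaultdict


def cooccurring_sets(occurrences, num_docs, tuple_size=2, num_tuples=50, seed=0):
    """Group terms whose document sets collide under Min-Hashing.

    occurrences: dict mapping term -> set of document IDs containing it.
    Returns a set of frozensets, each a group of terms that shared at
    least one identical tuple of min-hash values.
    """
    rng = random.Random(seed)
    found = set()
    for _ in range(num_tuples):
        # One tuple = tuple_size independent random permutations of doc IDs.
        perms = []
        for _ in range(tuple_size):
            perm = list(range(num_docs))
            rng.shuffle(perm)
            perms.append(perm)
        buckets = defaultdict(set)
        for term, docs in occurrences.items():
            if not docs:
                continue
            # Min-hash value: smallest permuted position among the term's docs.
            key = tuple(min(perm[d] for d in docs) for perm in perms)
            buckets[key].add(term)
        # Keep only buckets where two or more terms collided.
        for terms in buckets.values():
            if len(terms) > 1:
                found.add(frozenset(terms))
    return found


# Hypothetical toy data: two pairs of terms with overlapping document sets.
occurrences = {
    'neural': {0, 1, 2, 3},
    'network': {0, 1, 2, 3},
    'kernel': {6, 7, 8},
    'margin': {6, 7, 8, 9},
}
for group in cooccurring_sets(occurrences, num_docs=10):
    print(sorted(group))
```

Terms with identical document sets always land in the same bucket, terms with disjoint document sets never do, and terms with partial overlap collide with a probability that grows with their Jaccard similarity. Larger tuples make collisions more selective, while more tuples give weakly related terms more chances to collide.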

Installation

Install the dependencies:

sudo apt-get install cmake python swig libpython-dev

Clone and compile the library:

git clone https://github.com/gibranfp/Sampled-MinHashing.git
cd Sampled-MinHashing
mkdir build
cd build
cmake ..
make

To install the library system-wide:

sudo make install

Alternatively, you can use it locally by adding the absolute path of the bin directory inside Sampled-MinHashing to the system path:

export PATH=$PATH:[absolute_path_to_sampled_minhashing]/bin

And the absolute path of the python/smh directory inside build to Python's path:

export PYTHONPATH=$PYTHONPATH:[absolute_path_to_sampled_minhashing]/build/python/smh

To uninstall the library from your system, run:

sudo make uninstall

Example Usage

Getting the NIPS Corpus

To discover topics from the NIPS corpus using Sampled-MinHashing, first download and extract the corpus to a given location:

wget http://arbylon.net/projects/nips/nips-20110223.zip
unzip nips-20110223.zip

This creates the directory knowceans-ilda/nips where the corpus is located. The file nips.corpus inside this directory contains a database of N lists corresponding to the bag-of-words of the N documents in the corpus. The format of the file is as follows:

size_of_list_1 item1_1:freq1_1 item2_1:freq2_1 ...
size_of_list_2 item1_2:freq1_2 item2_2:freq2_2 ...
...                        ...
size_of_list_N item1_N:freq1_N item2_N:freq2_N ...

For example, for a corpus of 5 documents with a vocabulary of 19 different terms, the file could look like this:

6 3:9 4:8 7:5 12:1 16:5 18:5 
3 2:7 3:4 8:5
4 1:9 2:10 16:8 17:10
4 10:10 11:4 15:8 16:3
3 0:1 14:9 15:10
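
To make the format concrete, here is a minimal sketch of a parser in plain Python. It does not use the smh bindings; `parse_listdb_line` and `parse_listdb` are hypothetical helpers introduced for illustration, and the sketch assumes that each line starts with the list size followed by whitespace-separated item:freq pairs, as described above.

```python
def parse_listdb_line(line):
    """Parse one 'size item:freq item:freq ...' line into (item, freq) pairs."""
    fields = line.split()
    size = int(fields[0])
    pairs = []
    for field in fields[1:]:
        item, freq = field.split(':')
        pairs.append((int(item), int(freq)))
    # The leading number must match the number of item:freq pairs.
    if len(pairs) != size:
        raise ValueError('expected %d items, found %d' % (size, len(pairs)))
    return pairs


def parse_listdb(text):
    """Parse a whole database, one list per non-empty line."""
    return [parse_listdb_line(line) for line in text.splitlines() if line.strip()]


example = """6 3:9 4:8 7:5 12:1 16:5 18:5
3 2:7 3:4 8:5
4 1:9 2:10 16:8 17:10
4 10:10 11:4 15:8 16:3
3 0:1 14:9 15:10
"""

documents = parse_listdb(example)
print(len(documents))   # 5
print(documents[1])     # [(2, 7), (3, 4), (8, 5)]
```

Checking the declared size against the number of parsed pairs catches truncated or malformed lines early.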

Creating Inverted File Structure

To perform topic discovery you'll need to load the corpus and create the inverted file structure. This can be done using the standalone command smhcmd:

smhcmd ifindex knowceans-ilda/nips/nips.corpus knowceans-ilda/nips/nips.ifs

Or from Python:

import smh
corpus = smh.listdb_load('knowceans-ilda/nips/nips.corpus')
ifs = corpus.invert()
ifs.save('knowceans-ilda/nips/nips.ifs')
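
Conceptually, inverting the corpus turns the list of documents into one list per term, each holding the documents in which that term occurs. The following plain-Python sketch shows only that idea; it is not the implementation behind `corpus.invert()` or `smhcmd ifindex`, and the helper name `invert` and the in-memory representation are assumptions made for illustration.

```python
def invert(documents, vocabulary_size):
    """Build an inverted file from a list of documents.

    documents: list of lists of (term_id, freq) pairs, one per document.
    Returns a list with one entry per term: [(doc_id, freq), ...].
    """
    inverted = [[] for _ in range(vocabulary_size)]
    for doc_id, doc in enumerate(documents):
        for term_id, freq in doc:
            inverted[term_id].append((doc_id, freq))
    return inverted


# The 5-document, 19-term example corpus shown earlier.
documents = [
    [(3, 9), (4, 8), (7, 5), (12, 1), (16, 5), (18, 5)],
    [(2, 7), (3, 4), (8, 5)],
    [(1, 9), (2, 10), (16, 8), (17, 10)],
    [(10, 10), (11, 4), (15, 8), (16, 3)],
    [(0, 1), (14, 9), (15, 10)],
]

inverted = invert(documents, vocabulary_size=19)
print(inverted[16])  # [(0, 5), (2, 8), (3, 3)]
print(inverted[5])   # []
```

In this example, term 16 occurs in documents 0, 2 and 3, while term 5 occurs in none, so its list is empty.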

Discovering Topics

Once you have the inverted file, discover topics with the standalone smhcmd command:

smhcmd discover knowceans-ilda/nips/nips.ifs knowceans-ilda/nips/nips.models

From Python:

import smh

# Load the corpus and the inverted file created in the previous step
corpus = smh.listdb_load('knowceans-ilda/nips/nips.corpus')
ifs = smh.listdb_load('knowceans-ilda/nips/nips.ifs')

# Discover the topics and save them to disk
discoverer = smh.SMHDiscoverer()
models = discoverer.fit(ifs, expand=corpus)
models.save('knowceans-ilda/nips/nips.models')

To visualize the topics as sets of terms, load the vocabulary file and map term IDs to terms:

# Map each term ID to its term; each vocabulary line has the form "term = id"
vocabulary = {}
with open('knowceans-ilda/nips/nips.vocab', 'r') as f:
    for line in f:
        tokens = line.split(' = ')
        vocabulary[int(tokens[1])] = tokens[0]

# Convert every discovered model (a list of term IDs) into a list of terms
topics = []
for m in models.ldb:
    terms = []
    for j in m:
        terms.append(vocabulary[j.item])
    topics.append(terms)

Finally, save the lists of terms to a file, one topic per line:

with open('knowceans-ilda/nips/nips.terms', 'wb') as f:
    for t in topics:
        # Binary mode, so the UTF-8 encoded bytes can be written directly
        f.write(' '.join(t).encode('utf8'))
        f.write('\n'.encode('utf8'))
