go-nlp / dmmclust

Licence: MIT license

dmmclust is a package for clustering short texts, based on Yin and Wang (2014).

Projects that are alternatives of or similar to dmmclust

Mars

Asynchronous Block-Level Storage Replication

Stars: ✭ 168 (+630.43%)

Mutual labels: clustering

Clustergrammer

An interactive heatmap visualization built using D3.js

Stars: ✭ 188 (+717.39%)

Mutual labels: clustering

Gemsec

The TensorFlow reference implementation of 'GEMSEC: Graph Embedding with Self Clustering' (ASONAM 2019).

Stars: ✭ 210 (+813.04%)

Mutual labels: clustering

Flexsearch Server

High-performance FlexSearch Server for Node.js (Cluster)

Stars: ✭ 172 (+647.83%)

Mutual labels: clustering

Dcc

This repository contains the source code and data for reproducing results of Deep Continuous Clustering paper

Stars: ✭ 179 (+678.26%)

Mutual labels: clustering

Timeseries Clustering Vae

Variational Recurrent Autoencoder for timeseries clustering in pytorch

Stars: ✭ 190 (+726.09%)

Mutual labels: clustering

Newsrecommender

A news recommendation system tailored for user communities

Stars: ✭ 164 (+613.04%)

Mutual labels: clustering

Clustering With Deep Learning

Generic implementation for clustering with deep learning : representation learning (DNN) + clustering

Stars: ✭ 236 (+926.09%)

Mutual labels: clustering

Dtwclust

R Package for Time Series Clustering Along with Optimizations for DTW

Stars: ✭ 185 (+704.35%)

Mutual labels: clustering

Keras deep clustering

How to do Unsupervised Clustering with Keras

Stars: ✭ 202 (+778.26%)

Mutual labels: clustering

Micro Cluster

Run multiple micro servers and a front proxy at a time

Stars: ✭ 173 (+652.17%)

Mutual labels: clustering

Gsdmm

GSDMM: Short text clustering

Stars: ✭ 175 (+660.87%)

Mutual labels: clustering

Uci Ml Api

Simple API for UCI Machine Learning Dataset Repository (search, download, analyze)

Stars: ✭ 190 (+726.09%)

Mutual labels: clustering

Rayo.js

Micro framework for Node.js

Stars: ✭ 170 (+639.13%)

Mutual labels: clustering

Spectralcluster

Python re-implementation of the spectral clustering algorithm in the paper "Speaker Diarization with LSTM"

Stars: ✭ 220 (+856.52%)

Mutual labels: clustering

Slot Attention

Implementation of Slot Attention from GoogleAI

Stars: ✭ 168 (+630.43%)

Mutual labels: clustering

Pqkmeans

Fast and memory-efficient clustering

Stars: ✭ 189 (+721.74%)

Mutual labels: clustering

Orange3

🍊 📊 💡 Orange: Interactive data analysis

Stars: ✭ 3,152 (+13604.35%)

Mutual labels: clustering

Clustering.jl

A Julia package for data clustering

Stars: ✭ 227 (+886.96%)

Mutual labels: clustering

Vectorai

Vector AI — A platform for building vector based applications. Encode, query and analyse data using vectors.

Stars: ✭ 195 (+747.83%)

Mutual labels: clustering

DMMClust

Package dmmclust provides functions for clustering short texts, as described by Yin and Wang (2014) in "A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering".

The clustering algorithm is remarkably elegant and simple, leading to a very minimal implementation. This package also exposes some types to allow for extensibility.

Installing

go get -u github.com/go-nlp/dmmclust

This package also provides a Gopkg.toml file for dep users.

This package uses SemVer 2.0 for versioning, and all releases are tagged.

How To Use

package main

import (
	"fmt"
	"log"

	"github.com/go-nlp/dmmclust"
)

// data holds the raw input texts, index-aligned with the documents returned by
// getDocs. data, getDocs and getCorpus are placeholders that the caller supplies.
var data []string

func main() {
	docs := getDocs()
	corp := getCorpus(docs)
	conf := dmmclust.Config{
		K:          10,                  // maximum 10 clusters expected
		Vocabulary: len(corp),           // simple example: the vocab is the same as the corpus size
		Iter:       100,                 // iterate 100 times
		Alpha:      0.0001,              // smaller probability of joining an empty group
		Beta:       0.1,                 // higher probability of joining groups like me
		Score:      dmmclust.Algorithm3, // use Algorithm3 to score
		Sample:     dmmclust.Gibbs,      // use Gibbs to sample
	}

	// One cluster is returned per document: len(clustered) == len(docs)
	clustered, err := dmmclust.FindClusters(docs, conf)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("Clusters:")
	for i, clust := range clustered {
		// Print the cluster ID next to the original text of document i
		fmt.Printf("\t%d: %q\n", clust.ID(), data[i])
	}
}

Hyperparameters

K is the maximum number of clusters expected. The final number of clusters never exceeds K.
Alpha controls the probability of joining an empty group. If Alpha is 0.0, a group that becomes empty stays empty.
Beta controls the probability of joining groups that are similar. If Beta is 0.0, a document will never join a group with which it shares no words. In some cases this is preferable (highly preprocessed inputs, for example).
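
To make the effect of Alpha and Beta concrete, here is a minimal, self-contained sketch of the unnormalised score from equation 4 of Yin and Wang (2014). It does not use this package and is not necessarily how Algorithm3 or Algorithm4 are implemented; clusterStats and score are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

// clusterStats is a hypothetical summary of one cluster (not a dmmclust type).
type clusterStats struct {
	docs      int         // m_z: number of documents in the cluster
	words     int         // n_z: total word occurrences in the cluster
	wordCount map[int]int // n_z^w: occurrences of each word ID in the cluster
}

// score returns the unnormalised probability that doc joins cluster c.
// totalDocs is D, k is K and vocab is V in the paper's notation.
// It assumes the denominators are non-zero (c.words > 0 or beta > 0).
func score(doc []int, c clusterStats, totalDocs, k, vocab int, alpha, beta float64) float64 {
	// Popularity term: larger clusters attract more documents.
	s := (float64(c.docs) + alpha) / (float64(totalDocs-1) + float64(k)*alpha)

	// Similarity term, numerator: the j-th occurrence of word w in the
	// document contributes (n_z^w + beta + j - 1).
	seen := make(map[int]int)
	for _, w := range doc {
		s *= float64(c.wordCount[w]) + beta + float64(seen[w])
		seen[w]++
	}
	// Similarity term, denominator: the i-th word contributes (n_z + V*beta + i - 1).
	for i := range doc {
		s /= float64(c.words) + float64(vocab)*beta + float64(i)
	}
	return s
}

func main() {
	doc := []int{0, 1, 1}
	similar := clusterStats{docs: 2, words: 5, wordCount: map[int]int{0: 2, 1: 3}}
	unrelated := clusterStats{docs: 2, words: 5, wordCount: map[int]int{2: 4, 3: 1}}
	empty := clusterStats{wordCount: map[int]int{}}

	fmt.Println("similar:           ", score(doc, similar, 5, 3, 4, 0.1, 0.1))
	fmt.Println("unrelated:         ", score(doc, unrelated, 5, 3, 4, 0.1, 0.1))
	fmt.Println("unrelated, Beta=0: ", score(doc, unrelated, 5, 3, 4, 0.1, 0))
	fmt.Println("empty, Alpha=0:    ", score(doc, empty, 5, 3, 4, 0, 0.1))
}
```

With Alpha set to 0 the popularity term of an empty cluster is zero, and with Beta set to 0 any word the cluster has never seen zeroes the product, which matches the behaviour described above.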

Playing Well With Other Packages

This package was originally built to play well with lingo, which is why it works on slices of integers. The only preprocessing necessary is converting a sentence into a slice of ints.

The Document interface is defined as:

type Document interface {
	TokenSet() TokenSet
	Len() int
}

TokenSet is simply a []int, where each element is the ID of a word in the corpus. The order is not important in the provided algorithms (Algorithm3 and Algorithm4), but may be important in some other scoring function.
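
As an illustration of that preprocessing step, the sketch below turns sentences into integer-backed documents. To stay self-contained it declares local TokenSet and Document types that mirror the definitions above; real code would use dmmclust.TokenSet and dmmclust.Document. The names intDoc and makeDocs are hypothetical, and the sketch assumes that Len reports the number of tokens and that whitespace tokenisation is good enough.

```go
package main

import (
	"fmt"
	"strings"
)

// TokenSet and Document mirror the definitions described in this README so
// that the sketch compiles on its own.
type TokenSet []int

type Document interface {
	TokenSet() TokenSet
	Len() int
}

// intDoc is a hypothetical Document backed directly by a slice of word IDs.
type intDoc []int

func (d intDoc) TokenSet() TokenSet { return TokenSet(d) }
func (d intDoc) Len() int           { return len(d) }

// makeDocs lowercases each sentence, splits it on whitespace and assigns every
// distinct word an ID in order of first appearance. It returns the documents
// together with the word-to-ID map.
func makeDocs(sentences []string) ([]Document, map[string]int) {
	vocab := make(map[string]int)
	docs := make([]Document, 0, len(sentences))
	for _, s := range sentences {
		var d intDoc
		for _, w := range strings.Fields(strings.ToLower(s)) {
			id, ok := vocab[w]
			if !ok {
				id = len(vocab) // next unused ID
				vocab[w] = id
			}
			d = append(d, id)
		}
		docs = append(docs, d)
	}
	return docs, vocab
}

func main() {
	docs, vocab := makeDocs([]string{"the cat sat", "the dog sat down"})
	for _, d := range docs {
		fmt.Println(d.TokenSet(), d.Len())
	}
	fmt.Println("vocabulary size:", len(vocab))
}
```

The size of the returned map is one plausible value for the Vocabulary field of the configuration.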

Extensibility

This package defines a Scoring Function as type ScoringFn func(doc Document, docs []Document, clusters []Cluster, conf Config) []float64. This allows for custom scoring functions to be used.

There are two scoring algorithms provided: Algorithm3 and Algorithm4. I've been successful at using other scoring algorithms as well.

The sampling function is also customizable. The default is to use Gibbs. I've not had much success with other sampling algorithms.

Contributing

To contribute to this package, file an issue, discuss it, and then send a pull request. Please ensure that tests are provided with any changes.

Licence

This package is MIT licenced.
