ClusterCat: Fast, Flexible Word Clustering Software

Licensed under the LGPL v3 and MPL 2.0.

Overview

ClusterCat induces word classes from unannotated text. It is programmed in modern C, with no external libraries. A Python wrapper is also provided.

Word classes are unsupervised part-of-speech tags: they are induced without any manually annotated corpus, and they group together words that share syntactic and semantic similarities. Word classes are used in dozens of applications across natural language processing, machine translation, neural network training, and related fields.

Installation

Linux

You can use either GCC 4.6+ or Clang 3.7+, but GCC is recommended.

  sudo apt-get update  &&  sudo apt-get install gcc make
  make -j 4

macOS / OSX

The current version of Clang in Xcode doesn't fully support OpenMP, so install GCC from Homebrew instead:

  brew update  &&  brew install gcc@9 libomp  &&  xcode-select --install
  make -j 4 CC=/usr/local/bin/gcc-9

Commands

The clustercat binary is compiled into the bin directory.

Clustering preprocessed text (already tokenized, normalized, etc.) is simple:

  bin/clustercat [options] < train.tok.txt > clusters.tsv

The word classes are induced by a bidirectional predictive exchange algorithm. Each line of the output class file consists of a word type, then a tab, then its class (word<TAB>class).
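For illustration, a few lines of output might look like this (hypothetical words and class IDs):

  the	402
  walk	77
  walks	77
  quickly	319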

Command-line argument usage may be obtained by running the program with the --help flag:

  bin/clustercat --help

Python

Installation and usage details for the Python module are described in a separate readme.
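If you prefer not to depend on the wrapper's own API, you can also drive the compiled binary directly from Python. A minimal sketch, assuming the binary has been built at bin/clustercat and that train.tok.txt holds preprocessed training text (both paths are illustrative):

  import subprocess

  # Run clustercat on pre-tokenized text; word<TAB>class pairs go to clusters.tsv
  with open("train.tok.txt", "rb") as fin, open("clusters.tsv", "wb") as fout:
      subprocess.run(["bin/clustercat"], stdin=fin, stdout=fout, check=True)

  # Load the resulting clustering into a dict mapping word -> class ID
  word2class = {}
  with open("clusters.tsv", encoding="utf-8") as f:
      for line in f:
          word, cls = line.rstrip("\n").split("\t")
          word2class[word] = cls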

Features

  • Print word vectors (a.k.a. word embeddings) using the --word-vectors flag. The binary format is compatible with word2vec's tools.
  • Start training from an existing word cluster mapping produced by other clustering software (e.g., mkcls) using the --class-file flag.
  • Adjust the number of threads to use with the --threads flag. The default is 8.
  • Adjust the number of clusters or vector dimensions using the --classes flag. The default is approximately the square root of the vocabulary size.
  • Includes a compatibility wrapper script, bin/mkcls, that can be run just like mkcls. You can use more classes now :-)
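For example, several of these options can be combined in one invocation (flag spellings as listed above; run bin/clustercat --help for the exact argument syntax):

  bin/clustercat --classes 800 --threads 4 < train.tok.txt > clusters.tsv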

Comparison

Approximate training times for several word clustering tools:

| Training set | Brown | ClusterCat | mkcls | Phrasal | word2vec |
|--------------|-------|------------|-------|---------|----------|
| 1 billion English tokens, 800 clusters | 12.5 hr | 1.4 hr | 48.8 hr | 5.1 hr | 20.6 hr |
| 1 billion English tokens, 1,200 clusters | 25.5 hr | 1.7 hr | 68.8 hr | 6.2 hr | 33.7 hr |
| 550 million Russian tokens, 800 clusters | 14.6 hr | 1.5 hr | 75.0 hr | 5.5 hr | 12.0 hr |

Visualization

See bl.ocks.org for nice data visualizations of the clusters for various languages, including English, German, Persian, Hindi, Czech, Catalan, Tajik, Basque, Russian, French, and Maltese.

For example:

[Thumbnails: French, Russian, and Basque clusterings]

You can generate your own graphics from ClusterCat's output. Add the --print-freqs flag to ClusterCat, then run:

  bin/flat_clusters2json.pl --word-labels < clusters.tsv > visualization/d3/clusters.json

You can either upload the JSON file to gist.github.com, following the instructions on the bl.ocks.org front page, or view the graphic locally by running a minimal webserver in the visualization/d3 directory (shown for Python 3; on legacy Python 2, use python -m SimpleHTTPServer 8116):

  python3 -m http.server 8116 2>/dev/null &

Then open localhost:8116 in a browser tab.

The default settings are sensible for normal usage, but for visualization you probably want far fewer word types and clusters: fewer than 10,000 word types and 120 clusters. Your browser will thank you.

Perplexity

The perplexity that ClusterCat reports uses a bidirectional bigram class language model, which is richer than the unidirectional bigram-based perplexities reported by most other software. Richer models provide a better evaluation of cluster quality, since they have more sensitivity (power) to detect improvements. If you want to directly compare the quality of clusters against another program's output, you have a few options:

  1. Load the other clustering using --class-file, and note its initial bidirectional bigram perplexity before any words are exchanged.
  2. Use an external class-based language model. These are usually two-sided (unlexicalized) models, so they favor two-sided clusterers.
  3. Evaluate on a downstream task. This is the best option.
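For reference, the classic one-sided class-based bigram model of Brown et al. (1992) factors each bigram probability through the classes; a perplexity computed from it looks like the following (a sketch of the standard one-sided model, not necessarily ClusterCat's exact bidirectional formulation):

  p(w_i \mid w_{i-1}) = p\big(c(w_i) \mid c(w_{i-1})\big) \, p\big(w_i \mid c(w_i)\big)

  \mathrm{PP} = \exp\!\Big( -\tfrac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_{i-1}) \Big)

A bidirectional model additionally conditions on the class of the following word, which is why its perplexities are not directly comparable to one-sided ones.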

Contributions

Contributions are welcome, via pull requests.

Citation

If you use this software, please cite the following paper:

Dehdari, Jon, Liling Tan, and Josef van Genabith. 2016. BIRA: Improved Predictive Exchange Word Clustering. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 1169–1174, San Diego, CA, USA. Association for Computational Linguistics.

@inproceedings{dehdari-etal2016,
 author    = {Dehdari, Jon  and  Tan, Liling  and  van Genabith, Josef},
 title     = {{BIRA}: Improved Predictive Exchange Word Clustering},
 booktitle = {Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL)},
 month     = {June},
 year      = {2016},
 address   = {San Diego, CA, USA},
 publisher = {Association for Computational Linguistics},
 pages     = {1169--1174},
 url       = {http://www.aclweb.org/anthology/N16-1139.pdf}
}