
lmcinnes / enstop

License: BSD-2-Clause
Ensemble topic modelling with pLSA

Programming Languages

Python: 139,335 projects (#7 most used programming language)
Jupyter Notebook: 11,667 projects

Projects that are alternatives of or similar to enstop

NMFADMM
A sparsity-aware implementation of "Alternating Direction Method of Multipliers for Non-Negative Matrix Factorization with the Beta-Divergence" (ICASSP 2014).
Stars: ✭ 39 (-62.5%)
Mutual labels:  matrix-factorization, topic-modeling
Awesome Community Detection
A curated list of community detection research papers with implementations.
Stars: ✭ 1,874 (+1701.92%)
Mutual labels:  matrix-factorization, dimensionality-reduction
keras-aquarium
A small collection of models implemented in Keras, including matrix factorization (recommender systems), topic modeling, text classification, etc. Runs on TensorFlow.
Stars: ✭ 14 (-86.54%)
Mutual labels:  matrix-factorization, topic-modeling
LinkedIn Scraper
🙋 A Selenium-based automated program that scrapes profile data, stores it in CSV, follows profiles, and saves them as PDFs.
Stars: ✭ 25 (-75.96%)
Mutual labels:  topic-modeling
hf-experiments
Experiments with Hugging Face 🔬 🤗
Stars: ✭ 37 (-64.42%)
Mutual labels:  topic-modeling
uapca
Uncertainty-aware principal component analysis.
Stars: ✭ 16 (-84.62%)
Mutual labels:  dimensionality-reduction
NLP-paper
🎨🎨 NLP (natural language processing) tutorial 🎨🎨 https://dataxujing.github.io/NLP-paper/
Stars: ✭ 23 (-77.88%)
Mutual labels:  plsa
bnp
Bayesian nonparametric models for Python
Stars: ✭ 17 (-83.65%)
Mutual labels:  topic-modeling
ParametricUMAP paper
Parametric UMAP embeddings for representation and semi-supervised learning. From the paper "Parametric UMAP: learning embeddings with deep neural networks for representation and semi-supervised learning" (Sainburg, McInnes, Gentner, 2020).
Stars: ✭ 132 (+26.92%)
Mutual labels:  dimensionality-reduction
contextualLSTM
Contextual LSTM for deep-learning NLP tasks such as word prediction and word embedding creation
Stars: ✭ 28 (-73.08%)
Mutual labels:  topic-modeling
REGAL
Representation-learning-based graph alignment using implicit matrix factorization and structural embeddings
Stars: ✭ 78 (-25%)
Mutual labels:  matrix-factorization
TopicsExplorer
Explore your own text collection with a topic model – without prior knowledge.
Stars: ✭ 53 (-49.04%)
Mutual labels:  topic-modeling
tomoto-ruby
High performance topic modeling for Ruby
Stars: ✭ 49 (-52.88%)
Mutual labels:  topic-modeling
Ask2Transformers
A framework for textual-entailment-based zero-shot text classification
Stars: ✭ 102 (-1.92%)
Mutual labels:  topic-modeling
recommender system with Python
A recommender system tutorial with Python
Stars: ✭ 106 (+1.92%)
Mutual labels:  matrix-factorization
federated pca
Federated Principal Component Analysis Revisited!
Stars: ✭ 30 (-71.15%)
Mutual labels:  dimensionality-reduction
stmprinter
Print multiple STM model dashboards to a PDF file for inspection
Stars: ✭ 34 (-67.31%)
Mutual labels:  topic-modeling
ml-nlp-services
Machine learning, deep learning, and natural language processing
Stars: ✭ 23 (-77.88%)
Mutual labels:  topic-modeling
Recommendation.jl
Building recommender systems in Julia
Stars: ✭ 42 (-59.62%)
Mutual labels:  matrix-factorization
M-NMF
An implementation of "Community Preserving Network Embedding" (AAAI 2017)
Stars: ✭ 119 (+14.42%)
Mutual labels:  matrix-factorization

EnsTop

EnsTop provides an ensemble-based approach to topic modelling using pLSA. It makes use of a high-performance numba-based pLSA implementation to run multiple bootstrapped topic models in parallel, and then clusters the resulting outputs to determine a set of stable topics. It can then refit the document vectors against these topics to embed documents into the stable topic space.
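For intuition, here is a rough sketch of the procedure just described. This is not EnsTop's actual implementation: sklearn's NMF stands in for the numba-based pLSA, and the HDBSCAN settings are purely illustrative.

import numpy as np
import hdbscan
from sklearn.decomposition import NMF

def ensemble_topics_sketch(X, n_components, n_runs=8, seed=0):
    # Bootstrap the documents, fit one topic model per run, pool the
    # topics, cluster them, and average each stable cluster into a topic.
    rng = np.random.default_rng(seed)
    pooled = []
    for run in range(n_runs):
        idx = rng.integers(0, X.shape[0], size=X.shape[0])  # bootstrap sample
        nmf = NMF(n_components=n_components, init="random",
                  random_state=run, max_iter=200)  # stand-in for numba pLSA
        nmf.fit(X[idx])
        # Normalise rows so each topic is a distribution over words.
        topics = nmf.components_ / nmf.components_.sum(axis=1, keepdims=True)
        pooled.append(topics)
    pooled = np.vstack(pooled)
    # Topics that recur across runs form dense clusters; noise (-1) is dropped.
    labels = hdbscan.HDBSCAN(min_cluster_size=max(2, n_runs // 2)).fit_predict(pooled)
    stable = [pooled[labels == k].mean(axis=0) for k in sorted(set(labels)) if k != -1]
    return np.vstack(stable)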

Why use EnsTop?

There are a number of advantages to using an ensemble approach to topic modelling. The most obvious is that it produces better, more stable topics. A close second, however, is that, by making use of HDBSCAN for clustering topics, it can learn a "natural" number of topics. That is, while the user needs to specify an estimated number of topics, the actual number of topics produced is determined by how many stable topics emerge over many bootstrapped runs. In practice this can be either more or less than the estimated number.
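As a toy illustration (not taken from the documentation), the number of topics a fitted model actually keeps can be read off the shape of components_:

import numpy as np
from scipy.sparse import csr_matrix
from enstop import EnsembleTopics

# Toy count matrix; in practice this would come from CountVectorizer.
rng = np.random.default_rng(42)
data = csr_matrix(rng.poisson(0.2, size=(200, 500)))

model = EnsembleTopics(n_components=10).fit(data)
# The number of topics kept is however many stable clusters emerged,
# which may be more or less than the requested n_components.
print(model.components_.shape[0])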

Despite all of these extra features, the ensemble topic approach is still very efficient, especially in multi-core environments (due to the embarrassingly parallel nature of the ensemble). A run with a reasonably sized ensemble can be completed in around the same time it might take to fit an LDA model, and usually produces superior quality results.

In addition to this, EnsTop comes with a pLSA implementation that can be used standalone (and not as part of an ensemble). So if all you are looking for is a good, fast pLSA implementation (one that can run considerably faster than many LDA implementations), then EnsTop is the library for you.

How to use EnsTop

EnsTop follows the sklearn API (and inherits from sklearn base classes), so if you use sklearn for LDA or NMF then you already know how to use EnsTop. General usage is very straightforward. The following example uses EnsTop to model topics from the classic 20-Newsgroups dataset, using sklearn's CountVectorizer to generate the required count matrix.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from enstop import EnsembleTopics

# Build a document-term count matrix from the 20-Newsgroups corpus.
news = fetch_20newsgroups(subset='all')
data = CountVectorizer().fit_transform(news.data)

# Fit the ensemble model; components_ holds the topic-word distributions,
# embedding_ holds the per-document topic vectors.
model = EnsembleTopics(n_components=20).fit(data)
topics = model.components_
doc_vectors = model.embedding_
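
As a follow-up (not part of the original README, and assuming a recent scikit-learn that provides get_feature_names_out), the learned topics can be inspected by mapping each topic's heaviest weights back through the vectorizer's vocabulary:

import numpy as np

# Keep a handle on the vectorizer so column indices map back to words.
vectorizer = CountVectorizer()
data = vectorizer.fit_transform(news.data)
model = EnsembleTopics(n_components=20).fit(data)

vocab = np.array(vectorizer.get_feature_names_out())
for k, topic in enumerate(model.components_):
    top_words = vocab[np.argsort(topic)[::-1][:10]]
    print(f"Topic {k}: {', '.join(top_words)}")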

How to use pLSA

EnsTop also provides a simple-to-use but fast and effective pLSA implementation out of the box. As with the ensemble topic modeller, it follows the sklearn API, and usage is very similar.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from enstop import PLSA

news = fetch_20newsgroups(subset='all')
data = CountVectorizer().fit_transform(news.data)

model = PLSA(n_components=20).fit(data)
topics = model.components_
doc_vectors = model.embedding_
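
Because the model follows the sklearn API, a fitted PLSA should also be able to embed documents it was not trained on. A minimal sketch, assuming the standard sklearn-style transform method:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from enstop import PLSA

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

# Fit the vocabulary on the training split only.
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train.data)
test_counts = vectorizer.transform(test.data)

model = PLSA(n_components=20).fit(train_counts)
# Embed unseen documents into the already-learned topic space.
new_doc_vectors = model.transform(test_counts)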

Installation

The easiest way to install EnsTop is via pip:

pip install enstop

To manually install this package:

wget https://github.com/lmcinnes/enstop/archive/master.zip
unzip master.zip
rm master.zip
cd enstop-master
python setup.py install

Help and Support

Some basic example notebooks are available here.

Documentation is coming; this project is still very young. If you need help or have problems, please open an issue and I will try to provide any help and guidance that I can. Please also check the docstrings on the code, which provide some description of the parameters.

License

The EnsTop package is 2-clause BSD licensed.

Contributing

Contributions are more than welcome! There are lots of opportunities for potential projects, so please get in touch if you would like to help out. Everything from code to notebooks to examples and documentation is equally valuable, so please don't feel you can't contribute. To contribute, please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.
