
AdrienGuille / TOM

Licence: MIT
A library for topic modeling and browsing

Programming Languages

HTML: 75241 projects
Jupyter Notebook: 11667 projects
Python: 139335 projects - #7 most used programming language
CSS: 56736 projects

Projects that are alternatives to or similar to TOM

Topic Modeling Tool
A point-and-click tool for creating and analyzing topic models produced by MALLET.
Stars: ✭ 85 (-6.59%)
Mutual labels:  topic-modeling
Tmtoolkit
Text Mining and Topic Modeling Toolkit for Python with parallel processing power
Stars: ✭ 135 (+48.35%)
Mutual labels:  topic-modeling
Tomotopy
Python package of Tomoto, the Topic Modeling Tool
Stars: ✭ 213 (+134.07%)
Mutual labels:  topic-modeling
Sttm
Short Text Topic Modeling, JAVA
Stars: ✭ 100 (+9.89%)
Mutual labels:  topic-modeling
Scattertext
Beautiful visualizations of how language differs among document types.
Stars: ✭ 1,722 (+1792.31%)
Mutual labels:  topic-modeling
Palmetto
Palmetto is a quality measuring tool for topics
Stars: ✭ 144 (+58.24%)
Mutual labels:  topic-modeling
Stminsights
A Shiny Application for Inspecting Structural Topic Models
Stars: ✭ 74 (-18.68%)
Mutual labels:  topic-modeling
text-analysis
Weaving analytical stories from text data
Stars: ✭ 12 (-86.81%)
Mutual labels:  topic-modeling
Hypertools
A Python toolbox for gaining geometric insights into high-dimensional data
Stars: ✭ 1,678 (+1743.96%)
Mutual labels:  topic-modeling
Familia
A Toolkit for Industrial Topic Modeling
Stars: ✭ 2,499 (+2646.15%)
Mutual labels:  topic-modeling
Learning Social Media Analytics With R
This repository contains code and bonus content which will be added from time to time for the book "Learning Social Media Analytics with R" by Packt
Stars: ✭ 102 (+12.09%)
Mutual labels:  topic-modeling
Lda2vec Pytorch
Topic modeling with word vectors
Stars: ✭ 108 (+18.68%)
Mutual labels:  topic-modeling
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+13925.27%)
Mutual labels:  topic-modeling
Lda Topic Modeling
A PureScript, browser-based implementation of LDA topic modeling.
Stars: ✭ 91 (+0%)
Mutual labels:  topic-modeling
Ldagibbssampling
Open Source Package for Gibbs Sampling of LDA
Stars: ✭ 218 (+139.56%)
Mutual labels:  topic-modeling
Attention Based Aspect Extraction
Code for unsupervised aspect extraction, using Keras and its Backends
Stars: ✭ 75 (-17.58%)
Mutual labels:  topic-modeling
Kate
Code & data accompanying the KDD 2017 paper "KATE: K-Competitive Autoencoder for Text"
Stars: ✭ 135 (+48.35%)
Mutual labels:  topic-modeling
auto-gfqg
Automatic Gap-Fill Question Generation
Stars: ✭ 17 (-81.32%)
Mutual labels:  topic-modeling
Chinese keyphrase extractor
An off-the-shelf tool for Chinese Keyphrase Extraction: a fast tool for extracting key phrases from Chinese text, using only 35 MB of memory
Stars: ✭ 237 (+160.44%)
Mutual labels:  topic-modeling
Lftm
Improving topic models LDA and DMM (one-topic-per-document model for short texts) with word embeddings (TACL 2015)
Stars: ✭ 168 (+84.62%)
Mutual labels:  topic-modeling

TOM

TOM (TOpic Modeling) is a Python 3 library for topic modeling and browsing, licensed under the MIT license. Its objective is to allow for an efficient analysis of a text corpus from start to finish, via the discovery of latent topics. To this end, TOM features functions for preparing and vectorizing a text corpus. It also offers a common interface for two topic models (namely LDA, using either variational inference or Gibbs sampling, and NMF, using alternating least-squares with a projected gradient method), and implements three state-of-the-art methods for estimating the optimal number of topics to model a corpus. In addition, TOM constructs an interactive Web-based browser that makes it easy to explore a topic model and the related corpus.

Check out this brief tutorial: http://mediamining.univ-lyon2.fr/people/guille/tom_tutorial.html

Features

Vector space modeling

  • Feature selection based on word frequency
  • Weighting (see the sketch after this list)
    • tf
    • tf-idf
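
These two weighting schemes are the standard term-frequency and tf-idf vectorizations. As a standalone illustration of the difference (it uses scikit-learn directly, one of the dependencies listed in the Installation section, not TOM's own API):

# Illustrative sketch only; TOM's own vectorization is configured through the
# Corpus class shown in the Usage section below.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = ['the cat sat on the mat', 'the dog sat on the log']

tf = CountVectorizer().fit_transform(documents)      # raw term counts (tf)
tfidf = TfidfVectorizer().fit_transform(documents)   # counts reweighted by inverse document frequency (tf-idf)

print(tf.toarray())
print(tfidf.toarray())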

Topic modeling

  • Latent Dirichlet Allocation
    • Standard variational Bayesian inference (Latent Dirichlet Allocation. Blei et al, 2003)
    • Online variational Bayesian inference (Online learning for Latent Dirichlet Allocation. Hoffman et al, 2010)
    • Collapsed Gibbs sampling (Finding scientific topics. Griffiths & Steyvers, 2004)
  • Non-negative Matrix Factorization (NMF)
    • Alternating least-squares with a projected gradient method (Projected gradient methods for non-negative matrix factorization. Lin, 2007)

Estimating the optimal number of topics

  • Stability analysis (How many topics? Stability analysis for topic models. Greene et al, 2014)
  • Spectral analysis (On finding the natural number of topics with latent dirichlet allocation: Some observations. Arun et al, 2010)
  • Consensus-based analysis (Metagenes and molecular pattern discovery using matrix factorization. Brunet et al, 2004)

Installation

We recommend installing Anaconda (https://www.continuum.io), which will automatically install most of the required dependencies (i.e. pandas, numpy, scipy, scikit-learn, matplotlib, flask). You should then install the lda module (pip install lda). Finally, clone or download this repo and run the following command:

python setup.py install

Alternatively, install it directly, either from PyPI:

pip install tom_lib

or from Conda Cloud:

conda install -c psoriano tom_lib

Usage

We provide two sample programs, topic_model.py (which shows you how to load and prepare a corpus, estimate the optimal number of topics, infer the topic model and then manipulate it) and topic_model_browser.py (which shows you how to generate a topic model browser to explore a corpus), to help you get started using TOM.
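
Assuming the repository has been cloned or downloaded, both sample programs can be run from the repository root, e.g.:

python topic_model.py
python topic_model_browser.py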

Load and prepare a textual corpus

A corpus is a TSV (tab-separated values) file describing documents, formatted as follows: one document per line, with at least three columns, namely id (a number), title (a short text) and text (the full content of the document), e.g.:

id	title	text
1	Document 1's title	This is the full content of document 1.
2	Document 2's title	This is the full content of document 2.
etc.
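
As a minimal sketch, such a file can be produced with pandas (one of the dependencies listed above); the column names are the ones TOM expects, while the toy documents and output path are purely illustrative:

import pandas as pd

# Toy documents; only the id, title and text columns matter to TOM.
docs = pd.DataFrame({
    'id': [1, 2],
    'title': ["Document 1's title", "Document 2's title"],
    'text': ['This is the full content of document 1.',
             'This is the full content of document 2.'],
})
# Write a tab-separated file at an illustrative path.
docs.to_csv('input/raw_corpus.csv', sep='\t', index=False)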

The following code snippet shows how to load a corpus of French documents and vectorize them using tf-idf with unigrams.

# Assumed import paths, following the tom_lib package layout of the sample
# program topic_model.py; the later snippets reuse these imports.
import tom_lib.utils as utils
from tom_lib.structure.corpus import Corpus
from tom_lib.nlp.topic_model import LatentDirichletAllocation, NonNegativeMatrixFactorization
from tom_lib.visualization.visualization import Visualization

corpus = Corpus(source_file_path='input/raw_corpus.csv',
                language='french', 
                vectorization='tfidf', 
                n_gram=1,
                max_relative_frequency=0.8, 
                min_absolute_frequency=4)
print('corpus size:', corpus.size)
print('vocabulary size:', len(corpus.vocabulary))
print('Vector representation of document 0:\n', corpus.vector_for_document(0))

Instantiate a topic model and infer topics

It is possible to instantiate an NMF or LDA object, then infer topics.

NMF:

topic_model = NonNegativeMatrixFactorization(corpus)
topic_model.infer_topics(num_topics=15)

LDA (using either the standard variational Bayesian inference or Gibbs sampling):

topic_model = LatentDirichletAllocation(corpus)
topic_model.infer_topics(num_topics=15, algorithm='variational')
topic_model = LatentDirichletAllocation(corpus)
topic_model.infer_topics(num_topics=15, algorithm='gibbs')

Instantiate a topic model and estimate the optimal number of topics

Here we instantiate an NMF object, then generate plots of the three metrics for estimating the optimal number of topics.

topic_model = NonNegativeMatrixFactorization(corpus)
viz = Visualization(topic_model)
viz.plot_greene_metric(min_num_topics=5, 
                       max_num_topics=50, 
                       tao=10, step=1, 
                       top_n_words=10)
viz.plot_arun_metric(min_num_topics=5, 
                     max_num_topics=50, 
                     iterations=10)
viz.plot_brunet_metric(min_num_topics=5, 
                       max_num_topics=50,
                       iterations=10)
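
To give an intuition of what plot_greene_metric measures, below is a rough, self-contained sketch of the stability idea from Greene et al. (2014): topics are learned on random subsamples of the corpus and compared, via their top words, with topics learned on the full corpus; a good number of topics is one for which this agreement stays high across subsamples. The sketch is a simplification (greedy topic matching, plain scikit-learn NMF, a single value of k, a toy corpus), not TOM's implementation; in practice plot_greene_metric sweeps k over the requested range.

# Simplified stability sketch (an illustration, not TOM's actual code).
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

def top_word_sets(doc_term_matrix, vocabulary, k, top_n=10):
    # Fit a plain NMF with k topics and return each topic's set of top words.
    h = NMF(n_components=k, init='random', random_state=0).fit(doc_term_matrix).components_
    return [set(vocabulary[np.argsort(row)[-top_n:]]) for row in h]

def agreement(reference_topics, sample_topics):
    # Greedy matching: score each reference topic by its best Jaccard match.
    return np.mean([max(len(r & s) / len(r | s) for s in sample_topics)
                    for r in reference_topics])

documents = ['the cat chased the mouse', 'dogs and cats are pets',
             'the mouse ran from the cat', 'the stock market fell sharply',
             'investors sold their stocks', 'markets react to interest rates']
vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(documents)
vocabulary = np.array(vectorizer.get_feature_names_out())

k = 2
reference = top_word_sets(x, vocabulary, k)
rng = np.random.default_rng(0)
sample_size = int(0.8 * x.shape[0])
stability = np.mean([
    agreement(reference, top_word_sets(x[rng.choice(x.shape[0], sample_size, replace=False)],
                                       vocabulary, k))
    for _ in range(10)
])
print('average stability for k = %d: %.3f' % (k, stability))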

Save/load a topic model

To allow reusing previously learned topic models, TOM can save them on disk, as shown below.

utils.save_topic_model(topic_model, 'output/NMF_15topics.tom')
topic_model = utils.load_topic_model('output/NMF_15topics.tom')
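
For instance, a later session can reload the saved model and inspect it without re-running inference (this reuses only the assumed import from the first snippet and the print_topics method shown in the next section):

import tom_lib.utils as utils

topic_model = utils.load_topic_model('output/NMF_15topics.tom')
topic_model.print_topics(num_words=10)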

Print information about a topic model

This code excerpt illustrates how one can manipulate a topic model, e.g. get the topic distribution for a document or the word distribution for a topic.

print('\nTopics:')
topic_model.print_topics(num_words=10)
print('\nTopic distribution for document 0:',
      topic_model.topic_distribution_for_document(0))
print('\nMost likely topic for document 0:',
      topic_model.most_likely_topic_for_document(0))
print('\nFrequency of topics:',
      topic_model.topics_frequency())
print('\nTop 10 most relevant words for topic 2:',
      topic_model.top_words(2, 10))
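
As a small follow-up built only from the methods shown above (assuming corpus.size is the number of documents, as suggested by the first snippet), one can label every document with its most likely topic and write the result to a file; the output path is illustrative:

# Write one line per document: its index and its most likely topic.
with open('output/document_topics.tsv', 'w') as f:
    f.write('id\ttopic\n')
    for doc_id in range(corpus.size):
        f.write('{}\t{}\n'.format(doc_id, topic_model.most_likely_topic_for_document(doc_id)))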

Topic model browser: screenshots

Topic cloud

Topic details

Document details

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].