
bnosac / BTM

Licence: Apache-2.0
Biterm Topic Modelling for Short Text with R

Programming Languages

C++
R

Projects that are alternatives to or similar to BTM

stripnet
STriP Net: Semantic Similarity of Scientific Papers (S3P) Network
Stars: ✭ 82 (+5.13%)
Mutual labels:  topic-modeling
Ask2Transformers
A Framework for Textual Entailment based Zero Shot text classification
Stars: ✭ 102 (+30.77%)
Mutual labels:  topic-modeling
tomoto-ruby
High performance topic modeling for Ruby
Stars: ✭ 49 (-37.18%)
Mutual labels:  topic-modeling
keras-aquarium
A small collection of models implemented in Keras, including matrix factorization (recommendation systems), topic modeling, text classification, etc. Runs on TensorFlow.
Stars: ✭ 14 (-82.05%)
Mutual labels:  topic-modeling
bnp
Bayesian nonparametric models for python
Stars: ✭ 17 (-78.21%)
Mutual labels:  topic-modeling
TopicsExplorer
Explore your own text collection with a topic model – without prior knowledge.
Stars: ✭ 53 (-32.05%)
Mutual labels:  topic-modeling
auto-gfqg
Automatic Gap-Fill Question Generation
Stars: ✭ 17 (-78.21%)
Mutual labels:  topic-modeling
enstop
Ensemble topic modelling with pLSA
Stars: ✭ 104 (+33.33%)
Mutual labels:  topic-modeling
LinkedIn Scraper
🙋 A Selenium-based automated program that scrapes profile data, stores it in CSV, follows profiles and saves them as PDF.
Stars: ✭ 25 (-67.95%)
Mutual labels:  topic-modeling
contextualLSTM
Contextual LSTM for NLP tasks like word prediction and word embedding creation for Deep Learning
Stars: ✭ 28 (-64.1%)
Mutual labels:  topic-modeling
topic modelling financial news
Topic modelling on financial news with Natural Language Processing
Stars: ✭ 51 (-34.62%)
Mutual labels:  topic-modeling
teanaps
An open-source Python library for natural language processing and text analysis.
Stars: ✭ 91 (+16.67%)
Mutual labels:  topic-modeling
ml
machine learning
Stars: ✭ 29 (-62.82%)
Mutual labels:  topic-modeling
Topic-Modeling-Workshop-with-R
A workshop on analyzing topic modeling (LDA, CTM, STM) using R
Stars: ✭ 51 (-34.62%)
Mutual labels:  topic-modeling
stmprinter
Print multiple stm model dashboards to a pdf file for inspection
Stars: ✭ 34 (-56.41%)
Mutual labels:  topic-modeling
TOM
A library for topic modeling and browsing
Stars: ✭ 91 (+16.67%)
Mutual labels:  topic-modeling
hf-experiments
Experiments with Hugging Face 🔬 🤗
Stars: ✭ 37 (-52.56%)
Mutual labels:  topic-modeling
PyLDA
A Latent Dirichlet Allocation implementation in Python.
Stars: ✭ 51 (-34.62%)
Mutual labels:  topic-modeling
hlda
Gibbs sampler for the Hierarchical Latent Dirichlet Allocation topic model
Stars: ✭ 138 (+76.92%)
Mutual labels:  topic-modeling
ml-nlp-services
Machine learning, deep learning, natural language processing.
Stars: ✭ 23 (-70.51%)
Mutual labels:  topic-modeling

BTM - Biterm Topic Modelling for Short Text with R

This is an R package wrapping the C++ code available at https://github.com/xiaohuiyan/BTM for constructing a Biterm Topic Model (BTM). This model learns topics from word-word co-occurrence patterns (i.e., biterms).

Topic modelling using biterms is particularly well suited for finding topics in short texts, such as short survey answers or Twitter data.

Installation

This R package is available on CRAN; install it with install.packages('BTM').
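
In a fresh R session, that amounts to:

## install the package from CRAN and load it
install.packages("BTM")
library(BTM)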

What

The Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modelling word-word co-occurrence patterns (i.e., biterms).

  • A biterm consists of two words co-occurring in the same context, for example, in the same short text window (see the sketch after this list).
  • BTM models the biterm occurrences in a corpus, unlike LDA, which models the word occurrences in a document.
  • It is a generative model. In the generation procedure, a biterm is generated by drawing two words independently from the same topic z. In other words, the distribution of a biterm b = (wi, wj) is defined as P(b) = sum over topics z of P(wi|z) * P(wj|z) * P(z), where the sum runs over the k topics you want to extract.
  • The topic model is estimated with a Gibbs sampling algorithm, which provides estimates for P(w|z) = phi and P(z) = theta.
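
As a minimal illustration of what a biterm is, the sketch below enumerates all unordered word pairs of one short text. This is illustrative only; the BTM package extracts biterms internally in its C++ code.

extract_biterms <- function(tokens) {
  ## guard against texts with fewer than 2 tokens
  if (length(tokens) < 2)
    return(data.frame(term1 = character(0), term2 = character(0)))
  ## all unordered pairs of tokens within this short text window
  pairs <- combn(tokens, 2)
  data.frame(term1 = pairs[1, ], term2 = pairs[2, ], stringsAsFactors = FALSE)
}
extract_biterms(c("nice", "apartment", "good", "location"))
##       term1     term2
## 1      nice apartment
## 2      nice      good
## 3      nice  location
## 4 apartment      good
## 5 apartment  location
## 6      good  location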

For more detail, see the following paper:

Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model For Short Text. WWW2013. https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf

Example

library(udpipe)
library(BTM)
data("brussels_reviews_anno", package = "udpipe")

## Taking only nouns of Dutch data
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]

## Building the model
set.seed(321)
model  <- BTM(x, k = 3, beta = 0.01, iter = 1000, trace = 100)

## Inspect the model - topic frequency + conditional term probabilities
model$theta
[1] 0.3406998 0.2413721 0.4179281

topicterms <- terms(model, top_n = 10)
topicterms
[[1]]
         token probability
1  appartement  0.06168297
2      brussel  0.04057012
3        kamer  0.02372442
4      centrum  0.01550855
5      locatie  0.01547671
6         stad  0.01229227
7        buurt  0.01181460
8     verblijf  0.01155985
9         huis  0.01111402
10         dag  0.01041345

[[2]]
         token probability
1  appartement  0.05687312
2      brussel  0.01888307
3        buurt  0.01883812
4        kamer  0.01465696
5     verblijf  0.01339812
6     badkamer  0.01285862
7   slaapkamer  0.01276870
8          dag  0.01213928
9          bed  0.01195945
10        raam  0.01164474

[[3]]
         token probability
1  appartement 0.061804812
2      brussel 0.035873377
3      centrum 0.022193831
4         huis 0.020091282
5        buurt 0.019935537
6     verblijf 0.018611710
7     aanrader 0.014614272
8        kamer 0.011447470
9      locatie 0.010902365
10      keuken 0.009448751
scores <- predict(model, newdata = x)
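
The resulting scores object is a matrix with one row per document and one column per topic, giving the probability that the document belongs to each topic (assuming the default aggregation of predict.BTM):

## inspect the document/topic probabilities of the first documents
head(scores)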

Make a specific topic serve as the background

## If you set background = TRUE, the first topic is set to a background topic
## equal to the empirical word distribution.
## This can be used to filter out common words.
set.seed(321)
model      <- BTM(x, k = 5, beta = 0.01, background = TRUE, iter = 1000, trace = 100)
topicterms <- terms(model, top_n = 5)
topicterms
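
Since the first topic is the background, its top terms are simply the most frequent words in the corpus rather than a semantic topic:

## the background topic lists the overall most common words
topicterms[[1]]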

Visualisation of your model

library(textplot)
library(ggraph)
library(concaveman)
## plot the most salient biterms per topic as word networks
plot(model)

Provide your own set of biterms

An interesting use case of this package is to

  • cluster based on part-of-speech tags, e.g. nouns and adjectives which occur in the neighbourhood of one another in the text
  • cluster dependency relationships provided by NLP tools like udpipe (https://CRAN.R-project.org/package=udpipe)

This can be done by providing your own set of biterms to cluster upon.
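
Such a set of biterms is a data.frame with columns doc_id, term1, term2 and optionally cooc (the co-occurrence frequency), as the dependency example below also builds by hand. A minimal toy sketch (document ids and terms are made up):

## hand-built toy biterms object to show the expected structure
biterms <- data.frame(doc_id = c("doc1", "doc1", "doc2"),
                      term1  = c("nice", "good", "clean"),
                      term2  = c("room", "location", "apartment"),
                      cooc   = c(1, 1, 1),
                      stringsAsFactors = FALSE)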

Example clustering cooccurrences of nouns/adjectives

library(data.table)
library(udpipe)
## Annotate text with parts of speech tags
data("brussels_reviews", package = "udpipe")
anno <- subset(brussels_reviews, language %in% "nl")
anno <- data.frame(doc_id = anno$id, text = anno$feedback, stringsAsFactors = FALSE)
anno <- udpipe(anno, "dutch", trace = 10)

## Get cooccurrences of nouns / adjectives and proper nouns
biterms <- as.data.table(anno)
biterms <- biterms[, cooccurrence(x = lemma, 
                                  relevant = upos %in% c("NOUN", "PROPN", "ADJ"),
                                  skipgram = 2), 
                   by = list(doc_id)]
                   
## Build the model
set.seed(123456)
x     <- subset(anno, upos %in% c("NOUN", "PROPN", "ADJ"))
x     <- x[, c("doc_id", "lemma")]
model <- BTM(x, k = 5, beta = 0.01, iter = 2000, background = TRUE, 
             biterms = biterms, trace = 100)
topicterms <- terms(model, top_n = 5)
topicterms

Example clustering dependency relationships

library(udpipe)
library(tm)
library(data.table)
data("brussels_reviews", package = "udpipe")
exclude <- stopwords("nl")

## Do annotation on Dutch text
anno <- subset(brussels_reviews, language %in% "nl")
anno <- data.frame(doc_id = anno$id, text = anno$feedback, stringsAsFactors = FALSE)
anno <- udpipe(anno, "dutch", trace = 10)
anno <- setDT(anno)
anno <- merge(anno, anno, 
              by.x = c("doc_id", "paragraph_id", "sentence_id", "head_token_id"), 
              by.y = c("doc_id", "paragraph_id", "sentence_id", "token_id"), 
              all.x = TRUE, all.y = FALSE, suffixes = c("", "_parent"), sort = FALSE)

## Specify a set of relationships you are interested in (e.g. objects of a verb)
anno$relevant <- anno$dep_rel %in% c("obj") & !is.na(anno$lemma_parent)
biterms <- subset(anno, relevant == TRUE)
biterms <- data.frame(doc_id = biterms$doc_id, 
                      term1 = biterms$lemma, 
                      term2 = biterms$lemma_parent,
                      cooc = 1, 
                      stringsAsFactors = FALSE)
biterms <- subset(biterms, !term1 %in% exclude & !term2 %in% exclude)

## Keep in x only the terms which were used in the biterms object, such that frequency statistics of terms can be computed in BTM
anno <- anno[, keep := relevant | (token_id %in% head_token_id[relevant == TRUE]), by = list(doc_id, paragraph_id, sentence_id)]
x    <- subset(anno, keep == TRUE, select = c("doc_id", "lemma"))
x    <- subset(x, !lemma %in% exclude)

## Build the topic model
model <- BTM(data = x, 
             biterms = biterms, 
             k = 6, iter = 2000, background = FALSE, trace = 100)
topicterms <- terms(model, top_n = 5)
topicterms

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be
