koheiw / seededlda

Licence: other
Semisupervised LDA for theory-driven text analysis

Programming languages: R, C++

Seeded-LDA for semisupervised topic modeling

seededlda is an R package that implements seeded-LDA for semisupervised topic modeling using quanteda. The seeded-LDA model was proposed by Lu et al. (2010). Until version 0.3, the package was a simple wrapper around the topicmodels package, but the LDA estimator has since been reimplemented in C++ using the GibbsLDA++ library, with submission to CRAN planned for August 2020. The author believes that this implementation follows the original proposal more closely.

Please see Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches for an overview of semisupervised topic classification techniques and their advantages in social science research.

keyATM is the latest addition to the family of semisupervised topic models. Users of seeded-LDA are encouraged to try that package as well.

Install

install.packages("devtools")
devtools::install_github("koheiw/seededlda") 

Example

The corpus and seed words in this example are from Conspiracist propaganda: How Russia promotes anti-establishment sentiment online?.

require(quanteda)
require(seededlda)

Users of seeded-LDA must provide a small dictionary of keywords (seed words) to define the desired topics.

dict <- dictionary(file = "tests/data/topics.yml")
print(dict)
## Dictionary object with 5 key entries.
## - [economy]:
##   - market*, money, bank*, stock*, bond*, industry, company, shop*
## - [politics]:
##   - parliament*, congress*, white house, party leader*, party member*, voter*, lawmaker*, politician*
## - [society]:
##   - police, prison*, school*, hospital*
## - [diplomacy]:
##   - ambassador*, diplomat*, embassy, treaty
## - [military]:
##   - military, soldier*, terrorist*, air force, marine, navy, army
corp <- readRDS("tests/data/data_corpus_sputnik.RDS")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE) %>%
        tokens_select(min_nchar = 2) %>% 
        tokens_compound(dict) # for multi-word expressions
dfmt <- dfm(toks) %>% 
    dfm_remove(stopwords('en')) %>% 
    dfm_trim(min_termfreq = 0.90, termfreq_type = "quantile", 
             max_docfreq = 0.2, docfreq_type = "prop")
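
The seed dictionary can also be defined inline instead of being loaded from a YAML file. The following is a minimal sketch that assumes the entries printed above are the full content of topics.yml; the object name dict_inline is illustrative and not part of the package.

```r
library(quanteda)

# Seed words copied by hand from the printed dictionary above.
# "*" is a glob wildcard, so "bank*" covers "bank", "banks", "banking", etc.
dict_inline <- dictionary(list(
    economy = c("market*", "money", "bank*", "stock*", "bond*",
                "industry", "company", "shop*"),
    politics = c("parliament*", "congress*", "white house", "party leader*",
                 "party member*", "voter*", "lawmaker*", "politician*"),
    society = c("police", "prison*", "school*", "hospital*"),
    diplomacy = c("ambassador*", "diplomat*", "embassy", "treaty"),
    military = c("military", "soldier*", "terrorist*", "air force",
                 "marine", "navy", "army")
))
print(dict_inline)
```

Multi-word seeds such as "white house" are joined into single tokens (white_house) by the tokens_compound() step in the pipeline above, so they appear as ordinary features in the document-feature matrix.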

Many of the top terms of the seeded-LDA topics are seed words, but related topic words are also identified. The result includes “other” as a residual (junk) topic because residual = TRUE.

set.seed(1234)
slda <- textmodel_seededlda(dfmt, dict, residual = TRUE)
print(terms(slda, 20))
##       economy     politics        society           diplomacy   
##  [1,] "company"   "parliament"    "police"          "diplomatic"
##  [2,] "money"     "congress"      "school"          "embassy"   
##  [3,] "market"    "white_house"   "hospital"        "ambassador"
##  [4,] "bank"      "politicians"   "prison"          "treaty"    
##  [5,] "industry"  "parliamentary" "schools"         "diplomat"  
##  [6,] "banks"     "lawmakers"     "pic.twitter.com" "diplomats" 
##  [7,] "markets"   "voters"        "media"           "like"      
##  [8,] "banking"   "lawmaker"      "reported"        "just"      
##  [9,] "stock"     "politician"    "local"           "now"       
## [10,] "stockholm" "minister"      "information"     "think"     
## [11,] "china"     "european"      "video"           "even"      
## [12,] "percent"   "sanctions"     "public"          "trump"     
## [13,] "chinese"   "eu"            "social"          "going"     
## [14,] "economic"  "political"     "court"           "made"      
## [15,] "india"     "party"         "women"           "years"     
## [16,] "year"      "foreign"       "man"             "way"       
## [17,] "oil"       "prime"         "report"          "say"       
## [18,] "project"   "union"         "found"           "want"      
## [19,] "billion"   "moscow"        "investigation"   "many"      
## [20,] "million"   "trump"         "department"      "really"    
##       military        other      
##  [1,] "army"          "north"    
##  [2,] "terrorist"     "nuclear"  
##  [3,] "navy"          "korea"    
##  [4,] "terrorists"    "south"    
##  [5,] "air_force"     "iran"     
##  [6,] "soldiers"      "trump"    
##  [7,] "marine"        "korean"   
##  [8,] "soldier"       "world"    
##  [9,] "defense"       "israel"   
## [10,] "syria"         "deal"     
## [11,] "syrian"        "saudi"    
## [12,] "forces"        "kim"      
## [13,] "security"      "show"     
## [14,] "nato"          "israeli"  
## [15,] "weapons"       "agreement"
## [16,] "daesh"         "program"  
## [17,] "turkish"       "cup"      
## [18,] "turkey"        "trump's"  
## [19,] "international" "japan"    
## [20,] "group"         "peace"
topic <- table(topics(slda))
print(topic)
## 
##   economy  politics   society diplomacy  military     other 
##       136       181       262       158       144       119
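
Two quick checks on the output above can be done in base R. The sketch below is only an illustration: the seed patterns, top terms and document counts are copied by hand from the printed output, so it runs without the corpus or the fitted model, and all object names are hypothetical.

```r
# Seed patterns for the "economy" topic (glob style, as in the dictionary).
seeds_economy <- c("market*", "money", "bank*", "stock*", "bond*",
                   "industry", "company", "shop*")

# Top 20 terms of the "economy" topic, copied from print(terms(slda, 20)).
top_economy <- c("company", "money", "market", "bank", "industry",
                 "banks", "markets", "banking", "stock", "stockholm",
                 "china", "percent", "chinese", "economic", "india",
                 "year", "oil", "project", "billion", "million")

# Convert the glob patterns to anchored regular expressions and test each term.
seed_regex <- paste(utils::glob2rx(seeds_economy), collapse = "|")
is_seed <- grepl(seed_regex, top_economy)

print(top_economy[is_seed])   # terms matched by a seed pattern
print(top_economy[!is_seed])  # related terms that are not seed words

# Document counts per topic, copied from print(topic), converted to shares.
topic_counts <- c(economy = 136, politics = 181, society = 262,
                  diplomacy = 158, military = 144, other = 119)
topic_share <- round(topic_counts / sum(topic_counts), 3)
print(topic_share)
```

In this check, the first ten economy terms match a seed pattern and the remaining ten do not. Note that "stockholm" matches the pattern stock*, which shows how a broad wildcard can pull in an unintended word.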

Examples

Please read the following papers for how to apply seeded-LDA in social science research:

Curini, Luigi and Vignoli, Valerio. 2021. Committed Moderates and Uncommitted Extremists: Ideological Leaning and Parties’ Narratives on Military Interventions in Italy, Foreign Policy Analysis.
