Latent Semantic Scaling
NOTICE: This R package is renamed from LSS to LSX for CRAN submission.
In quantitative text analysis, the cost of training supervised machine learning models tends to be very high when the corpus is large. LSS is a semisupervised document scaling method that I developed to perform large-scale analysis at low cost. Taking user-provided seed words as weak supervision, it estimates the polarity of words in the corpus by latent semantic analysis and locates documents on a unidimensional scale (e.g. sentiment).
Please read my paper for the algorithm and methodology:
- Watanabe, Kohei. 2020. “Latent Semantic Scaling: A Semisupervised Text Analysis Technique for New Domains and Languages”, Communication Methods and Measures.
How to install
devtools::install_github("koheiw/LSX")
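If the renamed package has already been accepted to CRAN (an assumption; check before running), it can also be installed in the usual way:

install.packages("LSX")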
How to use
LSS estimates the semantic similarity of words based on their surrounding contexts, so an LSS model should be trained on data in which the unit of text is the sentence. The estimation is also affected by noise in the data, such as function words and punctuation marks, so these should be removed. A reasonably large corpus (5,000 or more documents) is required to estimate semantic proximity accurately. The sample corpus contains 10,000 Guardian news articles from 2016.
Fit a LSS model
require(quanteda)
require(LSX) # changed from LSS to LSX
# load the sample corpus (10,000 Guardian news articles from 2016)
corp <- readRDS(url("https://bit.ly/2GZwLcN", "rb"))
toks_sent <- corp %>%
    corpus_reshape("sentences") %>%
    tokens(remove_punct = TRUE) %>%
    tokens_remove(stopwords("en"), padding = TRUE)
dfmt_sent <- toks_sent %>%
    dfm(remove_padding = TRUE) %>%
    dfm_select("^\\p{L}+$", valuetype = "regex", min_nchar = 2) %>%
    dfm_trim(min_termfreq = 5)
# identify words that frequently occur around "econom*" (p < 0.05)
eco <- char_context(toks_sent, "econom*", p = 0.05)
lss <- textmodel_lss(dfmt_sent, as.seedwords(data_dictionary_sentiment),
                     terms = eco, k = 300, cache = TRUE)
## Reading cache file: lss_cache/svds_e5089465ba658d1a.RDS
Sentiment seed words
The seed words are 14 generic sentiment words.
data_dictionary_sentiment
## Dictionary object with 2 key entries.
## - [positive]:
## - good, nice, excellent, positive, fortunate, correct, superior
## - [negative]:
## - bad, nasty, poor, negative, unfortunate, wrong, inferior
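The built-in dictionary is not mandatory: textmodel_lss() also accepts seed words as a named numeric vector, where the sign of each value indicates polarity. A minimal sketch, with purely illustrative seed words:

# illustrative seed words: +1 marks positive seeds, -1 negative seeds
seed <- c("good" = 1, "excellent" = 1, "bad" = -1, "poor" = -1)
lss_custom <- textmodel_lss(dfmt_sent, seed, terms = eco, k = 300)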
Economic sentiment words
Words related to the economy are weighted by sentiment based on their proximity to the seed words.
head(coef(lss), 20) # most positive words
## good shape positive sustainable expecting remain
## 0.10086678 0.08100301 0.07287992 0.06489614 0.06459612 0.06327885
## emerging decent continued challenging asia powerhouse
## 0.06173428 0.06158674 0.05958519 0.05735492 0.05545359 0.05454087
## drag argued china hit stock start
## 0.05430134 0.05425140 0.05269536 0.05213953 0.05177975 0.05162649
## weakening consultancy
## 0.05153202 0.05108261
tail(coef(lss), 20) # most negative words
## raising rise sterling cutting grows shrink
## -0.07002333 -0.07106325 -0.07220668 -0.07389086 -0.07568230 -0.07626922
## implications basic debt policymakers suggested interest
## -0.07767036 -0.07848986 -0.07896652 -0.07970222 -0.08267444 -0.08631343
## unemployment borrowing hike rate rba rates
## -0.08879022 -0.09109017 -0.09224650 -0.09598675 -0.09672486 -0.09754047
## cut negative
## -0.11047689 -0.12472812
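Since coef() returns a named numeric vector, the polarity of particular words can also be looked up directly; the words below are only an example and may not all appear in your fitted model:

coef(lss)[c("growth", "unemployment", "debt")]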
This plot shows that frequent words (“world”, “business”, “uk”) are neutral, while less frequent words such as “borrowing”, “unemployment”, “debt”, “emerging”, “efficient” and “sustainable” are either negative or positive.
textplot_terms(lss,
               highlighted = c("world", "business", "uk",
                               "borrowing", "unemployment", "debt",
                               "emerging", "efficient", "sustainable"))
Result of analysis
In the plots, circles indicate the sentiment of individual news articles and lines are their local average (solid line) with a confidence band (dotted lines). According to the plot, economic sentiment in the Guardian news stories became negative from February to April, but it became more positive in April. As the referendum approached, the newspaper’s sentiment became less stable, although it was close to neutral (the overall mean) on the day of the vote (dashed vertical line).
dfmt <- dfm_group(dfmt_sent)
# predict sentiment scores
pred <- as.data.frame(predict(lss, se.fit = TRUE, newdata = dfmt))
pred$date <- docvars(dfmt, "date")
# smooth LSS scores
pred_sm <- smooth_lss(pred, from = as.Date("2016-01-01"), to = as.Date("2016-12-31"))
# plot trend
plot(pred$date, pred$fit, col = rgb(0, 0, 0, 0.05), pch = 16, ylim = c(-0.5, 0.5),
     xlab = "Time", ylab = "Negative vs. positive", main = "Economic sentiment in the Guardian")
lines(pred_sm$date, pred_sm$fit, type = "l")
lines(pred_sm$date, pred_sm$fit + pred_sm$se.fit * 2, type = "l", lty = 3)
lines(pred_sm$date, pred_sm$fit - pred_sm$se.fit * 2, type = "l", lty = 3)
abline(h = 0, v = as.Date("2016-06-23"), lty = c(1, 2))
text(as.Date("2016-06-23"), 0.4, "Brexit referendum")
Examples
Please read the following papers for how to use LSS in social science research:
- Trubowitz, Peter and Watanabe, Kohei. 2021. “The Geopolitical Threat Index: A Text-Based Computational Approach to Identifying Foreign Threats”, International Studies Quarterly.
- Vydra, Simon and Kantorowicz, Jaroslaw. “Tracing Policy-relevant Information in Social Media: The Case of Twitter before and during the COVID-19 Crisis”, Statistics, Politics and Policy.
- Kinoshita, Hiroko. 2020. “A Quantitative Text Analysis Approach on LGBTQ Issues in Contemporary Indonesia”, Journal of Population and Social Studies.
- Yamao, Dai. 2020. “Re-securitization as Evasion of Responsibility: A Quantitative Text Analysis of Refugee Crisis in Major Arabic Newspapers”, Journal of Population and Social Studies.
- Watanabe, Kohei. 2017. “Measuring News Bias: Russia’s Official News Agency ITAR-TASS’s Coverage of the Ukraine Crisis”, European Journal of Communication.
- Watanabe, Kohei. 2017. “The spread of the Kremlin’s narratives by a western news agency during the Ukraine crisis”, Journal of International Communication.
- Lankina, Tomila and Watanabe, Kohei. 2017. “‘Russian Spring’ or ‘Spring Betrayal’? The Media as a Mirror of Putin’s Evolving Strategy in Ukraine”, Europe-Asia Studies.