Latent Semantic Scaling
NOTICE: This R package is renamed from LSS to LSX for CRAN submission.
In quantitative text analysis, the cost of training supervised machine learning models tends to be very high when the corpus is large. LSS is a semisupervised document scaling method that I developed to perform large-scale analysis at low cost. Taking user-provided seed words as weak supervision, it estimates the polarity of words in the corpus by latent semantic analysis and locates documents on a unidimensional scale (e.g. sentiment).
Please read my paper for the algorithm and methodology:
- Watanabe, Kohei. 2020. “Latent Semantic Scaling: A Semisupervised Text Analysis Technique for New Domains and Languages”, Communication Methods and Measures.
How to install
devtools::install_github("koheiw/LSX")
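If the renamed package has already been accepted to CRAN (an assumption; check before running), it can also be installed in the usual way:

install.packages("LSX")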
How to use
LSS estimates the semantic similarity of words based on their surrounding contexts, so an LSS model should be trained on data in which the unit of text is the sentence. The estimation is also affected by noise in the data, such as function words and punctuation marks, so these should be removed. A reasonably large corpus (5,000 or more documents) is required to estimate semantic proximity accurately. The sample corpus contains 10,000 Guardian news articles from 2016.
Fit a LSS model
require(quanteda)
require(LSX) # changed from LSS to LSX
# load the sample corpus (10,000 Guardian news articles from 2016)
corp <- readRDS(url("https://bit.ly/2GZwLcN", "rb"))
toks_sent <- corp %>%
    corpus_reshape("sentences") %>%
    tokens(remove_punct = TRUE) %>%
    tokens_remove(stopwords("en"), padding = TRUE)
dfmt_sent <- toks_sent %>%
    dfm(remove_padding = TRUE) %>%
    dfm_select("^\\p{L}+$", valuetype = "regex", min_nchar = 2) %>%
    dfm_trim(min_termfreq = 5)
# identify words that frequently occur around "econom*" (p < 0.05)
eco <- char_context(toks_sent, "econom*", p = 0.05)
lss <- textmodel_lss(dfmt_sent, as.seedwords(data_dictionary_sentiment),
                     terms = eco, k = 300, cache = TRUE)
## Reading cache file: lss_cache/svds_e5089465ba658d1a.RDS
Sentiment seed words
The seed words are 14 generic sentiment words.
data_dictionary_sentiment
## Dictionary object with 2 key entries.
## - [positive]:
## - good, nice, excellent, positive, fortunate, correct, superior
## - [negative]:
## - bad, nasty, poor, negative, unfortunate, wrong, inferior
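The built-in dictionary is not mandatory: textmodel_lss() also accepts seed words as a named numeric vector, where the sign of each value indicates polarity. A minimal sketch, with purely illustrative seed words:

# illustrative seed words: +1 marks positive seeds, -1 negative seeds
seed <- c("good" = 1, "excellent" = 1, "bad" = -1, "poor" = -1)
lss_custom <- textmodel_lss(dfmt_sent, seed, terms = eco, k = 300)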
Economic sentiment words
Words related to the economy are weighted by sentiment based on their proximity to the seed words.
head(coef(lss), 20) # most positive words
## good shape positive sustainable expecting remain
## 0.10086678 0.08100301 0.07287992 0.06489614 0.06459612 0.06327885
## emerging decent continued challenging asia powerhouse
## 0.06173428 0.06158674 0.05958519 0.05735492 0.05545359 0.05454087
## drag argued china hit stock start
## 0.05430134 0.05425140 0.05269536 0.05213953 0.05177975 0.05162649
## weakening consultancy
## 0.05153202 0.05108261
tail(coef(lss), 20) # most negative words
## raising rise sterling cutting grows shrink
## -0.07002333 -0.07106325 -0.07220668 -0.07389086 -0.07568230 -0.07626922
## implications basic debt policymakers suggested interest
## -0.07767036 -0.07848986 -0.07896652 -0.07970222 -0.08267444 -0.08631343
## unemployment borrowing hike rate rba rates
## -0.08879022 -0.09109017 -0.09224650 -0.09598675 -0.09672486 -0.09754047
## cut negative
## -0.11047689 -0.12472812
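Since coef() returns a named numeric vector, the polarity of particular words can also be looked up directly; the words below are only an example and may not all appear in your fitted model:

coef(lss)[c("growth", "unemployment", "debt")]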
This plot shows that frequent words (“world”, “business”, “uk”) are neutral, while less frequent words such as “borrowing”, “unemployment”, “debt”, “emerging”, “efficient” and “sustainable” are either negative or positive.
textplot_terms(lss,
               highlighted = c("world", "business", "uk",
                               "borrowing", "unemployment", "debt",
                               "emerging", "efficient", "sustainable"))
Result of analysis
In the plots, circles indicate the sentiment of individual news articles and lines are their local average (solid line) with a confidence band (dotted lines). According to the plot, economic sentiment in the Guardian news stories became negative from February to April, but it became more positive in April. As the referendum approached, the newspaper’s sentiment became less stable, although it was close to neutral (the overall mean) on the day of the vote (dashed vertical line).
dfmt <- dfm_group(dfmt_sent)
# predict sentiment scores
pred <- as.data.frame(predict(lss, se.fit = TRUE, newdata = dfmt))
pred$date <- docvars(dfmt, "date")
# smooth LSS scores
pred_sm <- smooth_lss(pred, from = as.Date("2016-01-01"), to = as.Date("2016-12-31"))
# plot trend
plot(pred$date, pred$fit, col = rgb(0, 0, 0, 0.05), pch = 16, ylim = c(-0.5, 0.5),
     xlab = "Time", ylab = "Negative vs. positive", main = "Economic sentiment in the Guardian")
lines(pred_sm$date, pred_sm$fit, type = "l")
lines(pred_sm$date, pred_sm$fit + pred_sm$se.fit * 2, type = "l", lty = 3)
lines(pred_sm$date, pred_sm$fit - pred_sm$se.fit * 2, type = "l", lty = 3)
abline(h = 0, v = as.Date("2016-06-23"), lty = c(1, 2))
text(as.Date("2016-06-23"), 0.4, "Brexit referendum")
Examples
Please read the following papers for how to use LSS in social science research:
- Trubowitz, Peter and Watanabe, Kohei. 2021. “The Geopolitical Threat Index: A Text-Based Computational Approach to Identifying Foreign Threats”, International Studies Quarterly.
- Vydra, Simon and Kantorowicz, Jaroslaw. “Tracing Policy-relevant Information in Social Media: The Case of Twitter before and during the COVID-19 Crisis”, Statistics, Politics and Policy.
- Kinoshita, Hiroko. 2020. “A Quantitative Text Analysis Approach on LGBTQ Issues in Contemporary Indonesia”, Journal of Population and Social Studies.
- Yamao, Dai. 2020. “Re-securitization as Evasion of Responsibility: A Quantitative Text Analysis of Refugee Crisis in Major Arabic Newspapers”, Journal of Population and Social Studies.
- Watanabe, Kohei. 2017. “Measuring News Bias: Russia’s Official News Agency ITAR-TASS’s Coverage of the Ukraine Crisis”, European Journal of Communication.
- Watanabe, Kohei. 2017. “The spread of the Kremlin’s narratives by a western news agency during the Ukraine crisis”, Journal of International Communication.
- Lankina, Tomila and Watanabe, Kohei. 2017. “‘Russian Spring’ or ‘Spring Betrayal’? The Media as a Mirror of Putin’s Evolving Strategy in Ukraine”, Europe-Asia Studies.