koheiw / LSX

Licence: other
A word embeddings-based semi-supervised model for document scaling

Programming Languages

r
7636 projects

Projects that are alternatives of or similar to LSX

visualization
Text visualization tools
Stars: ✭ 18 (-57.14%)
Mutual labels:  sentiment-analysis, text-analysis
Awesome Sentiment Analysis
Repository with everything necessary for sentiment analysis and related areas
Stars: ✭ 459 (+992.86%)
Mutual labels:  sentiment-analysis, text-analysis
Text mining resources
Resources for learning about Text Mining and Natural Language Processing
Stars: ✭ 358 (+752.38%)
Mutual labels:  sentiment-analysis, text-analysis
Shifterator
Interpretable data visualizations for understanding how texts differ at the word level
Stars: ✭ 209 (+397.62%)
Mutual labels:  sentiment-analysis, text-analysis
workshop-IJTA
Introduction to Japanese text analysis with R
Stars: ✭ 25 (-40.48%)
Mutual labels:  text-analysis, quanteda
text-analysis
Weaving analytical stories from text data
Stars: ✭ 12 (-71.43%)
Mutual labels:  sentiment-analysis, text-analysis
Orange3 Text
🍊 📄 Text Mining add-on for Orange3
Stars: ✭ 83 (+97.62%)
Mutual labels:  sentiment-analysis, text-analysis
TextMood
A Xamarin + IoT + Azure sample that detects the sentiment of incoming text messages, performs sentiment analysis on the text, and changes the color of a Philips Hue lightbulb
Stars: ✭ 52 (+23.81%)
Mutual labels:  sentiment-analysis, text-analysis
quanteda.corpora
A collection of corpora for quanteda
Stars: ✭ 17 (-59.52%)
Mutual labels:  text-analysis, quanteda
restaurant-finder-featureReviews
Build a Flask web application to help users retrieve key restaurant information and feature-based reviews (generated by applying market-basket model – Apriori algorithm and NLP on user reviews).
Stars: ✭ 21 (-50%)
Mutual labels:  sentiment-analysis
NewsMTSC
Target-dependent sentiment classification in news articles reporting on political events. Includes a high-quality data set of over 11k sentences and a state-of-the-art classification model.
Stars: ✭ 54 (+28.57%)
Mutual labels:  sentiment-analysis
stocktwits-sentiment
Stocktwits market sentiment analysis in Python with Keras and TensorFlow.
Stars: ✭ 23 (-45.24%)
Mutual labels:  sentiment-analysis
PBAN-PyTorch
A Position-aware Bidirectional Attention Network for Aspect-level Sentiment Analysis, PyTorch implementation.
Stars: ✭ 33 (-21.43%)
Mutual labels:  sentiment-analysis
wink-sentiment
Accurate and fast sentiment scoring of phrases with #hashtags, emoticons :) & emojis 🎉
Stars: ✭ 51 (+21.43%)
Mutual labels:  sentiment-analysis
SentimentAnalysis
Sentiment Analysis: Deep Bi-LSTM+attention model
Stars: ✭ 32 (-23.81%)
Mutual labels:  sentiment-analysis
bert sa
bert sentiment analysis tensorflow serving with RESTful API
Stars: ✭ 35 (-16.67%)
Mutual labels:  sentiment-analysis
fsauor2018
Fine-grained sentiment analysis of Chinese reviews based on an LSTM network with self-attention
Stars: ✭ 36 (-14.29%)
Mutual labels:  sentiment-analysis
HurdleDMR.jl
Hurdle Distributed Multinomial Regression (HDMR) implemented in Julia
Stars: ✭ 19 (-54.76%)
Mutual labels:  text-analysis
Giveme5W
Extraction of the five journalistic W-questions (5W) from news articles
Stars: ✭ 16 (-61.9%)
Mutual labels:  text-analysis
YelpDatasetSQL
Working with the Yelp Dataset in Azure SQL and SQL Server
Stars: ✭ 16 (-61.9%)
Mutual labels:  text-analysis

Latent Semantic Scaling


NOTICE: This R package is renamed from LSS to LSX for CRAN submission.

In quantitative text analysis, the cost of training supervised machine learning models tends to be very high when the corpus is large. LSS is a semi-supervised document scaling method that I developed to perform large-scale analysis at low cost. Taking user-provided seed words as weak supervision, it estimates the polarity of words in the corpus by latent semantic analysis and locates documents on a unidimensional scale (e.g. sentiment).

Please read my paper for the algorithm and methodology:

How to install

devtools::install_github("koheiw/LSX")

How to use

LSS estimates the semantic similarity of words based on their surrounding contexts, so an LSS model should be trained on data whose text unit is the sentence. It is also affected by noise in the data, such as function words and punctuation marks, so these should be removed. It requires a large corpus of texts (5,000 or more documents) to estimate semantic proximity accurately. The sample corpus contains 10,000 Guardian news articles from 2016.

Fit a LSS model

require(quanteda)
require(LSX) # changed from LSS to LSX
corp <- readRDS(url("https://bit.ly/2GZwLcN", "rb"))
toks_sent <- corp %>% 
    corpus_reshape("sentences") %>% 
    tokens(remove_punct = TRUE) %>% 
    tokens_remove(stopwords("en"), padding = TRUE)
dfmt_sent <- toks_sent %>% 
    dfm(remove_padding = TRUE) %>%
    dfm_select("^\\p{L}+$", valuetype = "regex", min_nchar = 2) %>% 
    dfm_trim(min_termfreq = 5)
eco <- char_context(toks_sent, "econom*", p = 0.05)
lss <- textmodel_lss(dfmt_sent, as.seedwords(data_dictionary_sentiment), 
                     terms = eco, k = 300, cache = TRUE)
## Reading cache file: lss_cache/svds_e5089465ba658d1a.RDS

Sentiment seed words

Seed words are 14 generic sentiment words.

data_dictionary_sentiment
## Dictionary object with 2 key entries.
## - [positive]:
##   - good, nice, excellent, positive, fortunate, correct, superior
## - [negative]:
##   - bad, nasty, poor, negative, unfortunate, wrong, inferior
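
Seed words do not have to come from the built-in dictionary. `textmodel_lss()` also accepts a named numeric vector whose signs encode polarity, which is what `as.seedwords()` produces from a dictionary. A minimal sketch with a hypothetical custom vector (the words below are illustrative, not the package default):

```r
# Custom seed words as a named numeric vector:
# positive words get +1, negative words get -1.
seed <- c("good" = 1, "excellent" = 1, "positive" = 1,
          "bad" = -1, "poor" = -1, "negative" = -1)

# Could be passed in place of as.seedwords(data_dictionary_sentiment):
# lss <- textmodel_lss(dfmt_sent, seed, terms = eco, k = 300)
```

Balancing the positive and negative sides (here, three words each) keeps the scale centered.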

Economic sentiment words

Economic words are weighted in terms of sentiment based on their proximity to the seed words.

head(coef(lss), 20) # most positive words
##        good       shape    positive sustainable   expecting      remain 
##  0.10086678  0.08100301  0.07287992  0.06489614  0.06459612  0.06327885 
##    emerging      decent   continued challenging        asia  powerhouse 
##  0.06173428  0.06158674  0.05958519  0.05735492  0.05545359  0.05454087 
##        drag      argued       china         hit       stock       start 
##  0.05430134  0.05425140  0.05269536  0.05213953  0.05177975  0.05162649 
##   weakening consultancy 
##  0.05153202  0.05108261
tail(coef(lss), 20) # most negative words
##      raising         rise     sterling      cutting        grows       shrink 
##  -0.07002333  -0.07106325  -0.07220668  -0.07389086  -0.07568230  -0.07626922 
## implications        basic         debt policymakers    suggested     interest 
##  -0.07767036  -0.07848986  -0.07896652  -0.07970222  -0.08267444  -0.08631343 
## unemployment    borrowing         hike         rate          rba        rates 
##  -0.08879022  -0.09109017  -0.09224650  -0.09598675  -0.09672486  -0.09754047 
##          cut     negative 
##  -0.11047689  -0.12472812
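
Conceptually, a document's score is the frequency-weighted average of the polarity coefficients of its words, which is what `predict()` computes from these values. A base-R sketch with toy numbers (not output from the model above):

```r
# Toy polarity coefficients, in the style of coef(lss)
polarity <- c(growth = 0.06, recovery = 0.05, debt = -0.08, cut = -0.11)

# Word counts for one hypothetical document
freq <- c(growth = 2, recovery = 1, debt = 1, cut = 0)

# Document score: polarity weighted by word frequency,
# normalized by the document's total word count
score <- sum(freq * polarity) / sum(freq)
score  # 0.0225
```

A document dominated by positively weighted words scores above zero; one dominated by negatively weighted words scores below zero.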

This plot shows that frequent words ("world", "business", "uk") are neutral, while less frequent words such as "borrowing", "unemployment", "debt", "emerging", "efficient" and "sustainable" are either negative or positive.

textplot_terms(lss, 
               highlighted = c("world", "business", "uk",
                               "borrowing", "unemployment", "debt",
                               "emerging", "efficient", "sustainable"))

Result of analysis

In the plot, circles indicate the sentiment of individual news articles, and the lines are their local average (solid line) with a confidence band (dotted lines). According to the plot, economic sentiment in the Guardian news stories turned negative from February, but became more positive again in April. As the referendum approached, the newspaper's sentiment became less stable, although it was close to neutral (the overall mean) on the day of the vote (vertical broken line).

dfmt <- dfm_group(dfmt_sent)

# predict sentiment scores
pred <- as.data.frame(predict(lss, se.fit = TRUE, newdata = dfmt))
pred$date <- docvars(dfmt, "date")

# smooth LSS scores
pred_sm <- smooth_lss(pred, from = as.Date("2016-01-01"), to = as.Date("2016-12-31"))

# plot trend
plot(pred$date, pred$fit, col = rgb(0, 0, 0, 0.05), pch = 16, ylim = c(-0.5, 0.5),
     xlab = "Time", ylab = "Negative vs. positive", main = "Economic sentiment in the Guardian")
lines(pred_sm$date, pred_sm$fit, type = "l")
lines(pred_sm$date, pred_sm$fit + pred_sm$se.fit * 2, type = "l", lty = 3)
lines(pred_sm$date, pred_sm$fit - pred_sm$se.fit * 2, type = "l", lty = 3)
abline(h = 0, v = as.Date("2016-06-23"), lty = c(1, 2))
text(as.Date("2016-06-23"), 0.4, "Brexit referendum")

Examples

Please read the following papers for how to use LSS in social science research:
