juliasilge / Tidylo

Weighted tidy log odds ratio ⚖️

Projects that are alternatives of or similar to Tidylo

Tidy
Tidy up your data with JavaScript, inspired by dplyr and the tidyverse
Stars: ✭ 307 (+395.16%)
Mutual labels:  tidyverse
Tidyquant
Bringing financial analysis to the tidyverse
Stars: ✭ 635 (+924.19%)
Mutual labels:  tidyverse
Janitor
simple tools for data cleaning in R
Stars: ✭ 981 (+1482.26%)
Mutual labels:  tidyverse
Timetk
A toolkit for working with time series in R
Stars: ✭ 371 (+498.39%)
Mutual labels:  tidyverse
Forcats
🐈🐈🐈🐈: tools for working with categorical variables (factors)
Stars: ✭ 441 (+611.29%)
Mutual labels:  tidyverse
Nanny
A tidyverse suite for (pre-) machine-learning: cluster, PCA, permute, impute, rotate, redundancy, triangular, smart-subset, abundant and variable features.
Stars: ✭ 17 (-72.58%)
Mutual labels:  tidyverse
chilemapas
Terrestrial maps of Chile with simplified topologies.
Stars: ✭ 23 (-62.9%)
Mutual labels:  tidyverse
Tidyverse
Easily install and load packages from the tidyverse
Stars: ✭ 1,015 (+1537.1%)
Mutual labels:  tidyverse
Moderndive book
Statistical Inference via Data Science: A ModernDive into R and the Tidyverse
Stars: ✭ 527 (+750%)
Mutual labels:  tidyverse
Tidytext
Text mining using tidy tools ✨📄✨
Stars: ✭ 975 (+1472.58%)
Mutual labels:  tidyverse
Tidygraph
A tidy API for graph manipulation
Stars: ✭ 398 (+541.94%)
Mutual labels:  tidyverse
Tidylog
Tidylog provides feedback about dplyr and tidyr operations. It provides wrapper functions for the most common functions, such as filter, mutate, select, and group_by, and provides detailed output for joins.
Stars: ✭ 428 (+590.32%)
Mutual labels:  tidyverse
Tidymv
Tidy Model Visualisation for Generalised Additive Models
Stars: ✭ 25 (-59.68%)
Mutual labels:  tidyverse
Statistical rethinking with brms ggplot2 and the tidyverse
The bookdown version lives here: https://bookdown.org/content/3890
Stars: ✭ 350 (+464.52%)
Mutual labels:  tidyverse
Intro spatialr
Introduction to GIS and mapping in R with the sf package
Stars: ✭ 39 (-37.1%)
Mutual labels:  tidyverse
rfordatasciencewiki
Resources for the R4DS Online Learning Community, including answer keys to the text
Stars: ✭ 40 (-35.48%)
Mutual labels:  tidyverse
Talks
Repository of publicly available talks by Leon Eyrich Jessen, PhD. Talks cover Data Science and R in the context of research
Stars: ✭ 16 (-74.19%)
Mutual labels:  tidyverse
R for data science
Materials for teaching R and tidyverse
Stars: ✭ 54 (-12.9%)
Mutual labels:  tidyverse
Ggplot Courses
👨‍🏫 ggplot2 Teaching Material
Stars: ✭ 40 (-35.48%)
Mutual labels:  tidyverse
Tidy Text Mining
Manuscript of the book "Tidy Text Mining with R" by Julia Silge and David Robinson
Stars: ✭ 961 (+1450%)
Mutual labels:  tidyverse

tidylo: Weighted Tidy Log Odds Ratio

Authors: Julia Silge, Alex Hayes, Tyler Schnoebelen
License: MIT

Badges: CRAN status · R build status · Travis build status · AppVeyor build status · Codecov test coverage · Lifecycle: experimental

How can we measure how the usage or frequency of some feature, such as words, differs across some group or set, such as documents? One option is the log odds ratio, but the log odds ratio alone does not account for sampling variability; we haven't counted every feature the same number of times, so how do we know which differences are meaningful?

Enter the weighted log odds, for which tidylo provides an implementation using tidy data principles. In particular, tidylo uses the method outlined by Monroe, Colaresi, and Quinn (2008) to weight the log odds ratio by a prior. By default, the prior is estimated from the data itself (an empirical Bayes approach), but an uninformative prior is also available.
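To build intuition for the weighting, here is a toy calculation in the spirit of Monroe, Colaresi, and Quinn (2008). This is a rough sketch only, not tidylo's exact internals, and all the counts below are invented for illustration:

```r
# Toy sketch of a weighted log odds z-score with an empirical Bayes prior.
# Illustrative only; not tidylo's exact implementation.

y_iw  <- 10   # count of feature w in group i
n_i   <- 12   # total feature count in group i
y_w   <- 15   # count of feature w across all groups
n_all <- 25   # total feature count across all groups

# Empirical Bayes: the Dirichlet prior pseudo-counts come from the
# overall counts in the data itself
alpha_w <- y_w
alpha0  <- n_all

# Odds of feature w within group i, and overall, with prior pseudo-counts
omega_iw  <- (y_iw + alpha_w) / (n_i + alpha0 - y_iw - alpha_w)
omega_all <- (y_w + alpha_w) / (n_all + alpha0 - y_w - alpha_w)

# Difference in log odds, its approximate variance, then standardize
delta  <- log(omega_iw) - log(omega_all)
sigma2 <- 1 / (y_iw + alpha_w) + 1 / (y_w + alpha_w)
z      <- delta / sqrt(sigma2)

round(z, 3)
#> [1] 1.213
```

A positive z-score means the feature is used relatively more in that group than overall; dividing by the standard deviation is what downweights differences that rest on small counts.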

Installation

You can install the released version of tidylo from CRAN with:

install.packages("tidylo")

Or you can install the development version from GitHub with remotes:

library(remotes)
install_github("juliasilge/tidylo", ref = "main")

Example

Using weighted log odds is a great approach for text analysis when we want to measure how word usage differs across a set of documents. Let's explore the six published, completed novels of Jane Austen and use the tidytext package to count up the bigrams (sequences of two adjacent words) in each novel. This weighted log odds approach would work equally well for single words.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(janeaustenr)
library(tidytext)

tidy_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigram_counts <- tidy_bigrams %>%
  count(book, bigram, sort = TRUE)

bigram_counts
#> # A tibble: 328,495 x 3
#>    book                bigram     n
#>    <fct>               <chr>  <int>
#>  1 Mansfield Park      of the   748
#>  2 Mansfield Park      to be    643
#>  3 Emma                to be    607
#>  4 Mansfield Park      in the   578
#>  5 Emma                of the   566
#>  6 Pride & Prejudice   of the   464
#>  7 Emma                it was   448
#>  8 Emma                in the   446
#>  9 Pride & Prejudice   to be    443
#> 10 Sense & Sensibility to be    436
#> # … with 328,485 more rows

Now let's use the bind_log_odds() function from the tidylo package to find the weighted log odds for each bigram. The weighted log odds computed by this function are also z-scores for the log odds; this quantity is useful for comparing frequencies across categories or sets, but its relationship to an odds ratio is not straightforward after the weighting.

What are the bigrams with the highest weighted log odds for these books?

library(tidylo)

bigram_log_odds <- bigram_counts %>%
  bind_log_odds(book, bigram, n) 

bigram_log_odds %>%
  arrange(-log_odds_weighted)
#> # A tibble: 328,495 x 4
#>    book                bigram                n log_odds_weighted
#>    <fct>               <chr>             <int>             <dbl>
#>  1 Mansfield Park      sir thomas          287              28.3
#>  2 Pride & Prejudice   mr darcy            243              27.7
#>  3 Emma                mr knightley        269              27.5
#>  4 Emma                mrs weston          229              25.4
#>  5 Sense & Sensibility mrs jennings        199              25.2
#>  6 Persuasion          captain wentworth   170              25.1
#>  7 Mansfield Park      miss crawford       215              24.5
#>  8 Persuasion          mr elliot           147              23.3
#>  9 Emma                mr elton            190              23.1
#> 10 Emma                miss woodhouse      162              21.3
#> # … with 328,485 more rows

The bigrams more likely to come from each book, compared to the others, involve proper nouns. We can make a visualization as well.

library(ggplot2)

bigram_log_odds %>%
  group_by(book) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(bigram = reorder(bigram, log_odds_weighted)) %>%
  ggplot(aes(bigram, log_odds_weighted, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, scales = "free") +
  coord_flip() +
  labs(x = NULL)
#> Selecting by log_odds_weighted
(Plot from chunk bigram_plot: the ten bigrams with the highest weighted log odds in each novel.)

Community Guidelines

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms. Feedback, bug reports (and fixes!), and feature requests are welcome; file issues or seek support here.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].