ivan-rivera / RedditExtractor

License: GPL-3.0
A minimalistic R wrapper for the Reddit API

Programming language: R

Projects that are alternatives to or similar to RedditExtractor

saveddit
Bulk Downloader for Reddit
Stars: ✭ 130 (+124.14%)
Mutual labels:  scraper, reddit
cat-message
Finds cat images/videos/gifs on reddit, sends them to my mom via applescript
Stars: ✭ 35 (-39.66%)
Mutual labels:  scraper, reddit
scripts
A collection of random scripts I coded up
Stars: ✭ 17 (-70.69%)
Mutual labels:  scraper, reddit
subreddit-comments-dl
Download subreddit comments
Stars: ✭ 57 (-1.72%)
Mutual labels:  scraper, reddit
Redditdownloader
Scrapes Reddit to download media of your choice.
Stars: ✭ 521 (+798.28%)
Mutual labels:  scraper, reddit
Spam Bot 3000
Social media research and promotion, semi-autonomous CLI bot
Stars: ✭ 79 (+36.21%)
Mutual labels:  scraper, reddit
Skraper
Kotlin/Java library and cli tool for scraping posts and media from various sources with neither authorization nor full page rendering (Facebook, Instagram, Twitter, Youtube, Tiktok, Telegram, Twitch, Reddit, 9GAG, Pinterest, Flickr, Tumblr, IFunny, VK, Pikabu)
Stars: ✭ 72 (+24.14%)
Mutual labels:  scraper, reddit
Media Scraper
Scrapes all photos and videos in a web page / Instagram / Twitter / Tumblr / Reddit / pixiv / TikTok
Stars: ✭ 206 (+255.17%)
Mutual labels:  scraper, reddit
Polite
Be nice on the web
Stars: ✭ 253 (+336.21%)
Mutual labels:  scraper
ac-react-reddit
A Reddit client built with React.js, next.js and styled-components. https://ac-react-reddit.herokuapp.com
Stars: ✭ 38 (-34.48%)
Mutual labels:  reddit
Instagram Proxy Api
CORS compliant API to access Instagram's public data
Stars: ✭ 245 (+322.41%)
Mutual labels:  scraper
TwitterScraper
Scrape a User's Twitter data! Bypass the 3,200 tweet API limit for a User!
Stars: ✭ 80 (+37.93%)
Mutual labels:  scraper
TradeTheEvent
Implementation of "Trade the Event: Corporate Events Detection for News-Based Event-Driven Trading." In Findings of ACL2021
Stars: ✭ 64 (+10.34%)
Mutual labels:  scraper
Heroku ebooks
A script to generate Markov chains and to post to an _ebooks account on Twitter using Heroku
Stars: ✭ 251 (+332.76%)
Mutual labels:  scraper
reddit-place-2017
Archive of Reddit's r/place data, history and images
Stars: ✭ 50 (-13.79%)
Mutual labels:  reddit
Scrape Linkedin Selenium
`scrape_linkedin` is a python package that allows you to scrape personal LinkedIn profiles & company pages - turning the data into structured json.
Stars: ✭ 239 (+312.07%)
Mutual labels:  scraper
Getsy
A simple browser/client-side web scraper.
Stars: ✭ 238 (+310.34%)
Mutual labels:  scraper
scrapetube
Get all videos from a youtube channel, get all videos from a playlist, get all videos that match a search
Stars: ✭ 120 (+106.9%)
Mutual labels:  scraper
google-scraper
This class can retrieve search results from Google.
Stars: ✭ 33 (-43.1%)
Mutual labels:  scraper
crypto-subreddits-cli
👽 Track Cryptocurrency Subreddits On The Command Line 👽
Stars: ✭ 24 (-58.62%)
Mutual labels:  reddit

Summary

Reddit Extractor is an R package for extracting data from Reddit. It allows you to:

  1. find subreddits based on a search query
  2. find a user and their Reddit history
  3. find URLs to threads of interest and retrieve comments out of these threads

Installation

The package can be installed directly from CRAN with install.packages("RedditExtractoR"), or from GitHub via devtools::install_github('ivan-rivera/RedditExtractor'). Note that the latest version of this package requires R 4.1 or later. If you have an earlier version of R and are not ready to upgrade, you can install an earlier version of the package with devtools::install_version("RedditExtractoR", version = "2.1.5", repos = "http://cran.us.r-project.org"). Beware that version 3+ introduces significant breaking changes!
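
The same options, collected as runnable R commands (devtools is assumed to be installed for the GitHub and versioned installs):

# install the current release from CRAN
install.packages("RedditExtractoR")

# or install the development version from GitHub
devtools::install_github("ivan-rivera/RedditExtractor")

# or pin the last 2.x release if you are on R < 4.1
devtools::install_version("RedditExtractoR", version = "2.1.5", repos = "http://cran.us.r-project.org")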

Quick start

Let's suppose that we'd like to get the top posts from the r/cats subreddit. After importing the package with library(RedditExtractoR), here is how to do it:

top_cats_urls <- find_thread_urls(subreddit="cats", sort_by="top")
str(top_cats_urls)
# 'data.frame':	999 obs. of  6 variables:
# $ date_utc : chr  "2021-08-15" "2021-08-13" "2021-08-11" "2021-08-07" ...
# $ title    : chr  "This went on for over 5 min" "Found this friendly stray a few months back. Now she lives with me and is doing awesome but still doesn't ha"| __truncated__ "Let's wake up now" "Meet the kitty sisters" ...
# $ text     : chr  "" "" "" "" ...
# $ subreddit: chr  "cats" "cats" "cats" "cats" ...
# $ comments : num  12 121 7 11 6 9 17 6 3 28 ...
# $ url      : chr  "https://www.reddit.com/r/cats/comments/p4zwkx/this_went_on_for_over_5_min/" "https://www.reddit.com/r/cats/comments/p3no3t/found_this_friendly_stray_a_few_months_back_now/" "https://www.reddit.com/r/cats/comments/p21kyf/lets_wake_up_now/" "https://www.reddit.com/r/cats/comments/oztuiq/meet_the_kitty_sisters/" ...

Note that we simply got the top threads from a subreddit of interest. We could also search for threads by keyword, e.g. find_thread_urls(keywords="cute kittens"), as sketched below.
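
A hedged sketch of such a keyword search; the sort_by value and the period argument (assumed here to accept time windows such as "month") are illustrative, so check ?find_thread_urls for the exact options:

# search by keywords rather than listing a subreddit; period is assumed
# to restrict results to a recent time window
kitten_urls <- find_thread_urls(keywords = "cute kittens", sort_by = "new", period = "month")
str(kitten_urls)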

In some situations this could well be all you are after, but in most cases you'll probably want to parse these URLs and retrieve their metadata and comments. Here we go:

threads_contents <- get_thread_content(top_cats_urls$url[1:2]) # for the sake of simplicity
str(threads_contents$threads) # thread metadata
# 'data.frame':	2 obs. of  14 variables:
# $ url                  : chr  "https://www.reddit.com/r/cats/comments/p4zwkx/this_went_on_for_over_5_min/" "https://www.reddit.com/r/cats/comments/p3no3t/found_this_friendly_stray_a_few_months_back_now/"
# $ author               : chr  "CasterQ" "Just2063"
# $ date                 : chr  "2021-08-15" "2021-08-13"
# $ title                : chr  "This went on for over 5 min" "Found this friendly stray a few months back. Now she lives with me and is doing awesome but still doesn't ha"| __truncated__
# $ text                 : chr  "" ""
# $ subreddit            : chr  "cats" "cats"
# $ score                : num  320 322
# $ upvotes              : num  320 322
# $ downvotes            : num  0 0
# $ up_ratio             : num  1 0.99
# $ total_awards_received: num  2 1
# $ golds                : num  0 0
# $ cross_posts          : num  0 0
# $ comments             : num  12 121
str(threads_contents$comments)
# 'data.frame':	132 obs. of  9 variables:
# $ url       : chr  "https://www.reddit.com/r/cats/comments/p4zwkx/this_went_on_for_over_5_min/" "https://www.reddit.com/r/cats/comments/p4zwkx/this_went_on_for_over_5_min/" "https://www.reddit.com/r/cats/comments/p4zwkx/this_went_on_for_over_5_min/" "https://www.reddit.com/r/cats/comments/p4zwkx/this_went_on_for_over_5_min/" ...
# $ author    : chr  "stinkadinkalink" "CasterQ" "Future_Branch_8629" "dancingwithpenguins" ...
# $ date      : chr  "2021-08-15" "2021-08-15" "2021-08-15" "2021-08-15" ...
# $ score     : num  21 10 6 3 5 3 13 2 2 2 ...
# $ upvotes   : num  21 10 6 3 5 3 13 2 2 2 ...
# $ downvotes : num  0 0 0 0 0 0 0 0 0 0 ...
# $ golds     : num  0 0 0 0 0 0 0 0 0 0 ...
# $ comment   : chr  "such a cute cat but omg how many monitors do they have!!!! that look like nasa type set up" "Haha my husband is a computer software engineer the extra monitors are for his work" "Totally came here to say I have monitor jealousy. Glad he can take a break with his buddy." "=\002 this is so cute!=;" ...
# $ comment_id: chr  "1" "1_1" "1_1_1" "2" ...

If you'd like to join comments to their parent threads, you can do so on the url column, as sketched below.
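
A minimal sketch of that join using base R's merge(); the column suffixes are illustrative:

# join each comment to its parent thread's metadata on the shared url column
comments_with_threads <- merge(
  threads_contents$comments,
  threads_contents$threads,
  by = "url",
  suffixes = c("_comment", "_thread")
)
str(comments_with_threads)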

Sometimes you might actually be looking for subreddits rather than threads; if so, we've got you covered too. Let's assume that we are trying to find subreddits about cats:

cat_subreddits <- find_subreddits("cats")
str(cat_subreddits)
# 'data.frame':	248 obs. of  6 variables:
# $ id         : chr  "3gl3k" "2vi0z" "30tmh" "2tteh" ...
# $ date_utc   : chr  "2016-09-28" "2012-11-08" "2014-03-05" "2012-03-29" ...
# $ subreddit  : chr  "MemeEconomy" "Awwducational" "TwoSentenceHorror" "Justrolledintotheshop" ...
# $ title      : chr  "MemeEconomy" "Awwducational" "Two-Sentence Horror Stories: Bite-sized scares. " "Just Rolled Into the Shop" ...
# $ description: chr  "/r/MemeEconomy is a place where individuals can buy, sell, share, make, and invest in templates freely.\n\n\nv2"| __truncated__ "Don't just waste time, learn something too!" "Give us your scariest story in two sentences (or less)!" "For those absolutely stupid things that you see people bring, roll, or toss into your place of business and the"| __truncated__ ...
# $ subscribers: num  1430743 2965466 678423 1365589 586672 ...

Now you could feed these subreddits, in a loop, back into the thread finder to generate a much larger dataset, as sketched below.
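
A minimal sketch of that loop, capped at three subreddits here for brevity; the head() cap and the rbind accumulation are illustrative choices:

# fetch top thread URLs for a handful of the subreddits found above
target_subs <- head(cat_subreddits$subreddit, 3)
all_cat_threads <- do.call(rbind, lapply(target_subs, function(s) {
  find_thread_urls(subreddit = s, sort_by = "top")
}))
str(all_cat_threads)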

Lastly, let's suppose that you'd like to retrieve information about a particular user:

user <- "nationalgeographic"
nat_geo_user <- get_user_content(user)
str(nat_geo_user[[user]]$about)
# List of 7
# $ created_utc  : chr "2017-08-24"
# $ name         : chr "nationalgeographic"
# $ is_employee  : logi FALSE
# $ is_mod       : logi TRUE
# $ is_gold      : logi TRUE
# $ thread_karma : num 279068
# $ comment_karma: num 87406
str(nat_geo_user[[user]]$comments)
# 'data.frame':	997 obs. of  11 variables:
# $ url           : chr  "https://www.reddit.com/r/history/comments/anhdnl/im_historian_author_and_musician_mark_lee_gardner/" "https://www.reddit.com/r/history/comments/anhdnl/im_historian_author_and_musician_mark_lee_gardner/" "https://www.reddit.com/r/history/comments/anhdnl/im_historian_author_and_musician_mark_lee_gardner/" "https://www.reddit.com/r/history/comments/anhdnl/im_historian_author_and_musician_mark_lee_gardner/" ...
# $ date_utc      : chr  "2019-02-05" "2019-02-05" "2019-02-05" "2019-02-05" ...
# $ subreddit     : chr  "history" "history" "history" "history" ...
# $ thread_author : chr  "nationalgeographic" "nationalgeographic" "nationalgeographic" "nationalgeographic" ...
# $ comment_author: chr  "nationalgeographic" "nationalgeographic" "nationalgeographic" "nationalgeographic" ...
# $ thread_title  : chr  "I'm historian, author, and musician Mark Lee Gardner and I can tell you a lot about Jesse James, the infamou"| __truncated__ "I'm historian, author, and musician Mark Lee Gardner and I can tell you a lot about Jesse James, the infamou"| __truncated__ "I'm historian, author, and musician Mark Lee Gardner and I can tell you a lot about Jesse James, the infamou"| __truncated__ "I'm historian, author, and musician Mark Lee Gardner and I can tell you a lot about Jesse James, the infamou"| __truncated__ ...
# $ comment       : chr  "You're welcome.  Part of Jesse and Frank's success in eluding law enforcement (in addition to fast horses!) was"| __truncated__ "In 1874, the Missouri legislature passed the Suppression of Outlawry Act that set aside a \"state secret servic"| __truncated__ "Probably the quick-draw gunfight: two men staring each other down in the middle of the street and attempting to"| __truncated__ "That was actually part of the problem.  There was often little coordination between towns and the state.  In fa"| __truncated__ ...
# $ score         : num  2 2 9 4 8 6 6 10 13 2 ...
# $ up            : num  2 2 9 4 8 6 6 10 13 2 ...
# $ downs         : num  0 0 0 0 0 0 0 0 0 0 ...
# $ golds         : num  0 0 0 0 0 0 0 0 0 0 ...
str(nat_geo_user[[user]]$threads)
# 'data.frame':	999 obs. of  10 variables:
# $ url      : chr  "https://www.nationalgeographic.com/environment/2019/02/2018-fourth-warmest-year-ever-noaa-nasa-reports/?cmpid=o"| __truncated__ "https://www.reddit.com/r/history/comments/anhdnl/im_historian_author_and_musician_mark_lee_gardner/" "https://www.nationalgeographic.com/environment/2019/02/climate-change-alters-oceans-blues-greens/" "https://v.redd.it/ehqa55sbcge21" ...
# $ date_utc : chr  "2019-02-06" "2019-02-05" "2019-02-05" "2019-02-04" ...
# $ subreddit: chr  "u_nationalgeographic" "history" "u_nationalgeographic" "u_nationalgeographic" ...
# $ author   : chr  "nationalgeographic" "nationalgeographic" "nationalgeographic" "nationalgeographic" ...
# $ title    : chr  "The last five years were the hottest ever, NASA and NOAA declare" "I'm historian, author, and musician Mark Lee Gardner and I can tell you a lot about Jesse James, the infamou"| __truncated__ "Climate change will shift the oceans' colors" "Welcome to our half-time show of only the most superb owls" ...
# $ text     : chr  "" "Edit: Thanks so much for your questions.  They were excellent!  I've got to run now, but be sure to check out m"| __truncated__ "" "" ...
# $ golds    : num  0 0 0 0 2 0 1 0 0 0 ...
# $ score    : num  303 62 165 97 6406 ...
# $ ups      : num  303 62 165 97 6406 ...
# $ downs    : num  0 0 0 0 0 0 0 0 0 0 ...

Note that the above function also works with a vector of usernames, so you could call it as get_user_content(c("memes","nationalgeographic")); see the sketch below.
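
A small sketch, assuming each element of the returned list carries the same about/comments/threads structure shown above:

# retrieve two users at once; the result is a named list keyed by username
multi_users <- get_user_content(c("memes", "nationalgeographic"))
str(multi_users$nationalgeographic$about)
str(multi_users$memes$threads)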

That's all there is to it!

Contributing

If you'd like to improve this package, there are a couple of ways you can help. If you spot a bug or would like to propose a new feature, please create an issue. If you'd like to implement a bugfix or a new feature yourself, please open a pull request.

FAQ


  • Question: Why can I not get all the comments out of a thread?
  • Answer: The Reddit API limits how much data you can retrieve, and there is currently no way around it. If you'd like a larger sample of data, consider data dumps such as Pushshift.

  • Question: All functions in this library appear to be a little slow. Why is that?
  • Answer: The Reddit API allows users to make 60 requests per minute (1 request per second), which is why the URL parsers in this library intentionally throttle requests to conform to this limit (a pacing sketch follows after this FAQ).

  • Question: The find_thread_urls and find_subreddits functions do not always include the keyword terms used in the search. Is that a bug?
  • Answer: No, it is not a bug. We simply pass your search query to the Reddit API, which returns whatever results it finds semantically, based not only on your keyword inputs but also on your sort_by choice.

  • Question: I'd like to add more functionality to the package. Can you help?
  • Answer: Check the Reddit API to see whether your idea is feasible; if it is, then either create an issue or open a pull request.
