toddwschneider / Rdrreddit

rdrreddit

Materials in support of this post: http://toddwschneider.com/posts/the-reddit-front-page-is-not-a-meritocracy/

There are 3 main components to the repo:

1. Rails application that grabs the top 100 items from reddit every 5 minutes

The app is not intended to be used as a web server, just as a clock process and delayed job worker. You can run it with:

bundle exec foreman start -f Procfile.clockandworker

The clock dumps a blob of serialized text into the reddit_observations table every 5 minutes, then a delayed job worker processes each of those blobs into the posts and observations tables. Some additional methods cache a few attributes on those tables and fetch data from the Imgur API; these methods are run manually from the Rails console.
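As a rough illustration of the worker step described above, here is a minimal sketch of turning one raw blob into per-post observation rows. It assumes the blob is JSON in reddit's public listing shape; the helper name `observations_from_blob` and the row keys (`reddit_id`, `rank`, and so on) are hypothetical and are not taken from the app's actual schema or code.

```ruby
require "json"

# Hypothetical helper (not from the repo): flatten one raw listing blob
# into one row per post. Assumes the blob is JSON shaped like reddit's
# public listing response:
#   {"data": {"children": [{"data": {"id": ..., "score": ...}}, ...]}}
def observations_from_blob(blob, observed_at)
  children = JSON.parse(blob).fetch("data").fetch("children")

  children.each_with_index.map do |child, index|
    post = child.fetch("data")
    {
      reddit_id: post.fetch("id"),
      title: post["title"],
      subreddit: post["subreddit"],
      score: post["score"],
      rank: index + 1, # position in the listing, 1 = top
      observed_at: observed_at
    }
  end
end

# Made-up sample data, for illustration only.
sample_blob = JSON.generate(
  "data" => {
    "children" => [
      { "data" => { "id" => "abc123", "title" => "First", "subreddit" => "pics", "score" => 4200 } },
      { "data" => { "id" => "def456", "title" => "Second", "subreddit" => "funny", "score" => 3100 } }
    ]
  }
)

rows = observations_from_blob(sample_blob, Time.utc(2014, 9, 15, 12, 0, 0))
rows.each { |row| puts "#{row[:rank]}. #{row[:reddit_id]} (#{row[:score]})" }
```

In the real app, rows like these would presumably be written to the posts and observations tables by the delayed job worker rather than printed.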

2. R scripts for data analysis

reddit_analysis.R does the heavy lifting

3. Postgres database dump file

rdr_seed.dump contains data from the reddit top 100 between September 15 and October 31, 2014

It includes only the posts and observations tables. The raw content in the reddit_observations table would take up too much space, and none of the analysis depends on that table anyway. You can restore the database on your local machine with pg_restore (install Postgres first if you haven't already):

pg_restore --verbose --clean --no-acl --no-owner -h localhost -d rdrreddit_development /path/to/rdr_seed.dump

The dump file is about 25 MB compressed, and will take up about 175 MB on disk once fully restored.
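pg_restore with -d connects to an existing database, so the target has to be created before restoring. The sketch below builds the two commands as argument lists; the helper name `restore_commands` is hypothetical, and the `createdb` step is an assumption that may be unnecessary if the database was already created some other way (for example through the Rails app).

```ruby
# Hypothetical helper (not from the repo): build the argument lists for
# creating the target database and restoring the dump into it.
# The pg_restore flags mirror the command shown in this README.
def restore_commands(dump_path, database: "rdrreddit_development", host: "localhost")
  [
    ["createdb", "-h", host, database],
    ["pg_restore", "--verbose", "--clean", "--no-acl", "--no-owner",
     "-h", host, "-d", database, dump_path]
  ]
end

commands = restore_commands("/path/to/rdr_seed.dump")
commands.each { |argv| puts argv.join(" ") }

# To actually run them instead of printing:
#   commands.each { |argv| system(*argv) }
```

Passing each command to `system` as separate arguments avoids shell quoting problems if the dump path contains spaces.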
