pistocop / subreddit-comments-dl

License: GPL-3.0
Download subreddit comments

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to subreddit-comments-dl

reddit-comment-bot
Reddit bot that auto replies to comments on set subreddits
Stars: ✭ 59 (+3.51%)
Mutual labels:  reddit, subreddit, praw
Redditdownloader
Scrapes Reddit to download media of your choice.
Stars: ✭ 521 (+814.04%)
Mutual labels:  scraper, reddit
Praw
PRAW, an acronym for "Python Reddit API Wrapper", is a python package that allows for simple access to Reddit's API.
Stars: ✭ 2,675 (+4592.98%)
Mutual labels:  reddit, praw
Media Scraper
Scrapes all photos and videos in a web page / Instagram / Twitter / Tumblr / Reddit / pixiv / TikTok
Stars: ✭ 206 (+261.4%)
Mutual labels:  scraper, reddit
timesearch
The subreddit archiver
Stars: ✭ 114 (+100%)
Mutual labels:  reddit, pushshift
Liked-Saved-Image-Downloader
Save content you enjoy!
Stars: ✭ 80 (+40.35%)
Mutual labels:  reddit, praw
Spam Bot 3000
Social media research and promotion, semi-autonomous CLI bot
Stars: ✭ 79 (+38.6%)
Mutual labels:  scraper, reddit
Multithreaded-Reddit-Image-Downloader
Does exactly what it says on the tin.
Stars: ✭ 38 (-33.33%)
Mutual labels:  reddit, praw
crypto-subreddits-cli
👽 Track Cryptocurrency Subreddits On The Command Line 👽
Stars: ✭ 24 (-57.89%)
Mutual labels:  reddit, subreddit
RedditExtractor
A minimalistic R wrapper for the Reddit API
Stars: ✭ 58 (+1.75%)
Mutual labels:  scraper, reddit
reddit
Reddit client for Bobby B Bot
Stars: ✭ 62 (+8.77%)
Mutual labels:  reddit, praw
subreddit-css
used for /r/web_design and /r/graphic_design
Stars: ✭ 44 (-22.81%)
Mutual labels:  reddit, subreddit
scripts
A collection of random scripts I coded up
Stars: ✭ 17 (-70.18%)
Mutual labels:  scraper, reddit
PrawWallpaperDownloader
Download images from reddit
Stars: ✭ 18 (-68.42%)
Mutual labels:  reddit, praw
RepostCheckerBot
Bot for checking reposts on reddit
Stars: ✭ 36 (-36.84%)
Mutual labels:  reddit, praw
Skraper
Kotlin/Java library and cli tool for scraping posts and media from various sources with neither authorization nor full page rendering (Facebook, Instagram, Twitter, Youtube, Tiktok, Telegram, Twitch, Reddit, 9GAG, Pinterest, Flickr, Tumblr, IFunny, VK, Pikabu)
Stars: ✭ 72 (+26.32%)
Mutual labels:  scraper, reddit
bdfr-html
Converts the output of the bulk downloader for reddit to a set of HTML pages.
Stars: ✭ 23 (-59.65%)
Mutual labels:  reddit, pushshift
reddit-fetch
A program to fetch some comments/pictures from reddit
Stars: ✭ 50 (-12.28%)
Mutual labels:  reddit, subreddit
vreddit-mirror-bot
🎥 Reddit bot that mirrors videos hosted on the native Reddit player to Gfycat and Streamable.
Stars: ✭ 23 (-59.65%)
Mutual labels:  reddit, praw
cat-message
Finds cat images/videos/gifs on reddit, sends them to my mom via applescript
Stars: ✭ 35 (-38.6%)
Mutual labels:  scraper, reddit

subreddit-comments-dl

Download all the text comments from a subreddit

Run the subreddit_downloader.py script multiple times to download the data, then run dataset_builder.py to build a single dataset.

🖱 More info on the website and on Medium.

🚀 Usage

Basic usage to download the submissions and their comments from the subreddits AskReddit and News:

# Use python 3.8.5

# Install the dependencies
pip install -r requirements.txt

# Download the AskReddit comments of the last 30 submissions
python src/subreddit_downloader.py AskReddit --batch-size 10 --laps 3 --reddit-id <reddit_id> --reddit-secret <reddit_secret> --reddit-username <reddit_username>

# Download the News comments created after 1 January 2021
python src/subreddit_downloader.py News --batch-size 512 --laps 3 --reddit-id <reddit_id> --reddit-secret <reddit_secret> --reddit-username <reddit_username> --utc-after 1609459201

# Build the dataset, the results will be under `./dataset/` path
python src/dataset_builder.py 
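
The --utc-after and --utc-before values are plain UTC epoch seconds. A minimal way to compute them with the standard library (the 1609459200 below is 1 January 2021 00:00:00 UTC):

# Illustrative helper, not part of the repo: turn a human-readable date
# into the epoch value expected by --utc-after / --utc-before
from datetime import datetime, timezone

print(int(datetime(2021, 1, 1, tzinfo=timezone.utc).timestamp()))  # -> 1609459200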

ℹ️ Where can I get the Reddit parameters?

| Parameter name | Description | How to get it | Example value |
| --- | --- | --- | --- |
| reddit_id | The client ID generated from the apps page | Official guide | 40oK80pF8ac3Cn |
| reddit_secret | The secret generated from the apps page | Copy the value as shown here | 9KEUOE7pi8dsjs9507asdeurowGCcg |
| reddit_username | The Reddit account name | The name you use to log in | pistoSniffer |
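
For illustration, this is roughly how the three parameters map onto a praw client (a sketch, not the script's actual code; the user_agent format below is an assumption):

import praw

# Sketch: the three CLI parameters mapped onto a praw.Reddit instance.
reddit = praw.Reddit(
    client_id="40oK80pF8ac3Cn",                       # --reddit-id
    client_secret="9KEUOE7pi8dsjs9507asdeurowGCcg",   # --reddit-secret
    user_agent="script:subreddit-comments-dl (by u/pistoSniffer)",  # built from --reddit-username
)
print(reddit.read_only)  # True: no password needed for read-only access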

⬇️ Output

dataset_builder.py creates a new folder with two CSV files. The script:

  • Removes rows with the same id
  • Provides a caching_size parameter so the whole dataset is never kept in RAM (see the sketch below)
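
A minimal sketch of how such a bounded-memory deduplication can work (illustrative; build_unique_dataset is a hypothetical helper, not the actual dataset_builder.py code):

import csv

def build_unique_dataset(src_paths, out_path, caching_size=1000):
    # Stream rows from the partial CSVs, drop duplicate ids, and flush
    # to disk every `caching_size` rows so the buffer never grows unbounded.
    seen_ids, buffer, writer = set(), [], None
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        for path in src_paths:
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    if row["id"] in seen_ids:
                        continue
                    seen_ids.add(row["id"])
                    buffer.append(row)
                    if writer is None:
                        writer = csv.DictWriter(out, fieldnames=row.keys())
                        writer.writeheader()
                    if len(buffer) >= caching_size:
                        writer.writerows(buffer)
                        buffer.clear()
        if writer is not None:
            writer.writerows(buffer)  # flush the remainder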

The two CSV files have the following structure:

submissions.csv

Each row is a submission from a specific subreddit; the id field is unique across the dataset (primary key).

| Column name | Description | Example |
| --- | --- | --- |
| subreddit | Name of the subreddit | MTB |
| id | Unique identifier of the submission | lhr2bo |
| created_utc | UTC timestamp of when the submission was created | 1613068060 |
| title | Title of the submission | Must ride So... |
| selftext | Text of the submission | What are the best trails to ride in... |
| full_link | Reddit unique link to the submission | https://www.reddit.com/r/MTB/comments/lhr2bo/must_ride_so_cali_trails/ |

comments.csv

Each row is a comment under a submission of a specific subreddit; the id field is unique across the dataset (primary key).

| Column name | Description | Example |
| --- | --- | --- |
| subreddit | Name of the subreddit | News |
| id | Unique identifier of the comment | gmz45xo |
| submission_id | Id of the comment's parent submission | lhr2bo |
| body | Text of the comment | We're past the point... |
| created_utc | UTC timestamp of when the comment was created | 1613072734 |
| parent_id | Id of the parent in the comment tree | t3_lhssi4 |
| permalink | Reddit unique link to the comment | /r/news/comments/lhssi4/air_force_wants_to_know_if_key_pacific_airfield/gmz45xo/ |
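
Since comments.csv references submissions.csv through submission_id, the two files can be joined directly; a hedged pandas sketch (the file paths are illustrative):

import pandas as pd

# Illustrative paths; dataset_builder.py writes the files under ./dataset/
submissions = pd.read_csv("dataset/submissions.csv")
comments = pd.read_csv("dataset/comments.csv")

# One row per comment, enriched with its parent submission's title
merged = comments.merge(
    submissions[["id", "title"]],
    left_on="submission_id",
    right_on="id",
    suffixes=("", "_submission"),
)
print(merged[["id", "submission_id", "title", "body"]].head())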

📖 Glossary

  • subreddit: a section of the Reddit website focused on a particular topic

  • submission: a post that appears in a subreddit; the posts you see when you open a subreddit page. Each submission has a tree of comments

  • comment: text written by a Reddit user under a submission inside a subreddit

    • The main goal of this repository is to gather the comments belonging to a subreddit
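
Because parent_id uses Reddit's fullname prefixes ("t3_" points at the submission itself, "t1_" at another comment), the comment tree of a submission can be rebuilt from comments.csv alone; a small illustrative sketch:

from collections import defaultdict

def build_tree(comment_rows):
    # Map each parent to its direct children; "root" collects top-level comments.
    children = defaultdict(list)
    for row in comment_rows:
        kind, _, parent = row["parent_id"].partition("_")
        # "t3" -> parent is the submission; "t1" -> parent is another comment
        children["root" if kind == "t3" else parent].append(row["id"])
    return children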

✍️ Notes and Q&A

  • Under the hood the script uses pushshift to gather submission ids and praw to collect the submissions' comments (see the sketch after the --help output below)
    • With this approach we request less data from pushshift
    • Because the praw API is used, Reddit credentials are required
  • More info about the subreddit_downloader.py script is available under the --help command:
  • Other packages:
    • psaw: Python Pushshift.io API Wrapper
  • [?] Empty data CSV:
    • Sometimes an empty CSV appears under /data/<subreddit>/<timestamp>/comments/xxx.csv
    • This happens when a batch of submissions has no comments; you can verify it by opening the equivalent /data/<subreddit>/<timestamp>/submissions/xxx.csv file (same xxx.csv name) and following the submission link
  • [?] The program is stuck and doesn't progress:
    • Run the program with the --debug flag to see on which submission it is freezing
    • Most likely the program is blocked on a submission with more than 10k comments, and the praw API needs to make many requests to gather all the data (which takes a long time)
    • If you don't want to wait, or you want more control over the number of comments fetched per submission, use the --comments-cap parameter
    • If provided, the system requests new comments from the praw API comments_cap times instead of downloading all of them
      • The higher the value, the more comments are downloaded
      • Set it to 0 to download only the comments shown on the first page of the submission
      • Set it to 64 to be reasonably sure a good amount of data is downloaded
      • Tune the parameter to your needs
python src/subreddit_downloader.py --help
Usage: subreddit_downloader.py [OPTIONS] SUBREDDIT

  Download all the submissions and relative comments from a subreddit.

Arguments:
  SUBREDDIT  The subreddit name  [required]

Options:
  --output-dir TEXT       Optional output directory  [default: ./data/]
  --batch-size INTEGER    Request `batch_size` submission per time  [default:
                          10]

  --laps INTEGER          How many times request `batch_size` reddit
                          submissions  [default: 3]

  --reddit-id TEXT        Reddit client_id, visit https://github.com/reddit-
                          archive/reddit/wiki/OAuth2  [required]

  --reddit-secret TEXT    Reddit client_secret, visit
                          https://github.com/reddit-archive/reddit/wiki/OAuth2
                          [required]

  --reddit-username TEXT  Reddit username, used to build the `user_agent`
                          string, visit https://github.com/reddit-
                          archive/reddit/wiki/API  [required]

  --utc-after TEXT        Fetch the submissions after this UTC date
  --utc-before TEXT       Fetch the submissions before this UTC date
  --comments-cap INTEGER  Some submissions have more than 10k nested comments
                          and stall the praw API call. If provided, the system
                          requests new comments `comments_cap` times from the
                          praw API. Under the hood `comments_cap` is passed
                          directly to the `replace_more` function as the
                          `limit` parameter. For more info see the README and
                          visit https://asyncpraw.readthedocs.io/en/latest/code_overview/other/commentforest.html#asyncpraw.models.comment_forest.CommentForest.replace_more

  --debug / --no-debug    Enable debug logging  [default: False]
  --install-completion    Install completion for the current shell.
  --show-completion       Show completion for the current shell, to copy it or
                          customize the installation.

  --help                  Show this message and exit.
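
For reference, a hedged sketch of the pushshift + praw flow described in the notes above (credential values and limits are illustrative, not the script's actual code):

import praw
from psaw import PushshiftAPI

reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
api = PushshiftAPI()

# 1. Ask pushshift only for submission ids (a cheap request).
submission_ids = [s.id for s in api.search_submissions(subreddit="AskReddit", limit=10)]

# 2. Let praw fetch the comment trees; the `limit` below is what
#    --comments-cap feeds into replace_more (0 = first page only,
#    None = fetch everything, which can be very slow on huge threads).
for sid in submission_ids:
    submission = reddit.submission(id=sid)
    submission.comments.replace_more(limit=64)
    for comment in submission.comments.list():
        print(comment.id, comment.body[:40])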

💤 TODO

dataset_builder.py

  • store some dataset info (subreddit, max/min UTC and human-readable dates, number of rows)

subreddit_downloader.py

  • use async functions if possible to gather more data concurrently
  • load user credentials in subreddit_downloader.py from a local config file
  • store/log the UTC and human-readable datetime
  • use case: download all data from X datetime until now
    • early stopping if no new data is fetched
  • refactor dataset_builder.py:_rows_parser: find a more efficient approach to check for duplicate ids (see the sketch below)
    • maybe switch to pandas as the matrix manager
  • should we switch to psaw?
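
For the duplicate-check TODO, one possible pandas-based direction (a sketch under the assumption that the whole file fits in memory, which is exactly the trade-off to evaluate against the current streaming approach):

import pandas as pd

# Possible pandas-based dedup for dataset_builder.py:_rows_parser (TODO sketch):
# simple and fast, but loads the whole CSV into RAM.
df = pd.read_csv("dataset/comments.csv")  # illustrative path
df = df.drop_duplicates(subset="id", keep="first")
df.to_csv("dataset/comments_unique.csv", index=False)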