pistocop / subreddit-comments-dl

License: GPL-3.0
Download subreddit comments

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to subreddit-comments-dl

reddit-comment-bot
Reddit bot that auto replies to comments on set subreddits
Stars: ✭ 59 (+3.51%)
Mutual labels:  reddit, subreddit, praw
Redditdownloader
Scrapes Reddit to download media of your choice.
Stars: ✭ 521 (+814.04%)
Mutual labels:  scraper, reddit
Praw
PRAW, an acronym for "Python Reddit API Wrapper", is a python package that allows for simple access to Reddit's API.
Stars: ✭ 2,675 (+4592.98%)
Mutual labels:  reddit, praw
Media Scraper
Scrapes all photos and videos in a web page / Instagram / Twitter / Tumblr / Reddit / pixiv / TikTok
Stars: ✭ 206 (+261.4%)
Mutual labels:  scraper, reddit
timesearch
The subreddit archiver
Stars: ✭ 114 (+100%)
Mutual labels:  reddit, pushshift
Liked-Saved-Image-Downloader
Save content you enjoy!
Stars: ✭ 80 (+40.35%)
Mutual labels:  reddit, praw
Spam Bot 3000
Social media research and promotion, semi-autonomous CLI bot
Stars: ✭ 79 (+38.6%)
Mutual labels:  scraper, reddit
Multithreaded-Reddit-Image-Downloader
Does exactly what it says on the tin.
Stars: ✭ 38 (-33.33%)
Mutual labels:  reddit, praw
crypto-subreddits-cli
👽 Track Cryptocurrency Subreddits On The Command Line 👽
Stars: ✭ 24 (-57.89%)
Mutual labels:  reddit, subreddit
RedditExtractor
A minimalistic R wrapper for the Reddit API
Stars: ✭ 58 (+1.75%)
Mutual labels:  scraper, reddit
reddit
Reddit client for Bobby B Bot
Stars: ✭ 62 (+8.77%)
Mutual labels:  reddit, praw
subreddit-css
used for /r/web_design and /r/graphic_design
Stars: ✭ 44 (-22.81%)
Mutual labels:  reddit, subreddit
scripts
A collection of random scripts I coded up
Stars: ✭ 17 (-70.18%)
Mutual labels:  scraper, reddit
PrawWallpaperDownloader
Download images from reddit
Stars: ✭ 18 (-68.42%)
Mutual labels:  reddit, praw
RepostCheckerBot
Bot for checking reposts on reddit
Stars: ✭ 36 (-36.84%)
Mutual labels:  reddit, praw
Skraper
Kotlin/Java library and cli tool for scraping posts and media from various sources with neither authorization nor full page rendering (Facebook, Instagram, Twitter, Youtube, Tiktok, Telegram, Twitch, Reddit, 9GAG, Pinterest, Flickr, Tumblr, IFunny, VK, Pikabu)
Stars: ✭ 72 (+26.32%)
Mutual labels:  scraper, reddit
bdfr-html
Converts the output of the bulk downloader for reddit to a set of HTML pages.
Stars: ✭ 23 (-59.65%)
Mutual labels:  reddit, pushshift
reddit-fetch
A program to fetch some comments/pictures from reddit
Stars: ✭ 50 (-12.28%)
Mutual labels:  reddit, subreddit
vreddit-mirror-bot
🎥 Reddit bot that mirrors videos hosted on the native Reddit player to Gfycat and Streamable.
Stars: ✭ 23 (-59.65%)
Mutual labels:  reddit, praw
cat-message
Finds cat images/videos/gifs on reddit, sends them to my mom via applescript
Stars: ✭ 35 (-38.6%)
Mutual labels:  scraper, reddit

subreddit-comments-dl

Download all the text comments from a subreddit

Run the subreddit_downloader.py script multiple times to download the data, then run dataset_builder.py to build a single dataset.

🖱 More info on the website and on Medium.

🚀 Usage

Basic usage to download the submissions and their comments from the subreddits AskReddit and News:

# Use python 3.8.5

# Install the dependencies
pip install -r requirements.txt

# Download the AskReddit comments of the last 30 submissions
python src/subreddit_downloader.py AskReddit --batch-size 10 --laps 3 --reddit-id <reddit_id> --reddit-secret <reddit_secret> --reddit-username <reddit_username>

# Download the News comments created after 1 January 2021
python src/subreddit_downloader.py News --batch-size 512 --laps 3 --reddit-id <reddit_id> --reddit-secret <reddit_secret> --reddit-username <reddit_username> --utc-after 1609459201

# Build the dataset, the results will be under `./dataset/` path
python src/dataset_builder.py 
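
The --utc-after and --utc-before values are plain UTC epoch seconds. A minimal way to compute them with the standard library (the 1609459200 below is 1 January 2021 00:00:00 UTC):

# Illustrative helper, not part of the repo: turn a human-readable date
# into the epoch value expected by --utc-after / --utc-before
from datetime import datetime, timezone

print(int(datetime(2021, 1, 1, tzinfo=timezone.utc).timestamp()))  # -> 1609459200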

ℹ️ Where can I get the Reddit parameters?

| Parameter name | Description | How to get it | Example value |
| --- | --- | --- | --- |
| reddit_id | The client ID generated from the apps page | Official guide | 40oK80pF8ac3Cn |
| reddit_secret | The secret generated from the apps page | Copy the value as shown here | 9KEUOE7pi8dsjs9507asdeurowGCcg |
| reddit_username | The Reddit account name | The name you use to log in | pistoSniffer |
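
For illustration, this is roughly how the three parameters map onto a praw client (a sketch, not the script's actual code; the user_agent format below is an assumption):

import praw

# Sketch: the three CLI parameters mapped onto a praw.Reddit instance.
reddit = praw.Reddit(
    client_id="40oK80pF8ac3Cn",                       # --reddit-id
    client_secret="9KEUOE7pi8dsjs9507asdeurowGCcg",   # --reddit-secret
    user_agent="script:subreddit-comments-dl (by u/pistoSniffer)",  # built from --reddit-username
)
print(reddit.read_only)  # True: no password needed for read-only access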

⬇️ Output

dataset_builder.py creates a new folder with two CSV files. The script:

  • Removes rows with the same id
  • Provides a caching_size parameter so the whole dataset is never kept in RAM (see the sketch below)
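
A minimal sketch of how such a bounded-memory deduplication can work (illustrative; build_unique_dataset is a hypothetical helper, not the actual dataset_builder.py code):

import csv

def build_unique_dataset(src_paths, out_path, caching_size=1000):
    # Stream rows from the partial CSVs, drop duplicate ids, and flush
    # to disk every `caching_size` rows so the buffer never grows unbounded.
    seen_ids, buffer, writer = set(), [], None
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        for path in src_paths:
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    if row["id"] in seen_ids:
                        continue
                    seen_ids.add(row["id"])
                    buffer.append(row)
                    if writer is None:
                        writer = csv.DictWriter(out, fieldnames=row.keys())
                        writer.writeheader()
                    if len(buffer) >= caching_size:
                        writer.writerows(buffer)
                        buffer.clear()
        if writer is not None:
            writer.writerows(buffer)  # flush the remainder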

The two CSV files have the following structure:

submissions.csv

Each row is a submission from a specific subreddit; the id field is unique across the dataset (primary key).

| Column name | Description | Example |
| --- | --- | --- |
| subreddit | Name of the subreddit | MTB |
| id | Unique identifier of the submission | lhr2bo |
| created_utc | UTC timestamp of when the submission was created | 1613068060 |
| title | Title of the submission | Must ride So... |
| selftext | Text of the submission | What are the best trails to ride in... |
| full_link | Reddit unique link to the submission | https://www.reddit.com/r/MTB/comments/lhr2bo/must_ride_so_cali_trails/ |

comments.csv

Each row is a comment under a submission of a specific subreddit; the id field is unique across the dataset (primary key).

| Column name | Description | Example |
| --- | --- | --- |
| subreddit | Name of the subreddit | News |
| id | Unique identifier of the comment | gmz45xo |
| submission_id | Id of the comment's parent submission | lhr2bo |
| body | Text of the comment | We're past the point... |
| created_utc | UTC timestamp of when the comment was created | 1613072734 |
| parent_id | Id of the parent in the comment tree | t3_lhssi4 |
| permalink | Reddit unique link to the comment | /r/news/comments/lhssi4/air_force_wants_to_know_if_key_pacific_airfield/gmz45xo/ |
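
Since comments.csv references submissions.csv through submission_id, the two files can be joined directly; a hedged pandas sketch (the file paths are illustrative):

import pandas as pd

# Illustrative paths; dataset_builder.py writes the files under ./dataset/
submissions = pd.read_csv("dataset/submissions.csv")
comments = pd.read_csv("dataset/comments.csv")

# One row per comment, enriched with its parent submission's title
merged = comments.merge(
    submissions[["id", "title"]],
    left_on="submission_id",
    right_on="id",
    suffixes=("", "_submission"),
)
print(merged[["id", "submission_id", "title", "body"]].head())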

📖 Glossary

  • subreddit: a section of the Reddit website focused on a particular topic

  • submission: a post that appears in a subreddit; the posts you see when you open a subreddit page. Each submission has a tree of comments

  • comment: text written by a Reddit user under a submission inside a subreddit

    • The main goal of this repository is to gather the comments belonging to a subreddit
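
Because parent_id uses Reddit's fullname prefixes ("t3_" points at the submission itself, "t1_" at another comment), the comment tree of a submission can be rebuilt from comments.csv alone; a small illustrative sketch:

from collections import defaultdict

def build_tree(comment_rows):
    # Map each parent to its direct children; "root" collects top-level comments.
    children = defaultdict(list)
    for row in comment_rows:
        kind, _, parent = row["parent_id"].partition("_")
        # "t3" -> parent is the submission; "t1" -> parent is another comment
        children["root" if kind == "t3" else parent].append(row["id"])
    return children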

✍️ Notes and Q&A

  • Under the hood the script uses pushshift to gather submission ids and praw to collect the submissions' comments (see the sketch after the --help output below)
    • With this approach we request less data from pushshift
    • Because the praw API is used, Reddit credentials are required
  • More info about the subreddit_downloader.py script is available under the --help command:
  • Other packages:
    • psaw: Python Pushshift.io API Wrapper
  • [?] Empty data CSV:
    • Sometimes an empty CSV appears under /data/<subreddit>/<timestamp>/comments/xxx.csv
    • This happens when a batch of submissions has no comments; you can verify it by opening the equivalent /data/<subreddit>/<timestamp>/submissions/xxx.csv file (same xxx.csv name) and following the submission link
  • [?] The program is stuck and doesn't progress:
    • Run the program with the --debug flag to see on which submission it is freezing
    • Most likely the program is blocked on a submission with more than 10k comments, and the praw API needs to make many requests to gather all the data (which takes a long time)
    • If you don't want to wait, or you want more control over the number of comments fetched per submission, use the --comments-cap parameter
    • If provided, the system requests new comments from the praw API comments_cap times instead of downloading all of them
      • The higher the value, the more comments are downloaded
      • Set it to 0 to download only the comments shown on the first page of the submission
      • Set it to 64 to be reasonably sure a good amount of data is downloaded
      • Tune the parameter to your needs
python src/subreddit_downloader.py --help
Usage: subreddit_downloader.py [OPTIONS] SUBREDDIT

  Download all the submissions and relative comments from a subreddit.

Arguments:
  SUBREDDIT  The subreddit name  [required]

Options:
  --output-dir TEXT       Optional output directory  [default: ./data/]
  --batch-size INTEGER    Request `batch_size` submission per time  [default:
                          10]

  --laps INTEGER          How many times request `batch_size` reddit
                          submissions  [default: 3]

  --reddit-id TEXT        Reddit client_id, visit https://github.com/reddit-
                          archive/reddit/wiki/OAuth2  [required]

  --reddit-secret TEXT    Reddit client_secret, visit
                          https://github.com/reddit-archive/reddit/wiki/OAuth2
                          [required]

  --reddit-username TEXT  Reddit username, used to build the `user_agent`
                          string, visit https://github.com/reddit-
                          archive/reddit/wiki/API  [required]

  --utc-after TEXT        Fetch the submissions after this UTC date
  --utc-before TEXT       Fetch the submissions before this UTC date
  --comments-cap INTEGER  Some submissions have more than 10k nested comments
                          and stall the praw API call. If provided, the system
                          requests new comments `comments_cap` times from the
                          praw API. Under the hood `comments_cap` is passed
                          directly to the `replace_more` function as the
                          `limit` parameter. For more info see the README and
                          visit https://asyncpraw.readthedocs.io/en/latest/code_overview/other/commentforest.html#asyncpraw.models.comment_forest.CommentForest.replace_more

  --debug / --no-debug    Enable debug logging  [default: False]
  --install-completion    Install completion for the current shell.
  --show-completion       Show completion for the current shell, to copy it or
                          customize the installation.

  --help                  Show this message and exit.
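
For reference, a hedged sketch of the pushshift + praw flow described in the notes above (credential values and limits are illustrative, not the script's actual code):

import praw
from psaw import PushshiftAPI

reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
api = PushshiftAPI()

# 1. Ask pushshift only for submission ids (a cheap request).
submission_ids = [s.id for s in api.search_submissions(subreddit="AskReddit", limit=10)]

# 2. Let praw fetch the comment trees; the `limit` below is what
#    --comments-cap feeds into replace_more (0 = first page only,
#    None = fetch everything, which can be very slow on huge threads).
for sid in submission_ids:
    submission = reddit.submission(id=sid)
    submission.comments.replace_more(limit=64)
    for comment in submission.comments.list():
        print(comment.id, comment.body[:40])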

💤 TODO

dataset_builder.py

  • store some dataset info (subreddit, max/min UTC and human-readable dates, number of rows)

subreddit_downloader.py

  • use async functions if possible to gather more data concurrently
  • load user credentials in subreddit_downloader.py from a local config file
  • store/log the UTC and human-readable datetime
  • use case: download all data from X datetime until now
    • early stopping if no new data is fetched
  • refactor dataset_builder.py:_rows_parser: find a more efficient approach to check for duplicate ids (see the sketch below)
    • maybe switch to pandas as the matrix manager
  • should we switch to psaw?
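
For the duplicate-check TODO, one possible pandas-based direction (a sketch under the assumption that the whole file fits in memory, which is exactly the trade-off to evaluate against the current streaming approach):

import pandas as pd

# Possible pandas-based dedup for dataset_builder.py:_rows_parser (TODO sketch):
# simple and fast, but loads the whole CSV into RAM.
df = pd.read_csv("dataset/comments.csv")  # illustrative path
df = df.drop_duplicates(subset="id", keep="first")
df.to_csv("dataset/comments_unique.csv", index=False)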