OmarEinea / GoodReadsScraper

License: GPL-3.0
📚 A GoodReads.com scraper script to get book reviews, including text and rating.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to GoodReadsScraper

BookingScraper
🌎 🏨 Scrape Booking.com 🏨 🌎
Stars: ✭ 68 (+88.89%)
Mutual labels:  beautifulsoup, webscraping
OkanimeDownloader
Scrape your favorite Anime from Okanime.com without effort
Stars: ✭ 13 (-63.89%)
Mutual labels:  beautifulsoup, webscraping
non-api-fb-scraper
Scrape public FaceBook posts from any group or user into a .csv file without needing to register for any API access
Stars: ✭ 40 (+11.11%)
Mutual labels:  beautifulsoup, webscraping
Sig To Googlecalendar
A python script to get class schedules on UFLA's SIG and convert to a .CSV file to use in Google Calendar
Stars: ✭ 14 (-61.11%)
Mutual labels:  beautifulsoup, webscraping
PacPaw
Pawn package manager for SA-MP
Stars: ✭ 14 (-61.11%)
Mutual labels:  beautifulsoup, webscraping
Soup
Web Scraper in Go, similar to BeautifulSoup
Stars: ✭ 1,685 (+4580.56%)
Mutual labels:  beautifulsoup, webscraping
Jssoup
JavaScript + BeautifulSoup = JSSoup
Stars: ✭ 203 (+463.89%)
Mutual labels:  beautifulsoup
goodreads-to-sqlite
Export your (or other people's) Goodreads data to SQLite
Stars: ✭ 62 (+72.22%)
Mutual labels:  goodreads
Pornhub Api
Unofficial API for PornHub.com in Python
Stars: ✭ 181 (+402.78%)
Mutual labels:  beautifulsoup
Requests Html
Pythonic HTML Parsing for Humans™
Stars: ✭ 12,268 (+33977.78%)
Mutual labels:  beautifulsoup
NordVPN-switcher
Rotate between different NordVPN servers with ease. Works both on Linux and Windows without any required changes to your code!
Stars: ✭ 143 (+297.22%)
Mutual labels:  webscraping
Goodreads visualization
A Jupyter notebook where I play with my Goodreads data
Stars: ✭ 51 (+41.67%)
Mutual labels:  goodreads
BooksAndBot
Telegram inline bot. Search for books and share them in a conversation
Stars: ✭ 26 (-27.78%)
Mutual labels:  goodreads
Csdnbot
CSDN resource downloader
Stars: ✭ 209 (+480.56%)
Mutual labels:  beautifulsoup
alfred-goodreads-workflow
No description or website provided.
Stars: ✭ 20 (-44.44%)
Mutual labels:  goodreads
Bet On Sibyl
Machine Learning Model for Sport Predictions (Football, Basketball, Baseball, Hockey, Soccer & Tennis)
Stars: ✭ 190 (+427.78%)
Mutual labels:  beautifulsoup
aws-pdf-textract-pipeline
🔍 Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS textract. Built with AWS CDK + TypeScript
Stars: ✭ 141 (+291.67%)
Mutual labels:  webscraping
goodreads-sh
📙Command line interface for Goodreads.com. Written in Rust.
Stars: ✭ 27 (-25%)
Mutual labels:  goodreads
computer book list
A computer science book list ranked by combined Douban and Goodreads scores
Stars: ✭ 1,535 (+4163.89%)
Mutual labels:  goodreads
ReaDB
ReaDB is your private digital bookshelf. Read. Review. Remember.
Stars: ✭ 84 (+133.33%)
Mutual labels:  goodreads

GoodReads Reviews Scraper

This is a Python 3 web scraping script to get book reviews from goodreads.com,
using the browser automation tool Selenium and BeautifulSoup for pulling data out of HTML.
I used it to scrape around 700k Arabic reviews in 2018 (Arabic reviews are fewer than English ones).
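
For orientation, below is a minimal sketch of the Selenium + BeautifulSoup pattern the scraper relies on. It is not the project's actual code (that lives in Browser.py and Reviews.py), and the book URL is only an illustrative example.

# Minimal sketch, not the project's code: render a page with Selenium,
# then hand the HTML to BeautifulSoup for parsing.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # requires ChromeDriver, see Requirements below
driver.get("https://www.goodreads.com/book/show/1")  # example book page
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.title.get_text(strip=True))  # e.g. the page title
driver.quit()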

Papers

We ran experiments on the collected data and published the details in two research papers:

Contents

  • Analyzer.py: short script to display some statistics about the scraped book reviews
  • Books.py: class to scrape book ids from a Goodreads list, a lists search, or a shelf
  • Browser.py: subclass of the Chrome WebDriver class, specialized for GoodReads browsing
  • Reviews.py: class to scrape reviews of Goodreads books using their book ids
  • Sample.py: sample script showing a complete use of the scraper with error handling
  • Tools.py: set of helper functions used by the other scripts
  • Writer.py: class to write scraped reviews to files
  • requirements.txt: list of required Python modules to install

Requirements

To install the requirements listed in requirements.txt, run the following (the exact command depends on your OS):

pip install -r requirements.txt

Also, since the scraper uses Selenium to control the Chrome browser,
you'll need to download ChromeDriver for your specific OS from here.
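
As a quick sanity check that the driver is installed correctly, something like the following should open and close a Chrome window (a hedged sketch assuming Selenium 4's Service API; the driver path is a placeholder, and if chromedriver is on your PATH you can simply call webdriver.Chrome()):

# Hedged ChromeDriver check; adjust the placeholder path to where you put chromedriver.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service("/path/to/chromedriver"))
driver.get("https://www.goodreads.com")
driver.quit()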

Documentation

  • A Books object (from Books.py) represents the books to be scraped.
    class Books() doesn't take any arguments

      Notes for the next two methods:
      browse can be one of the following:
      "shelf", "author", "lists", or "list" (the default)
      keyword can be the id of a "shelf", an "author", or a "list",
      or the search keyword in case you're searching for "lists"
    
    • Books.get_books(keyword, browse="list")
      Scrapes book ids and returns them as an array.

    • Books.output_books(keyword=None, browse="list", file_name="books")
      Scrapes book ids and writes them to a file named by the
      file_name value (without extension); if no file_name is sent,
      they are written to books.txt by default.

    • Books.append_books(books_ids)
      Appends an external array of book ids to the class storage
      (hint: it accepts what Books.get_books() returns).

  • A Reviews object (from Reviews.py) represents the scraped reviews.
    class Reviews(lang="ar")
    lang is the language of the reviews to look for / scrape; it can be one of these ISO 639-1 codes:

      af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he,
      hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl,
      pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn, zh-tw
    
    • Reviews.output_book_reviews(book_id)
      Scrapes a book's reviews and writes them to a file.

    • Reviews.output_books_reviews(books_ids, consider_previous=True)
      Scrapes the reviews of an array of books and writes them to a file.
      consider_previous sets whether to take into account the books that
      have already been scraped or to delete them and start over.

    • Reviews.wr.read_books(file_name="books")
      Reads book ids from a file and returns them.

  • Tools module methods:

    • Manager.count_files_lines()
      Returns and prints the total number of lines in the scraped books files

    • Manager.delete_repeated_reviews()
      Returns unique review ids, deletes all repeated ones, and prints info

    • Manager.combine_reviews()
      Writes a single "reviews.txt" file containing all reviews

    • Manager.split_reviews(n)
      Splits the combined "reviews.txt" file into n smaller files

The Browser and Writer classes are only used internally by the Books and Reviews classes; a usage sketch of the methods above follows.
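
To complement the Demo below, here is a hedged sketch that exercises the remaining methods documented above, using only the names and defaults listed in this section. The keyword, book id, and file names are illustrative, and the Tools helpers are called directly after from Tools import *, as the Demo does:

from Books import Books
from Reviews import Reviews
from Tools import *

b = Books()
ids = b.get_books("best books", "lists")            # search lists by keyword
b.append_books(ids)                                 # merge externally obtained ids
b.output_books("arabic", "shelf", "arabic_books")   # write shelf ids to arabic_books.txt

r = Reviews(lang="en")
r.output_book_reviews("2767052")                    # a single book id (example)
books_ids = r.wr.read_books("arabic_books")         # read back the ids written above
r.output_books_reviews(books_ids, consider_previous=True)

count_files_lines()     # total number of scraped lines
combine_reviews()       # produce a single reviews.txt
split_reviews(4)        # split it back into 4 smaller files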

Demo

Import needed modules:

from Books import Books
from Reviews import Reviews
from Tools import *

Scrape book ids from books shelved as "arabic":

b = Books()
books_ids = b.get_books("arabic", "shelf")

Scrape the books' reviews and write them to a file:

r = Reviews("ar")
r.output_books_reviews(books_ids)

Filter reviews, then combine them:

delete_repeated_reviews()
combine_reviews()

A more comprehensive example can be found in Sample.py
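
Sample.py remains the authoritative example; purely to illustrate the kind of error handling it demonstrates, a run could be wrapped roughly like this (a hedged sketch, not Sample.py's actual contents; the retry loop and keyword are assumptions):

from Books import Books
from Reviews import Reviews

b = Books()
r = Reviews("ar")

for attempt in range(3):                # retry a few times on transient failures
    try:
        ids = b.get_books("arabic", "shelf")
        r.output_books_reviews(ids)     # consider_previous=True skips already-scraped books
        break
    except Exception as error:          # e.g. network or WebDriver hiccups
        print(f"Attempt {attempt + 1} failed: {error}")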

Resources

Reference

Omar Einea [email protected]

Supervised by Dr. Ashraf Elnagar.

University of Sharjah, United Arab Emirates, July 2016

License

Copyright (C) 2019 by Omar Einea.

This is an open source tool licensed under GPL v3.0. A copy of the license can be found here.
