nicodds / chesf

License: Apache-2.0
CHeSF is the Chrome Headless Scraping Framework, very early alpha code for scraping JavaScript-intensive web pages.


Introduction

In the era of Big Data, the web is an endless source of information. For this reason, there are plenty of good tools and frameworks for scraping web pages.

So, in an ideal world, there should be no need for yet another web scraping framework. Nevertheless, there are always subtle differences between theory and practice, and web scraping is no exception.

Real-world web pages are often full of JavaScript code that alters the DOM as the user navigates. Consequently, scraping JavaScript-intensive pages with traditional tools can be difficult or outright impossible.
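To make this concrete, here is a small self-contained illustration (plain Python, no CHeSF or browser involved; the page and review text are invented). The visible content of the page below exists only after a browser executes the script, so a traditional fetch-and-parse scraper sees nothing but an empty placeholder.

```python
from html.parser import HTMLParser

# Invented example page: the review markup is generated by JavaScript,
# so the raw HTML contains only an empty placeholder <div>.
raw_html = """
<html><body>
  <div id="reviews"></div>
  <script>
    document.getElementById("reviews").innerHTML = "<p>Great hotel!</p>";
  </script>
</body></html>
"""

class VisibleText(HTMLParser):
    """Collects the text a static parser would see (script bodies excluded)."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script:
            self.chunks.append(data)

parser = VisibleText()
parser.feed(raw_html)
visible = "".join(parser.chunks)

# A static scraper never sees the review text, because it only exists
# in the DOM after the script has run.
print("Great hotel!" in visible)  # False
```

A headless browser, by contrast, executes the script before exposing the DOM, so the populated page is what the scraper queries.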

These considerations sparked the birth of CHeSF, the Chrome Headless Scraping Framework. To make a long story short, CHeSF relies on both selenium-python and ChromeDriver to scrape web pages even when JavaScript would otherwise make it impossible.

I know that some nice solutions to this problem already exist, but in my view CHeSF is simpler: you just create a class that inherits from it, define a parse method, and launch it with a start URL.

The framework is still very much alpha, so you should expect things to change rapidly. Currently, there is no documentation and no packaging. There is just an example showing how you can use the framework to easily scrape TripAdvisor reviews. Personally, I used it to collect this dataset, i.e. a collection of more than 220k TripAdvisor reviews.

Basic usage

CHeSF borrows its working philosophy (in part) from Scrapy: writing a scraping tool means creating (at least) a Python class.

import sys
import os

# the path to the chrome driver executable
path_to_chrome_driver_exe = 'path_to_chromedriver.exe'
# currently, no package exists for CHeSF, so use this sys.path hack
# until I find some free time to implement packaging
path_to_chesf = 'path_to_chesf_in_your_system'

sys.path.insert(0, os.path.abspath(path_to_chesf))
from chesf import CHeSF, MAX_ATTEMPTS

class TripAdvisorScraper(CHeSF):
    def __init__(self):
        super().__init__(path_to_chrome_driver_exe, debug=False)
        

    # this is the core of the scraper: you must define parse(), since by
    # convention it is the callback invoked on the first url passed;
    # afterwards, you can register other callbacks
    def parse(self):
        # the main advantage of CHeSF is that you can use JavaScript
        # directly to parse the page
        script = """
            let urls = [];
            let anchors = document.querySelectorAll("a.property_title.prominent");

            for (let a of anchors)
                urls.push(a.href);

            return urls;
        """

        # the array returned by the javascript is automagically
        # converted to a python list (this is selenium magic)
        links = self.call_js(script)

        for link in links:
            print(link)

        # you can use both xpath and css selectors (just change the
        # method you call)
        next_page = self.css('a.nav.next.taLnk.ui_button.primary', timeout=1)
        
        if len(next_page) > 0:
            # enqueue a click on the 'next' button; self.parse will be
            # called again on the resulting page
            self.enqueue_click(next_page[0], self.parse)
            
start_url = 'https://www.tripadvisor.com/Hotels-g187791-c2-Rome_Lazio-Hotels.html'
scraper = TripAdvisorScraper()

try:
    scraper.start(start_url)
except:
    # make sure the headless browser is shut down before re-raising
    scraper.quit()
    raise
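The enqueue-based flow above suggests a callback queue under the hood. The following toy sketch (hypothetical, not CHeSF's actual implementation; all names are invented) shows the pattern: instead of calling parse recursively on each "next page" click, the callback is pushed onto a queue, and a start() loop drains it one item at a time.

```python
from collections import deque

class MiniScheduler:
    """Toy illustration of the callback queue a scraping framework needs:
    callbacks enqueue follow-up work instead of recursing, and start()
    drains the queue until nothing is left."""
    def __init__(self):
        self._queue = deque()

    def enqueue(self, callback, *args):
        # defer the call instead of executing it immediately
        self._queue.append((callback, args))

    def start(self, first_callback, *args):
        self.enqueue(first_callback, *args)
        while self._queue:
            callback, args = self._queue.popleft()
            callback(*args)

# Hypothetical usage: paginate through three pages without recursion.
visited = []

def parse(page):
    visited.append(page)
    if page < 3:
        scheduler.enqueue(parse, page + 1)

scheduler = MiniScheduler()
scheduler.start(parse, 1)
print(visited)  # [1, 2, 3]
```

Queueing rather than recursing keeps the call stack flat no matter how many pages are crawled, which is why frameworks expose an enqueue-style API instead of asking the callback to call itself.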

Contacts

In case of questions and/or suggestions, write me a note using my GitHub contact email.

Mini FAQ

Q. Hey man, it absolutely doesn't work! What's wrong?
A. Please check that your ChromeDriver is suitable for the Chrome version you are using.
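A quick way to perform that check is to compare the major version numbers reported by `google-chrome --version` and `chromedriver --version`. The helper below is a hypothetical illustration (the version strings are made up, not taken from a real installation):

```python
import re

def major_version(version_output: str) -> int:
    """Extract the major version from output such as
    'Google Chrome 120.0.6099.109' or 'ChromeDriver 120.0.6099.71'."""
    match = re.search(r"(\d+)\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version found in: {version_output!r}")
    return int(match.group(1))

# Hypothetical outputs from the two commands:
chrome = major_version("Google Chrome 120.0.6099.109")
driver = major_version("ChromeDriver 120.0.6099.71")
print(chrome == driver)  # True: the major versions match, so they are compatible
```

Only the major versions need to agree; the trailing build numbers usually differ between the browser and the driver.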
