nicodds / chesf

License: Apache-2.0
CHeSF is the Chrome Headless Scraping Framework, very early alpha code for scraping JavaScript-intensive web pages.


Introduction

In the era of Big Data, the web is an endless source of information. For this reason, there are plenty of good tools and frameworks for scraping web pages.

So, in an ideal world, there should be no need for yet another web scraping framework. Nevertheless, there are always subtle differences between theory and practice, and web scraping is no exception.

Real-world web pages are often full of JavaScript code that alters the DOM as the user navigates. Consequently, scraping JavaScript-intensive pages with traditional tools can be difficult or outright impossible.
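To make this concrete, here is a small self-contained illustration (plain Python, no CHeSF or browser involved; the page and review text are invented). The visible content of the page below exists only after a browser executes the script, so a traditional fetch-and-parse scraper sees nothing but an empty placeholder.

```python
from html.parser import HTMLParser

# Invented example page: the review markup is generated by JavaScript,
# so the raw HTML contains only an empty placeholder <div>.
raw_html = """
<html><body>
  <div id="reviews"></div>
  <script>
    document.getElementById("reviews").innerHTML = "<p>Great hotel!</p>";
  </script>
</body></html>
"""

class VisibleText(HTMLParser):
    """Collects the text a static parser would see (script bodies excluded)."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script:
            self.chunks.append(data)

parser = VisibleText()
parser.feed(raw_html)
visible = "".join(parser.chunks)

# A static scraper never sees the review text, because it only exists
# in the DOM after the script has run.
print("Great hotel!" in visible)  # False
```

A headless browser, by contrast, executes the script before exposing the DOM, so the populated page is what the scraper queries.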

These considerations sparked the birth of CHeSF, the Chrome Headless Scraping Framework. To make a long story short, CHeSF relies on both selenium-python and ChromeDriver to scrape web pages even when JavaScript would otherwise make it impossible.

I know that some nice solutions to this problem already exist, but in my view CHeSF is simpler: you just create a class that inherits from it, define a parse method, and launch it with a start URL.

The framework is still very much alpha, so you should expect things to change rapidly. Currently, there is no documentation and no packaging. There is just an example showing how you can use the framework to easily scrape TripAdvisor reviews. Personally, I used it to collect this dataset, i.e. a collection of more than 220k TripAdvisor reviews.

Basic usage

CHeSF borrows its working philosophy (in part) from Scrapy: writing a scraping tool means creating (at least) a Python class.

import sys
import os

# the path to the chrome driver executable
path_to_chrome_driver_exe = 'path_to_chromedriver.exe'
# currently, no package exists for CHeSF, so use this sys.path hack
# until I find some free time to implement packaging
path_to_chesf = 'path_to_chesf_in_your_system'

sys.path.insert(0, os.path.abspath(path_to_chesf))
from chesf import CHeSF, MAX_ATTEMPTS

class TripAdvisorScraper(CHeSF):
    def __init__(self):
        super().__init__(path_to_chrome_driver_exe, debug=False)
        

    # this is the core of the scraper: you must define parse(), since by
    # convention it is the callback invoked on the first url passed;
    # afterwards, you can register other callbacks
    def parse(self):
        # the main advantage of CHeSF is that you can use JavaScript
        # directly to parse the page
        script = """
            let urls = [];
            let anchors = document.querySelectorAll("a.property_title.prominent");

            for (let a of anchors)
                urls.push(a.href);

            return urls;
        """

        # the array returned by the javascript is automagically
        # converted to a python list (this is selenium magic)
        links = self.call_js(script)

        for link in links:
            print(link)

        # you can use both xpath and css selectors (just change the
        # method you call)
        next_page = self.css('a.nav.next.taLnk.ui_button.primary', timeout=1)
        
        if len(next_page) > 0:
            # enqueue a click on the 'next' button; self.parse will be
            # called again on the resulting page
            self.enqueue_click(next_page[0], self.parse)
            
start_url = 'https://www.tripadvisor.com/Hotels-g187791-c2-Rome_Lazio-Hotels.html'
scraper = TripAdvisorScraper()

try:
    scraper.start(start_url)
except:
    # make sure the headless browser is shut down before re-raising
    scraper.quit()
    raise
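The enqueue-based flow above suggests a callback queue under the hood. The following toy sketch (hypothetical, not CHeSF's actual implementation; all names are invented) shows the pattern: instead of calling parse recursively on each "next page" click, the callback is pushed onto a queue, and a start() loop drains it one item at a time.

```python
from collections import deque

class MiniScheduler:
    """Toy illustration of the callback queue a scraping framework needs:
    callbacks enqueue follow-up work instead of recursing, and start()
    drains the queue until nothing is left."""
    def __init__(self):
        self._queue = deque()

    def enqueue(self, callback, *args):
        # defer the call instead of executing it immediately
        self._queue.append((callback, args))

    def start(self, first_callback, *args):
        self.enqueue(first_callback, *args)
        while self._queue:
            callback, args = self._queue.popleft()
            callback(*args)

# Hypothetical usage: paginate through three pages without recursion.
visited = []

def parse(page):
    visited.append(page)
    if page < 3:
        scheduler.enqueue(parse, page + 1)

scheduler = MiniScheduler()
scheduler.start(parse, 1)
print(visited)  # [1, 2, 3]
```

Queueing rather than recursing keeps the call stack flat no matter how many pages are crawled, which is why frameworks expose an enqueue-style API instead of asking the callback to call itself.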

Contacts

In case of questions and/or suggestions, write me a note using my GitHub contact email.

Mini FAQ

Q. Hey man, it absolutely doesn't work! What's wrong?
A. Please check that your ChromeDriver is suitable for the Chrome version you are using.
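A quick way to perform that check is to compare the major version numbers reported by `google-chrome --version` and `chromedriver --version`. The helper below is a hypothetical illustration (the version strings are made up, not taken from a real installation):

```python
import re

def major_version(version_output: str) -> int:
    """Extract the major version from output such as
    'Google Chrome 120.0.6099.109' or 'ChromeDriver 120.0.6099.71'."""
    match = re.search(r"(\d+)\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version found in: {version_output!r}")
    return int(match.group(1))

# Hypothetical outputs from the two commands:
chrome = major_version("Google Chrome 120.0.6099.109")
driver = major_version("ChromeDriver 120.0.6099.71")
print(chrome == driver)  # True: the major versions match, so they are compatible
```

Only the major versions need to agree; the trailing build numbers usually differ between the browser and the driver.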
