maxhumber / Gazpacho

License: MIT
🥫 The simple, fast, and modern web scraping library

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Gazpacho

Nytimes App
🗽 A Simple Demonstration of the New York Times App 📱 using Jsoup web crawler with MVVM Architecture 🔥
Stars: ✭ 246 (-53.14%)
Mutual labels:  hacktoberfest, webscraping
chesf
CHeSF is the Chrome Headless Scraping Framework, very early alpha code for scraping JavaScript-intensive web pages
Stars: ✭ 18 (-96.57%)
Mutual labels:  scraping, webscraping
ioweb
Web Scraping Framework
Stars: ✭ 31 (-94.1%)
Mutual labels:  scraping, webscraping
Torchbear
🔥🐻 The Speakeasy Scripting Engine Which Combines Speed, Safety, and Simplicity
Stars: ✭ 128 (-75.62%)
Mutual labels:  hacktoberfest, scraping
Autoscraper
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
Stars: ✭ 4,077 (+676.57%)
Mutual labels:  scraping, webscraping
Educative.io Downloader
📖 This tool downloads courses from educative.io for offline usage. It uses your login credentials to download the courses.
Stars: ✭ 139 (-73.52%)
Mutual labels:  hacktoberfest, scraping
anime-scraper
[partially working] Scrape and add anime episode stream URLs to uGet (Linux) or IDM (Windows) ~ Python3
Stars: ✭ 21 (-96%)
Mutual labels:  scraping, webscraping
Dataengineeringproject
Example end-to-end data engineering project.
Stars: ✭ 82 (-84.38%)
Mutual labels:  hacktoberfest, scraping
Spidermon
Scrapy extension for monitoring spiders' execution.
Stars: ✭ 309 (-41.14%)
Mutual labels:  hacktoberfest, scraping
schedule-tweet
Schedules tweets using TweetDeck
Stars: ✭ 14 (-97.33%)
Mutual labels:  scraping, webscraping
Geeksforgeeksscrapper
Scrapes g4g and creates PDF
Stars: ✭ 124 (-76.38%)
Mutual labels:  hacktoberfest, webscraping
Ferret
Declarative web scraping
Stars: ✭ 4,837 (+821.33%)
Mutual labels:  hacktoberfest, scraping
Scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Stars: ✭ 42,343 (+7965.33%)
Mutual labels:  hacktoberfest, scraping
Panther
A browser testing and web crawling library for PHP and Symfony
Stars: ✭ 2,480 (+372.38%)
Mutual labels:  hacktoberfest, scraping
Pastepwn
Python framework to scrape Pastebin pastes and analyze them
Stars: ✭ 87 (-83.43%)
Mutual labels:  hacktoberfest, scraping
browser-automation-api
Browser automation API for repetitive web-based tasks, with a friendly user interface. You can use it to scrape content or do many other things like capturing a screenshot, generating a PDF, extracting content, or executing custom Puppeteer or Playwright functions.
Stars: ✭ 24 (-95.43%)
Mutual labels:  scraping, webscraping
Dotnetcrawler
DotnetCrawler is a straightforward, lightweight web crawling/scraping library for Entity Framework Core output, based on .NET Core. The library is designed like other strong crawler libraries such as WebMagic and Scrapy, but is extendable to your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
Stars: ✭ 100 (-80.95%)
Mutual labels:  scraping, webscraping
Parsel
Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
Stars: ✭ 628 (+19.62%)
Mutual labels:  hacktoberfest, scraping
ARGUS
ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of different websites. On the websites, ARGUS is able to perform tasks like scraping texts or collecting hyperlinks between websites. See: https://link.springer.com/article/10.1007/s11192-020-03726-9
Stars: ✭ 68 (-87.05%)
Mutual labels:  scraping, webscraping
Suckit
Suck the InTernet
Stars: ✭ 429 (-18.29%)
Mutual labels:  hacktoberfest, webscraping

gazpacho

About

gazpacho is a simple, fast, and modern web scraping library. The library is stable, actively maintained, and installed with zero dependencies.

Install

Install with pip at the command line:

pip install -U gazpacho

Quickstart

Give this a try:

from gazpacho import get, Soup

url = 'https://scrape.world/books'
html = get(url)
soup = Soup(html)
books = soup.find('div', {'class': 'book-'}, partial=True)

def parse(book):
    name = book.find('h4').text
    price = float(book.find('p').text[1:].split(' ')[0])
    return name, price

[parse(book) for book in books]
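
The comprehension yields a list of (name, price) tuples. As a small follow-up sketch (not part of the original quickstart), the tuples can be collected into a dict keyed by book name:

# hypothetical follow-up: collect the parsed results into {name: price}
results = dict(parse(book) for book in books)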

Tutorial

Import

Import gazpacho following the convention:

from gazpacho import get, Soup

get

Use the get function to download raw HTML:

url = 'https://scrape.world/soup'
html = get(url)
print(html[:50])
# '<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <met'

Adjust get requests with optional params and headers:

get(
    url='https://httpbin.org/anything',
    params={'foo': 'bar', 'bar': 'baz'},
    headers={'User-Agent': 'gazpacho'}
)
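
The params are URL-encoded onto the query string, so the call above should be equivalent to requesting the fully spelled-out URL (an illustrative sketch, not additional API surface):

# the same request with the query string written out by hand
get(
    url='https://httpbin.org/anything?foo=bar&bar=baz',
    headers={'User-Agent': 'gazpacho'}
)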

Soup

Use the Soup wrapper on raw HTML to enable parsing:

soup = Soup(html)

Soup objects can alternatively be initialized with the .get classmethod:

soup = Soup.get(url)
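
The classmethod is shorthand for downloading and wrapping in one step; assuming that equivalence, the two lines below behave the same:

soup = Soup.get(url)   # download and parse in one call
soup = Soup(get(url))  # equivalent two-step version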

.find

Use the .find method to target and extract HTML tags:

h1 = soup.find('h1')
print(h1)
# <h1 id="firstHeading" class="firstHeading" lang="en">Soup</h1>

attrs=

Use the attrs argument to isolate tags that contain specific HTML element attributes:

soup.find('div', attrs={'class': 'section-'})

partial=

Element attributes are partially matched by default. Turn this off by setting partial to False:

soup.find('div', {'class': 'soup'}, partial=False)
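
In other words, partial matching only requires the attribute value to contain the supplied string; a hedged sketch of the difference:

soup.find('div', {'class': 'soup'})                 # partial=True (default): would also match a hypothetical class="soup-section"
soup.find('div', {'class': 'soup'}, partial=False)  # matches class="soup" exactly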

mode=

Override the mode argument {'auto', 'first', 'all'} to guarantee return behaviour:

print(soup.find('span', mode='first'))
# <span class="navbar-toggler-icon"></span>
len(soup.find('span', mode='all'))
# 8
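
With the default mode='auto', the return type depends on the number of matches (summarized here as an assumption about the default behaviour):

result = soup.find('span')  # mode='auto' is the default
# returns a single Soup object when exactly one tag matches,
# a list when several match, and None when nothing matches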

dir()

Soup objects have html, tag, attrs, and text attributes:

dir(h1)
# ['attrs', 'find', 'get', 'html', 'strip', 'tag', 'text']

Use them accordingly:

print(h1.html)
# '<h1 id="firstHeading" class="firstHeading" lang="en">Soup</h1>'
print(h1.tag)
# h1
print(h1.attrs)
# {'id': 'firstHeading', 'class': 'firstHeading', 'lang': 'en'}
print(h1.text)
# Soup
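
These attributes compose with mode='all'; for example, a small hypothetical sketch that collects the href of every link on the page:

# hypothetical: gather the href attribute of every <a> tag
links = soup.find('a', mode='all')
hrefs = [link.attrs.get('href') for link in links]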

Support

If you use gazpacho, consider adding the scraper: gazpacho badge to your project README.md:

[![scraper: gazpacho](https://img.shields.io/badge/scraper-gazpacho-C6422C)](https://github.com/maxhumber/gazpacho)

Contribute

For feature requests or bug reports, please use GitHub Issues.

For PRs, please read the CONTRIBUTING.md document.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].