elacuesta / Scrapy Pyppeteer

License: BSD-3-Clause
Pyppeteer integration for Scrapy

Programming Languages

  • python: 139,335 projects (#7 most used programming language)
  • python3: 1,442 projects

Projects that are alternatives to or similar to Scrapy Pyppeteer

House Renting
Possibly the best practice of Scrapy 🕷 and renting a house 🏡
Stars: ✭ 741 (+1443.75%)
Mutual labels:  scrapy
Voyages Sncf Api
A Scrapy spider that scrapes times and prices from Voyages Sncf. It uses scrapyrt to provide an API interface.
Stars: ✭ 7 (-85.42%)
Mutual labels:  scrapy
Articlespider
Source code from the imooc.com Python distributed crawler course; updated and maintained over the long term.
Stars: ✭ 40 (-16.67%)
Mutual labels:  scrapy
Py3 scripts
Life is short, *****.
Stars: ✭ 5 (-89.58%)
Mutual labels:  scrapy
Mailinglistscraper
A python web scraper for public email lists.
Stars: ✭ 19 (-60.42%)
Mutual labels:  scrapy
Jspider
JSpider publishes the JS decryption approach for at least one website every week. Stars are welcome; WeChat for discussion: 13298307816
Stars: ✭ 914 (+1804.17%)
Mutual labels:  scrapy
Tweetscraper
TweetScraper is a simple crawler/spider for Twitter Search without using the API
Stars: ✭ 694 (+1345.83%)
Mutual labels:  scrapy
Pixiv Crawler
A multi-functional pixiv crawler built on the Scrapy framework
Stars: ✭ 46 (-4.17%)
Mutual labels:  scrapy
Scrapy Cluster
This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
Stars: ✭ 921 (+1818.75%)
Mutual labels:  scrapy
App comments spider
Crawls game reviews from Baidu Tieba, TapTap, the App Store and official Weibo bloggers (based on redis_scrapy); the filter uses a Bloom filter.
Stars: ✭ 38 (-20.83%)
Mutual labels:  scrapy
Seeker
Seeker - another job board aggregator.
Stars: ✭ 16 (-66.67%)
Mutual labels:  scrapy
Pdf downloader
A Scrapy Spider for downloading PDF files from a webpage.
Stars: ✭ 18 (-62.5%)
Mutual labels:  scrapy
Place2live
Analysis of the characteristics of different countries
Stars: ✭ 30 (-37.5%)
Mutual labels:  scrapy
Funpyspidersearchengine
Word2vec personalized search + Scrapy 2.3.0 (data crawling) + ElasticSearch 7.9.1 (data storage with an external RESTful API) + Django 3.1.1 search
Stars: ✭ 782 (+1529.17%)
Mutual labels:  scrapy
Crawlab
Distributed web crawler admin platform for spider management, regardless of language or framework.
Stars: ✭ 8,392 (+17383.33%)
Mutual labels:  scrapy
Jd spider
A pair of silly but cute distributed crawlers for JD.com.
Stars: ✭ 738 (+1437.5%)
Mutual labels:  scrapy
Scrapy Azuresearch Crawler Samples
Scrapy as a Web Crawler for Azure Search Samples
Stars: ✭ 20 (-58.33%)
Mutual labels:  scrapy
Wescraper
Uses Scrapy and Sogou search to crawl WeChat official account articles
Stars: ✭ 46 (-4.17%)
Mutual labels:  scrapy
Django Dynamic Scraper
Creating Scrapy scrapers via the Django admin interface
Stars: ✭ 1,024 (+2033.33%)
Mutual labels:  scrapy
Scrapymon
Simple Web UI for Scrapy spider management via Scrapyd
Stars: ✭ 35 (-27.08%)
Mutual labels:  scrapy

Unmaintained

If you need browser integration for Scrapy, please consider using scrapy-playwright instead.


Pyppeteer integration for Scrapy

This project provides a Scrapy Download Handler which performs requests using Pyppeteer. It can be used to handle pages that require JavaScript. This package does not interfere with regular Scrapy workflows such as request scheduling or item processing.

Motivation

After the release of version 2.0, which includes partial coroutine syntax support and experimental asyncio support, Scrapy allows integrating asyncio-based projects such as Pyppeteer.

Requirements

  • Python 3.6+
  • Scrapy 2.0+
  • Pyppeteer 0.0.23+

Installation

$ pip install scrapy-pyppeteer

Configuration

Replace the default http and https Download Handlers through DOWNLOAD_HANDLERS:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_pyppeteer.handler.ScrapyPyppeteerDownloadHandler",
    "https": "scrapy_pyppeteer.handler.ScrapyPyppeteerDownloadHandler",
}

Note that the ScrapyPyppeteerDownloadHandler class inherits from the default http/https handler, and it will only use Pyppeteer for requests that are explicitly marked (see the "Basic usage" section for details).
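
For instance, in a hypothetical spider like the following sketch (names and URLs are illustrative), only the second request is routed through Pyppeteer; the first one is handled by the regular downloader:

import scrapy

class MixedSpider(scrapy.Spider):
    # Hypothetical spider mixing regular and Pyppeteer-backed requests
    name = "mixed"

    def start_requests(self):
        # No "pyppeteer" meta key: handled by the regular Scrapy downloader
        yield scrapy.Request("https://example.org/plain")
        # Explicitly marked: handled by Pyppeteer
        yield scrapy.Request("https://example.org/js", meta={"pyppeteer": True})

    def parse(self, response):
        yield {"url": response.url}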

Also, be sure to install the asyncio-based Twisted reactor:

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

scrapy-pyppeteer accepts the following settings:

  • PYPPETEER_LAUNCH_OPTIONS (type dict, default {})

    A dictionary with options to be passed when launching the Browser. See the docs for pyppeteer.launcher.launch

  • PYPPETEER_NAVIGATION_TIMEOUT (type Optional[int], default None)

    Default timeout (in milliseconds) to be used when requesting pages by Pyppeteer. If None or unset, the default value will be used (30000 ms at the time of writing this). See the docs for pyppeteer.page.Page.setDefaultNavigationTimeout

  • PYPPETEER_PAGE_COROUTINE_TIMEOUT (type Optional[Union[int, float]], default None)

    Default timeout (in milliseconds) to be passed when using page coroutines, such as waitForSelector or waitForXPath. If None or unset, the default value will be used (30000 ms at the time of writing this).
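
Taken together, a minimal settings.py sketch could look like the following; the launch options and timeout values shown here are arbitrary examples, not recommendations:

# settings.py (illustrative values)
DOWNLOAD_HANDLERS = {
    "http": "scrapy_pyppeteer.handler.ScrapyPyppeteerDownloadHandler",
    "https": "scrapy_pyppeteer.handler.ScrapyPyppeteerDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Passed to pyppeteer.launcher.launch
PYPPETEER_LAUNCH_OPTIONS = {"headless": True, "args": ["--no-sandbox"]}
# Both timeouts are expressed in milliseconds
PYPPETEER_NAVIGATION_TIMEOUT = 10 * 1000
PYPPETEER_PAGE_COROUTINE_TIMEOUT = 10 * 1000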

Basic usage

Set the pyppeteer Request.meta key to download a request using Pyppeteer:

import scrapy

class AwesomeSpider(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        # GET request
        yield scrapy.Request("https://httpbin.org/get", meta={"pyppeteer": True})
        # POST request
        yield scrapy.FormRequest(
            url="https://httpbin.org/post",
            formdata={"foo": "bar"},
            meta={"pyppeteer": True},
        )

    def parse(self, response):
        # 'response' contains the page as seen by the browser
        yield {"url": response.url}

Page coroutines

A sorted iterable (list, tuple or dict, for instance) can be passed in the pyppeteer_page_coroutines Request.meta key to request coroutines to be awaited on the Page before the final Response is returned to the callback.

This is useful when you need to perform certain actions on a page, like scrolling down or clicking links, and you want everything to count as a single Scrapy Response, containing the final result.
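
For instance, a minimal spider passing a single page coroutine could look like the following sketch (the URL and selector are placeholders):

import scrapy
from scrapy_pyppeteer.page import PageCoroutine

class WaitForSelectorSpider(scrapy.Spider):
    name = "wait"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.org",
            meta={
                "pyppeteer": True,
                "pyppeteer_page_coroutines": [
                    # awaited on the Page before the Response reaches the callback
                    PageCoroutine("waitForSelector", "h1"),
                ],
            },
        )

    def parse(self, response):
        yield {"heading": response.css("h1::text").get()}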

Supported actions

  • scrapy_pyppeteer.page.PageCoroutine(method: str, *args, **kwargs):

    Represents a coroutine to be awaited on a pyppeteer.page.Page object, such as "click", "screenshot", "evaluate", etc. method should be the name of the coroutine, *args and **kwargs are passed to the function call.

    The coroutine result will be stored in the PageCoroutine.result attribute

    For instance,

    PageCoroutine("screenshot", options={"path": "quotes.png", "fullPage": True})
    

    produces the same effect as:

    # 'page' is a pyppeteer.page.Page object
    await page.screenshot(options={"path": "quotes.png", "fullPage": True})
    
  • scrapy_pyppeteer.page.NavigationPageCoroutine(method: str, *args, **kwargs):

    Subclass of PageCoroutine. It waits for a navigation event: use this when you know a coroutine will trigger a navigation event, for instance when clicking on a link. This forces a Page.waitForNavigation() call wrapped in asyncio.gather, as recommended in the Pyppeteer docs.

    For instance,

    NavigationPageCoroutine("click", selector="a")
    

    produces the same effect as:

    # 'page' is a pyppeteer.page.Page object
    await asyncio.gather(
        page.waitForNavigation(),
        page.click(selector="a"),
    )
    

Receiving the Page object in the callback

Specifying pyppeteer.page.Page as the type for a callback argument will result in the corresponding Page object being injected into the callback. In order to be able to await coroutines on the provided Page object, the callback needs to be defined as a coroutine function (async def).

import scrapy
import pyppeteer

class AwesomeSpiderWithPage(scrapy.Spider):
    name = "page"

    def start_requests(self):
        yield scrapy.Request("https://example.org", meta={"pyppeteer": True})

    async def parse(self, response, page: pyppeteer.page.Page):
        title = await page.title()  # "Example Domain"
        yield {"title": title}
        await page.close()

Notes:

  • In order to avoid memory issues, it is recommended to manually close the page by awaiting the Page.close coroutine.
  • Any network operations resulting from awaiting a coroutine on a Page object (goto, goBack, etc) will be executed directly by Pyppeteer, bypassing the Scrapy request workflow (Scheduler, Middlewares, etc).
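
For example, in the following sketch (URLs are placeholders) the goto call is performed directly by Pyppeteer and never reaches the Scheduler or the Middlewares:

import pyppeteer
import scrapy

class DirectNavigationSpider(scrapy.Spider):
    name = "direct"

    def start_requests(self):
        yield scrapy.Request("https://example.org", meta={"pyppeteer": True})

    async def parse(self, response, page: pyppeteer.page.Page):
        # Navigation handled directly by Pyppeteer, bypassing the Scrapy request workflow
        await page.goto("https://example.org/other")
        title = await page.title()
        await page.close()  # close the page to avoid memory issues
        yield {"title": title}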

Examples

Click on a link, save the resulting page as PDF

import scrapy
from scrapy_pyppeteer.page import PageCoroutine, NavigationPageCoroutine

class ClickAndSavePdfSpider(scrapy.Spider):
    name = "pdf"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            meta=dict(
                pyppeteer=True,
                pyppeteer_page_coroutines={
                    "click": NavigationPageCoroutine("click", selector="a"),
                    "pdf": PageCoroutine("pdf", options={"path": "/tmp/file.pdf"}),
                },
            ),
        )

    def parse(self, response):
        pdf_bytes = response.meta["pyppeteer_page_coroutines"]["pdf"].result
        with open("iana.pdf", "wb") as fp:
            fp.write(pdf_bytes)
        yield {"url": response.url}  # response.url is "https://www.iana.org/domains/reserved"

Scroll down on an infinite scroll page, take a screenshot of the full page

import scrapy
import pyppeteer
from scrapy_pyppeteer.page import PageCoroutine

class ScrollSpider(scrapy.Spider):
    name = "scroll"

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/scroll",
            meta=dict(
                pyppeteer=True,
                pyppeteer_page_coroutines=[
                    PageCoroutine("waitForSelector", "div.quote"),
                    PageCoroutine("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                    PageCoroutine("waitForSelector", "div.quote:nth-child(11)"),  # 10 per page
                    PageCoroutine("screenshot", options={"path": "quotes.png", "fullPage": True}),
                ],
            ),
        )

    def parse(self, response):
        return {"quote_count": len(response.css("div.quote"))}

Acknowledgements

This project was inspired by:
