
povilasb / scrapy-html-storage

License: MIT
Scrapy downloader middleware that stores response HTMLs to disk.

Programming Languages

python
139335 projects - #7 most used programming language
Makefile
30231 projects

Projects that are alternatives of or similar to scrapy-html-storage

estate-crawler
Scraping the real estate agencies for up-to-date house listings as soon as they arrive!
Stars: ✭ 20 (+17.65%)
Mutual labels:  scrapy
Web-Iota
Iota is a web scraper which can find all of the images and links/suburls on a webpage
Stars: ✭ 60 (+252.94%)
Mutual labels:  scrapy
double-agent
A test suite of common scraper detection techniques. See how detectable your scraper stack is.
Stars: ✭ 123 (+623.53%)
Mutual labels:  scrapy
pagser
Pagser is a simple, extensible, configurable library for Golang crawlers that parses and deserializes HTML pages into structs, based on goquery and struct tags
Stars: ✭ 82 (+382.35%)
Mutual labels:  scrapy
scrapy-rotated-proxy
A scrapy middleware to use rotated proxy ip list.
Stars: ✭ 22 (+29.41%)
Mutual labels:  scrapy
crawler
A collection of Python crawler projects
Stars: ✭ 29 (+70.59%)
Mutual labels:  scrapy
Spider job
A data crawler for job-listing websites
Stars: ✭ 234 (+1276.47%)
Mutual labels:  scrapy
itemadapter
Common interface for data container classes
Stars: ✭ 47 (+176.47%)
Mutual labels:  scrapy
scrapy helper
Dynamically configurable crawler
Stars: ✭ 84 (+394.12%)
Mutual labels:  scrapy
Scrape-Finance-Data
My code for scraping financial data in Vietnam
Stars: ✭ 13 (-23.53%)
Mutual labels:  scrapy
lgcrawl
Crawls job listings from the whole Lagou site with python + scrapy + splash
Stars: ✭ 22 (+29.41%)
Mutual labels:  scrapy
Scrapy-tripadvisor-reviews
Using scrapy to scrape tripadvisor in order to get users' reviews.
Stars: ✭ 24 (+41.18%)
Mutual labels:  scrapy
vietnam-ecommerce-crawler
Crawling the data from lazada, websosanh, compare.vn, cdiscount and cungmua with flexible configs
Stars: ✭ 28 (+64.71%)
Mutual labels:  scrapy
domains
World’s single largest Internet domains dataset
Stars: ✭ 461 (+2611.76%)
Mutual labels:  scrapy
ArticleSpider
Crawling Zhihu, Jobbole, and Lagou with Scrapy, and using Elasticsearch + Django to build a search-engine website. See README_zh.md (covering the implementation roadmap, distributed crawling, and coping with anti-crawling strategies).
Stars: ✭ 34 (+100%)
Mutual labels:  scrapy
Awesome crawl
Crawlers for Tencent News, Zhihu topics, Weibo followers, Tumblr, Douyu danmaku, and Meizitu, plus distributed crawler design, etc.
Stars: ✭ 246 (+1347.06%)
Mutual labels:  scrapy
asyncpy
A lightweight asynchronous coroutine web crawler framework built with asyncio and aiohttp
Stars: ✭ 86 (+405.88%)
Mutual labels:  scrapy
fernando-pessoa
A classifier of Fernando Pessoa's poems according to his heteronyms
Stars: ✭ 31 (+82.35%)
Mutual labels:  scrapy
scrapy-wayback-machine
A Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
Stars: ✭ 92 (+441.18%)
Mutual labels:  scrapy
scrapy-LBC
A LeBonCoin spider built with Scrapy and ElasticSearch
Stars: ✭ 14 (-17.65%)
Mutual labels:  scrapy

About

Badges: Travis CI build status, Coveralls test coverage.

This is a Scrapy downloader middleware that stores response HTML to disk.

Usage

Turn the downloader middleware on, e.g. by specifying it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_html_storage.HtmlStorageMiddleware': 10,
}

By default, no responses are saved to disk. You must mark the requests whose response HTML should be saved:

def parse(self, response):
    """Processes start URLs.

    Args:
        response (HtmlResponse): Scrapy HTML response object.
    """
    yield scrapy.Request(
        'http://target.com',
        callback=self.parse_target,
        meta={
            'save_html': True,
        }
    )

The file path where the HTML will be stored is resolved by calling the spider method response_html_path, e.g.:

class TargetSpider(scrapy.Spider):
    def response_html_path(self, request):
        """
        Args:
            request (scrapy.http.request.Request): request that produced the
                response.
        """
        return 'html/last_response.html'
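A fixed path like the one above overwrites the file on every response; in practice you likely want a unique path per request. A minimal sketch of how such a path could be derived from the request URL (the helper below is hypothetical and only uses the standard library; the actual middleware calls the spider's response_html_path(request) method):

```python
import hashlib
from urllib.parse import urlparse


def response_html_path(request_url: str) -> str:
    """Derive a unique file path from a request URL (illustrative only).

    The host becomes a subdirectory and a short hash of the full URL
    keeps file names unique and filesystem-safe.
    """
    host = urlparse(request_url).netloc
    digest = hashlib.sha1(request_url.encode('utf-8')).hexdigest()[:12]
    return 'html/{}/{}.html'.format(host, digest)
```

Inside a spider, the same logic would go into response_html_path(self, request), using request.url instead of a raw string.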

Configuration

The HTML storage downloader middleware supports the following options:

  • gzip_output (bool) - if True, HTML output is stored gzip-compressed. Defaults to False.
  • save_html_on_status (list) - if not empty, a whitelist of response status codes for which HTML is saved. If empty or not provided, HTML is saved for all status codes.

Sample:

HTML_STORAGE = {
    'gzip_output': True,
    'save_html_on_status': [200, 202]
}
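With gzip_output enabled, the saved files are gzip-compressed, so reading one back is a matter of the standard-library gzip module. A minimal sketch (the file path is an assumption, matching whatever your response_html_path returns):

```python
import gzip


def read_saved_html(path: str) -> str:
    """Read back a gzip-compressed HTML file saved by the middleware."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        return f.read()
```

For example, read_saved_html('html/last_response.html') would return the decompressed page source as a string.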