stummjr / scrapy-fieldstats

License: MIT
A Scrapy extension to log item field coverage when the spider shuts down

Programming Languages

python
139,335 projects - #7 most used programming language

Projects that are alternatives of or similar to scrapy-fieldstats

ARGUS
ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of websites, performing tasks such as scraping text or collecting hyperlinks between sites. See: https://link.springer.com/article/10.1007/s11192-020-03726-9
Stars: ✭ 68 (+300%)
Mutual labels:  scraping, crawling, scrapy
Dotnetcrawler
DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, built on .NET Core. It is designed in the spirit of strong crawler libraries such as WebMagic and Scrapy, but is extensible to your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
Stars: ✭ 100 (+488.24%)
Mutual labels:  scraping, crawling, scrapy
double-agent
A test suite of common scraper detection techniques. See how detectable your scraper stack is.
Stars: ✭ 123 (+623.53%)
Mutual labels:  scraping, crawling, scrapy
scrapy-distributed
A series of distributed components for Scrapy. Including RabbitMQ-based components, Kafka-based components, and RedisBloom-based components for Scrapy.
Stars: ✭ 38 (+123.53%)
Mutual labels:  scraping, crawling, scrapy
Easy Scraping Tutorial
Simple but useful Python web scraping tutorial code.
Stars: ✭ 583 (+3329.41%)
Mutual labels:  scraping, crawling, scrapy
Grawler
Grawler is a tool written in PHP which comes with a web interface that automates the task of using google dorks, scrapes the results, and stores them in a file.
Stars: ✭ 98 (+476.47%)
Mutual labels:  scraping, crawling
scrapy-wayback-machine
A Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
Stars: ✭ 92 (+441.18%)
Mutual labels:  scrapy, scrapy-extension
Scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Stars: ✭ 42,343 (+248976.47%)
Mutual labels:  scraping, crawling
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (+905.88%)
Mutual labels:  scraping, crawling
Lulu
[Unmaintained] A simple and clean video/music/image downloader 👾
Stars: ✭ 789 (+4541.18%)
Mutual labels:  scraping, crawling
Seleniumcrawler
An example using Selenium WebDriver for Python and the Scrapy framework to create a web scraper that crawls an ASP site
Stars: ✭ 117 (+588.24%)
Mutual labels:  scraping, scrapy
diffbot-php-client
[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library
Stars: ✭ 53 (+211.76%)
Mutual labels:  scraping, crawling
Email Extractor
The main functionality is to extract all the emails from one or several URLs
Stars: ✭ 81 (+376.47%)
Mutual labels:  scraping, scrapy
Django Dynamic Scraper
Creating Scrapy scrapers via the Django admin interface
Stars: ✭ 1,024 (+5923.53%)
Mutual labels:  scraping, scrapy
Scrapy Cluster
This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
Stars: ✭ 921 (+5317.65%)
Mutual labels:  scraping, scrapy
Awesome Puppeteer
A curated list of awesome puppeteer resources.
Stars: ✭ 1,728 (+10064.71%)
Mutual labels:  scraping, crawling
Memorious
Distributed crawling framework for documents and structured data.
Stars: ✭ 248 (+1358.82%)
Mutual labels:  scraping, crawling
Colly
Elegant Scraper and Crawler Framework for Golang
Stars: ✭ 15,535 (+91282.35%)
Mutual labels:  scraping, crawling
RARBG-scraper
With Selenium headless browsing and CAPTCHA solving
Stars: ✭ 38 (+123.53%)
Mutual labels:  scraping, scrapy
Headless Chrome Crawler
Distributed crawler powered by Headless Chrome
Stars: ✭ 5,129 (+30070.59%)
Mutual labels:  scraping, crawling

Scrapy FieldStats

A Scrapy extension that generates a summary of fields coverage from your scraped data.

What?

Upon finishing a job, Scrapy prints some useful stats about it, such as the number of requests, responses, and scraped items.

However, it's often useful to also have an overview of the field coverage of the scraped items. Say you want to know the percentage of items missing the price field. That's when this extension comes into play!
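The idea behind such a report can be sketched in a few lines of plain Python. This is only an illustration of what "field coverage" means, not the extension's actual internals (and it handles flat items only, not nested fields):

```python
from collections import Counter

def field_coverage(items):
    """Return, for each field, the percentage of items in which it is present."""
    counts = Counter()
    for item in items:
        for field in item:
            counts[field] += 1
    total = len(items)
    return {field: f"{100.0 * n / total:.1f}%" for field, n in counts.items()}

items = [
    {"title": "A", "price": 10},
    {"title": "B", "price": 12},
    {"title": "C"},               # price is missing here
    {"title": "D", "price": 9},
]
print(field_coverage(items))  # {'title': '100.0%', 'price': '75.0%'}
```

With 1 of 4 items missing price, the report shows 75.0% coverage for that field, which is exactly the kind of summary the extension logs at the end of a job.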

Check out an example:

$ scrapy crawl example
2017-10-12 11:10:10 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: examplebot)
...
2017-10-12 11:10:20 [scrapy_fieldstats.fieldstats] INFO: Field stats:
{
    'author': {
        'name': '100.0%',
        'age':  '52.0%'
    },
    'image':  '97.0%',
    'title':  '100.0%',
    'price':  '92.0%',
    'stars':  '47.5%'
}
2017-10-12 11:10:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
...

Installation

First, pip install this package:

$ pip install scrapy-fieldstats

Usage

Enable the extension in your project's settings.py file by adding the following lines:

EXTENSIONS = {
    'scrapy_fieldstats.fieldstats.FieldStatsExtension': 10,
}
FIELDSTATS_ENABLED = True

That's all! Now run your job and have a look at the field stats.

Settings

The settings below can be defined like any other Scrapy setting, as described in the Scrapy docs.

  • FIELDSTATS_ENABLED: enables/disables the extension.
  • FIELDSTATS_COUNTS_ONLY: when True, the extension outputs absolute counts instead of percentages.
  • FIELDSTATS_SKIP_NONE: when True, None values are not counted as existing values for fields.
  • FIELDSTATS_ADD_TO_STATS: when True, the extension adds the field coverage report to the job stats.
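Putting it all together, a settings.py might look like the sketch below. The specific True/False values chosen here are just examples, not defaults:

```python
# settings.py (example values)
EXTENSIONS = {
    'scrapy_fieldstats.fieldstats.FieldStatsExtension': 10,
}
FIELDSTATS_ENABLED = True
FIELDSTATS_COUNTS_ONLY = False   # report percentages rather than raw counts
FIELDSTATS_SKIP_NONE = True      # treat None as a missing value
FIELDSTATS_ADD_TO_STATS = True   # include the report in the job stats
```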

Contributing

If you spot a bug or want to propose a new feature, please create an issue in this project's issue tracker.