stummjr / scrapy-fieldstats

License: MIT
A Scrapy extension to log item field coverage when the spider shuts down

Programming Languages

python
139,335 projects - #7 most used programming language

Projects that are alternatives of or similar to scrapy-fieldstats

ARGUS
ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of websites, performing tasks such as scraping text or collecting hyperlinks between sites. See: https://link.springer.com/article/10.1007/s11192-020-03726-9
Stars: ✭ 68 (+300%)
Mutual labels:  scraping, crawling, scrapy
Dotnetcrawler
DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, built on .NET Core. It is designed in the spirit of strong crawler libraries such as WebMagic and Scrapy, but is extensible to your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
Stars: ✭ 100 (+488.24%)
Mutual labels:  scraping, crawling, scrapy
double-agent
A test suite of common scraper detection techniques. See how detectable your scraper stack is.
Stars: ✭ 123 (+623.53%)
Mutual labels:  scraping, crawling, scrapy
scrapy-distributed
A series of distributed components for Scrapy. Including RabbitMQ-based components, Kafka-based components, and RedisBloom-based components for Scrapy.
Stars: ✭ 38 (+123.53%)
Mutual labels:  scraping, crawling, scrapy
Easy Scraping Tutorial
Simple but useful Python web scraping tutorial code.
Stars: ✭ 583 (+3329.41%)
Mutual labels:  scraping, crawling, scrapy
Grawler
Grawler is a tool written in PHP which comes with a web interface that automates the task of using google dorks, scrapes the results, and stores them in a file.
Stars: ✭ 98 (+476.47%)
Mutual labels:  scraping, crawling
scrapy-wayback-machine
A Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
Stars: ✭ 92 (+441.18%)
Mutual labels:  scrapy, scrapy-extension
Scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Stars: ✭ 42,343 (+248976.47%)
Mutual labels:  scraping, crawling
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (+905.88%)
Mutual labels:  scraping, crawling
Lulu
[Unmaintained] A simple and clean video/music/image downloader 👾
Stars: ✭ 789 (+4541.18%)
Mutual labels:  scraping, crawling
Seleniumcrawler
An example using Selenium WebDriver for Python and the Scrapy framework to create a web scraper that crawls an ASP site
Stars: ✭ 117 (+588.24%)
Mutual labels:  scraping, scrapy
diffbot-php-client
[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library
Stars: ✭ 53 (+211.76%)
Mutual labels:  scraping, crawling
Email Extractor
The main functionality is to extract all the emails from one or several URLs
Stars: ✭ 81 (+376.47%)
Mutual labels:  scraping, scrapy
Django Dynamic Scraper
Creating Scrapy scrapers via the Django admin interface
Stars: ✭ 1,024 (+5923.53%)
Mutual labels:  scraping, scrapy
Scrapy Cluster
This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
Stars: ✭ 921 (+5317.65%)
Mutual labels:  scraping, scrapy
Awesome Puppeteer
A curated list of awesome puppeteer resources.
Stars: ✭ 1,728 (+10064.71%)
Mutual labels:  scraping, crawling
Memorious
Distributed crawling framework for documents and structured data.
Stars: ✭ 248 (+1358.82%)
Mutual labels:  scraping, crawling
Colly
Elegant Scraper and Crawler Framework for Golang
Stars: ✭ 15,535 (+91282.35%)
Mutual labels:  scraping, crawling
RARBG-scraper
With Selenium headless browsing and CAPTCHA solving
Stars: ✭ 38 (+123.53%)
Mutual labels:  scraping, scrapy
Headless Chrome Crawler
Distributed crawler powered by Headless Chrome
Stars: ✭ 5,129 (+30070.59%)
Mutual labels:  scraping, crawling

Scrapy FieldStats

A Scrapy extension that generates a summary of fields coverage from your scraped data.

What?

Upon finishing a job, Scrapy prints some useful stats about it, such as the number of requests, responses, and scraped items.

However, it's often useful to also have an overview of the field coverage of the scraped items. Say you want to know the percentage of items missing the price field. That's when this extension comes into play!
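The idea behind such a report can be sketched in a few lines of plain Python. This is only an illustration of what "field coverage" means, not the extension's actual internals (and it handles flat items only, not nested fields):

```python
from collections import Counter

def field_coverage(items):
    """Return, for each field, the percentage of items in which it is present."""
    counts = Counter()
    for item in items:
        for field in item:
            counts[field] += 1
    total = len(items)
    return {field: f"{100.0 * n / total:.1f}%" for field, n in counts.items()}

items = [
    {"title": "A", "price": 10},
    {"title": "B", "price": 12},
    {"title": "C"},               # price is missing here
    {"title": "D", "price": 9},
]
print(field_coverage(items))  # {'title': '100.0%', 'price': '75.0%'}
```

With 1 of 4 items missing price, the report shows 75.0% coverage for that field, which is exactly the kind of summary the extension logs at the end of a job.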

Check out an example:

$ scrapy crawl example
2017-10-12 11:10:10 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: examplebot)
...
2017-10-12 11:10:20 [scrapy_fieldstats.fieldstats] INFO: Field stats:
{
    'author': {
        'name': '100.0%',
        'age':  '52.0%'
    },
    'image':  '97.0%',
    'title':  '100.0%',
    'price':  '92.0%',
    'stars':  '47.5%'
}
2017-10-12 11:10:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
...

Installation

First, pip install this package:

$ pip install scrapy-fieldstats

Usage

Enable the extension in your project's settings.py file by adding the following lines:

EXTENSIONS = {
    'scrapy_fieldstats.fieldstats.FieldStatsExtension': 10,
}
FIELDSTATS_ENABLED = True

That's all! Now run your job and have a look at the field stats.

Settings

The settings below can be defined like any other Scrapy setting, as described in the Scrapy docs.

  • FIELDSTATS_ENABLED: enables/disables the extension.
  • FIELDSTATS_COUNTS_ONLY: when True, the extension outputs absolute counts instead of percentages.
  • FIELDSTATS_SKIP_NONE: when True, None values are not counted as existing values for fields.
  • FIELDSTATS_ADD_TO_STATS: when True, the extension adds the field coverage report to the job stats.
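Putting it all together, a settings.py might look like the sketch below. The specific True/False values chosen here are just examples, not defaults:

```python
# settings.py (example values)
EXTENSIONS = {
    'scrapy_fieldstats.fieldstats.FieldStatsExtension': 10,
}
FIELDSTATS_ENABLED = True
FIELDSTATS_COUNTS_ONLY = False   # report percentages rather than raw counts
FIELDSTATS_SKIP_NONE = True      # treat None as a missing value
FIELDSTATS_ADD_TO_STATS = True   # include the report in the job stats
```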

Contributing

If you spot a bug or want to propose a new feature, please create an issue in this project's issue tracker.