scrapy / protego

License: BSD-3-Clause
A pure-Python robots.txt parser with support for modern conventions.

Projects that are alternatives of or similar to protego

robots-parser
NodeJS robots.txt parser with support for wildcard (*) matching.
Stars: ✭ 117 (+225%)
Mutual labels:  robots-txt, robots-parser
robots.txt
πŸ€– robots.txt as a service. Crawls robots.txt files, downloads and parses them to check rules through an API
Stars: ✭ 13 (-63.89%)
Mutual labels:  robots-txt, robots-parser
Gocrawl
Polite, slim and concurrent web crawler.
Stars: ✭ 1,962 (+5350%)
Mutual labels:  robots-txt
gatsby-plugin-robots-txt
Gatsby plugin that automatically creates robots.txt for your site
Stars: ✭ 105 (+191.67%)
Mutual labels:  robots-txt
grobotstxt
grobotstxt is a native Go port of Google's robots.txt parser and matcher library.
Stars: ✭ 83 (+130.56%)
Mutual labels:  robots-txt
robots.js
robots.txt parser for Node.js
Stars: ✭ 64 (+77.78%)
Mutual labels:  robots-txt
robotstxt-webpack-plugin
A webpack plugin to generate a robots.txt file
Stars: ✭ 31 (-13.89%)
Mutual labels:  robots-txt
nuxt-humans-txt
πŸ§‘πŸ»πŸ‘©πŸ» "We are people, not machines" - An initiative to know the creators of a website. Contains the information about humans to the web building - A Nuxt Module to statically integrate and generate a humans.txt author file - Based on the HumansTxt Project.
Stars: ✭ 27 (-25%)
Mutual labels:  robots-txt
.NetCorePluginManager
.NET Core plugin manager; extends web applications using plugin technology, enabling true SOLID and DRY principles when developing applications
Stars: ✭ 17 (-52.78%)
Mutual labels:  robots-txt
jsitemapgenerator
Java sitemap generator. This library generates a web sitemap, can ping Google, and can also generate an RSS feed, robots.txt, and more, in a friendly, easy-to-use Java 8 functional style
Stars: ✭ 38 (+5.56%)
Mutual labels:  robots-txt
ultimate-sitemap-parser
Ultimate Website Sitemap Parser
Stars: ✭ 118 (+227.78%)
Mutual labels:  robots-txt
robotify-netcore
Provides robots.txt middleware for .NET core
Stars: ✭ 15 (-58.33%)
Mutual labels:  robots-txt
robotparser-rs
robots.txt parser for Rust.
Stars: ✭ 16 (-55.56%)
Mutual labels:  robots-parser

Protego

Protego is a pure-Python robots.txt parser with support for modern conventions.

Install

To install Protego, simply use pip:

pip install protego
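
To confirm the install worked, importing the package is a sufficient smoke test:

python -c "import protego"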

Usage

>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m                 # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'

Using Protego with Requests:

>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']
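
Not every site serves a robots.txt file. Below is a minimal sketch of one way to handle that case; the fetch_robots helper and its allow-all fallback are illustrative choices, not part of Protego's API:

from protego import Protego
import requests

def fetch_robots(site_url):
    """Fetch and parse a site's robots.txt; treat a missing file as allow-all."""
    response = requests.get(site_url.rstrip("/") + "/robots.txt", timeout=10)
    if response.status_code == 200:
        return Protego.parse(response.text)
    # No accessible robots.txt: an empty body yields a permissive parser
    # that allows every URL, matching the usual convention for absent files.
    return Protego.parse("")

rp = fetch_robots("https://example.com")
print(rp.can_fetch("https://example.com/some/page", "mybot"))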

Comparison

The following table compares Protego to the most popular robots.txt parsers implemented in Python or featuring Python bindings:

                         Protego  RobotFileParser              Reppy   Robotexclusionrulesparser
Implementation language  Python   Python                       C++     Python
Reference specification  Google   Martijn Koster's 1996 draft
Wildcard support         ✓                                     ✓       ✓
Length-based precedence  ✓                                     ✓
Performance                       +40%                         +1300%  -25%
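
The performance figures above come from the project's own benchmarks; results on your inputs and hardware will differ. As an illustration, parse-and-check throughput can be timed with nothing but the standard library:

import timeit
from protego import Protego

# A deliberately small robots.txt body; real-world files are often much
# larger, which shifts the balance between parse time and match time.
body = "User-agent: *\nDisallow: /private\nAllow: /private/public\n"

def parse_and_check():
    rp = Protego.parse(body)
    return rp.can_fetch("https://example.com/private", "mybot")

n = 10_000
total = timeit.timeit(parse_and_check, number=n)
print(f"{total / n * 1e6:.1f} microseconds per parse-and-check")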

API Reference

Class protego.Protego:

Properties

  • sitemaps {list_iterator} An iterator over the sitemap URLs specified in robots.txt.
  • preferred_host {string} Preferred host specified in robots.txt.

Methods

  • parse(robotstxt_body) Parse robots.txt and return a new instance of protego.Protego.
  • can_fetch(url, user_agent) Return True if the user agent can fetch the URL, otherwise return False.
  • crawl_delay(user_agent) Return the crawl delay specified for the user agent as a float. If nothing is specified, return None.
  • request_rate(user_agent) Return the request rate specified for the user agent as a named tuple RequestRate(requests, seconds, start_time, end_time). If nothing is specified, return None. (A sketch combining this with crawl_delay follows this list.)
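
crawl_delay() and request_rate() are most useful for throttling a crawler. The sketch below shows one way to combine them; the polite_delay helper and its 1-second default are assumptions for illustration, not part of Protego:

import time
from protego import Protego

rp = Protego.parse("User-agent: *\nCrawl-delay: 4\nRequest-rate: 10/1m\n")

def polite_delay(rp, user_agent, default=1.0):
    """Use the larger of crawl-delay and the delay implied by request-rate."""
    delay = rp.crawl_delay(user_agent) or default
    rate = rp.request_rate(user_agent)
    if rate is not None:
        # 10 requests per 60 seconds implies at least 6 seconds between requests.
        delay = max(delay, rate.seconds / rate.requests)
    return delay

for url in ["https://example.com/a", "https://example.com/b"]:
    if rp.can_fetch(url, "mybot"):
        ...  # fetch the page here
    time.sleep(polite_delay(rp, "mybot"))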