
bopoda / robots-txt-parser

License: MIT
PHP class for parsing all directives from robots.txt files according to the specifications

Programming Languages

PHP
23972 projects - #3 most used programming language

Projects that are alternatives of or similar to robots-txt-parser

Clickhouse Sqlalchemy
ClickHouse dialect for SQLAlchemy
Stars: ✭ 166 (+336.84%)
Mutual labels:  yandex
mapkit-android-demo
MapKit Android demo
Stars: ✭ 92 (+142.11%)
Mutual labels:  yandex
ekstertera
Linux GUI client for working with Yandex.Disk via the REST API
Stars: ✭ 33 (-13.16%)
Mutual labels:  yandex
Cloudcross
CloudCross is open-source, cross-platform software for synchronizing local files and folders with many cloud providers. Currently Cloud Mail.Ru, Yandex.Disk, Google Drive, OneDrive and Dropbox are supported
Stars: ✭ 185 (+386.84%)
Mutual labels:  yandex
django-yaturbo
Reusable Django app to enable Yandex Turbo Pages for your site
Stars: ✭ 13 (-65.79%)
Mutual labels:  yandex
aioch
aioch is a library for accessing a ClickHouse database over the native interface with asyncio
Stars: ✭ 145 (+281.58%)
Mutual labels:  yandex
Dynamictranslator
Instant translation application for windows in .NET 🎪
Stars: ✭ 131 (+244.74%)
Mutual labels:  yandex
docker-machine-driver-yandex
Yandex.Cloud driver for Docker Machine
Stars: ✭ 21 (-44.74%)
Mutual labels:  yandex
yadisk
Download file from Yandex.Disk through share link
Stars: ✭ 33 (-13.16%)
Mutual labels:  yandex
yandex-direct-client
Lightweight and useful Yandex Direct API version 5 client
Stars: ✭ 16 (-57.89%)
Mutual labels:  yandex
Sitedorks
Search Google/Bing/Ecosia/DuckDuckGo/Yandex/Yahoo for a search term with a default set of websites, bug bounty programs or a custom collection.
Stars: ✭ 221 (+481.58%)
Mutual labels:  yandex
Deep Translator
A flexible, free and unlimited Python tool to translate between different languages in a simple way using multiple translators.
Stars: ✭ 233 (+513.16%)
Mutual labels:  yandex
pyaspeller
Python text speller
Stars: ✭ 26 (-31.58%)
Mutual labels:  yandex
Rust S3
Rust library for interfacing with AWS S3 and other API compatible services
Stars: ✭ 177 (+365.79%)
Mutual labels:  yandex
yametrikapy
Python library for Yandex Metrika API
Stars: ✭ 20 (-47.37%)
Mutual labels:  yandex
Clickhouse Net
Yandex ClickHouse fully managed .NET client
Stars: ✭ 142 (+273.68%)
Mutual labels:  yandex
butdr
Backup to Cloud( Google Drive, Dropbox ... ) use rclone
Stars: ✭ 49 (+28.95%)
Mutual labels:  yandex
drupal 8 unset html head link
🤖 Module for removing unwanted HTML links (like rel="delete-form", rel="edit-form", etc.) from the head on Drupal 8.x websites. A reliable way to improve your position in Google, Yandex, and other SERPs.
Stars: ✭ 19 (-50%)
Mutual labels:  yandex
yandex-disk-api
This library is built to use Yandex Disk API with PHP
Stars: ✭ 19 (-50%)
Mutual labels:  yandex
Awesome-meta-tags
📙 Awesome collection of meta tags
Stars: ✭ 18 (-52.63%)
Mutual labels:  yandex

robots-txt-parser

Build Status

RobotsTxtParser — PHP class for parsing all the directives of robots.txt files

RobotsTxtValidator — PHP class for checking whether a URL is allowed or disallowed according to robots.txt rules.

Try the RobotsTxtParser demo online on live domains.

Parsing is carried out according to the Google and Yandex specifications.

Last improvements:

  1. Parse the Clean-param directive according to the clean-param syntax.
  2. Delete comments (everything following the '#' character, up to the first line break, is disregarded).
  3. Improved parsing of the Host directive: as a cross-section directive it should refer to the user-agent '*'; if there are multiple hosts, search engines take the value of the first.
  4. Removed unused methods from the class, refactored it, and corrected the visibility of its properties.
  5. Added more test cases, including test cases covering the new functionality.
  6. Added the RobotsTxtValidator class to check whether a URL is allowed for parsing.
  7. Version 2.0 significantly improves the speed of RobotsTxtParser.
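
The comment-handling rule in item 2 can be sketched as a small standalone function (an illustration of the rule, not the library's internal code):

```php
<?php
// Drop everything from the first '#' on each line to the end of that line,
// as described by the comment-deletion rule above.
function stripRobotsTxtComments(string $content): string
{
    $clean = [];
    foreach (preg_split('/\R/', $content) as $line) {
        $hashPos = strpos($line, '#');
        if ($hashPos !== false) {
            $line = substr($line, 0, $hashPos);
        }
        $clean[] = rtrim($line);
    }
    return implode("\n", $clean);
}
```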

Supported Directives:

  • DIRECTIVE_ALLOW = 'allow';
  • DIRECTIVE_DISALLOW = 'disallow';
  • DIRECTIVE_HOST = 'host';
  • DIRECTIVE_SITEMAP = 'sitemap';
  • DIRECTIVE_USERAGENT = 'user-agent';
  • DIRECTIVE_CRAWL_DELAY = 'crawl-delay';
  • DIRECTIVE_CLEAN_PARAM = 'clean-param';
  • DIRECTIVE_NOINDEX = 'noindex';
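
Each of these directives appears in robots.txt as a case-insensitive `name: value` line. A minimal sketch of splitting such a line (hypothetical helper, not part of the library's API):

```php
<?php
// Split a raw robots.txt line into a lowercased directive name and its value.
// Returns null for lines that are not "directive: value" pairs.
function parseDirectiveLine(string $line): ?array
{
    if (!preg_match('/^\s*([A-Za-z-]+)\s*:\s*(.*)$/', $line, $m)) {
        return null;
    }
    return [strtolower($m[1]), trim($m[2])];
}
```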

Installation

Install the latest version with

$ composer require bopoda/robots-txt-parser

Run tests

Run the PHPUnit tests using the command:

$ php vendor/bin/phpunit

Usage example

You can start the parser by getting the content of a robots.txt file from a website:

$parser = new RobotsTxtParser(file_get_contents('http://example.com/robots.txt'));
var_dump($parser->getRules());

Or simply use the contents of the file as input (i.e., when the content is already cached):

$parser = new RobotsTxtParser("
	User-Agent: *
	Disallow: /ajax
	Disallow: /search
	Clean-param: param1 /path/file.php

	User-agent: Yahoo
	Disallow: /

	Host: example.com
	Host: example2.com
");
var_dump($parser->getRules());

This will output:

array(2) {
  ["*"]=>
  array(3) {
    ["disallow"]=>
    array(2) {
      [0]=>
      string(5) "/ajax"
      [1]=>
      string(7) "/search"
    }
    ["clean-param"]=>
    array(1) {
      [0]=>
      string(21) "param1 /path/file.php"
    }
    ["host"]=>
    string(11) "example.com"
  }
  ["yahoo"]=>
  array(1) {
    ["disallow"]=>
    array(1) {
      [0]=>
      string(1) "/"
    }
  }
}

To validate a URL, use the RobotsTxtValidator class:

$parser = new RobotsTxtParser(file_get_contents('http://example.com/robots.txt'));
$validator = new RobotsTxtValidator($parser->getRules());

$url = '/';
$userAgent = 'MyAwesomeBot';

if ($validator->isUrlAllow($url, $userAgent)) {
    // Crawl the site URL and do nice stuff
}
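
Under the Google specification, the most specific (longest) matching rule wins when allow and disallow rules conflict. A simplified, self-contained sketch of that decision (an illustration only, not RobotsTxtValidator's actual implementation):

```php
<?php
// Decide whether a URL path is allowed given plain allow/disallow prefixes:
// the longest matching prefix wins; with no matching rule, the URL is allowed.
// Allow rules are checked first so that they win exact-length ties.
function isAllowed(string $url, array $allow, array $disallow): bool
{
    $best = ['len' => -1, 'allow' => true];
    foreach ([['rules' => $allow, 'allow' => true],
              ['rules' => $disallow, 'allow' => false]] as $set) {
        foreach ($set['rules'] as $prefix) {
            if ($prefix !== '' && strpos($url, $prefix) === 0
                && strlen($prefix) > $best['len']) {
                $best = ['len' => strlen($prefix), 'allow' => $set['allow']];
            }
        }
    }
    return $best['allow'];
}
```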

Contribution

Feel free to create a PR in this repository. Please follow the PSR coding style.

See the list of contributors who participated in this project.

Final Notes:

Please use version 2.0+, which follows the same rules but offers much higher performance.
