
ekalinin / robots.js

License: MIT License
Parser for robots.txt for node.js

Programming Languages

javascript
184084 projects - #8 most used programming language
Makefile
30231 projects

Projects that are alternatives to or similar to robots.js

nuxt-humans-txt
🧑🏻👩🏻 "We are people, not machines" - An initiative to know the creators of a website. Contains the information about humans to the web building - A Nuxt Module to statically integrate and generate a humans.txt author file - Based on the HumansTxt Project.
Stars: ✭ 27 (-57.81%)
Mutual labels:  robots-txt, robots
aztarna
aztarna, a footprinting tool for robots.
Stars: ✭ 85 (+32.81%)
Mutual labels:  robots
jsitemapgenerator
Java sitemap generator. This library generates a web sitemap, can ping Google, generate RSS feed, robots.txt and more with friendly, easy to use Java 8 functional style of programming
Stars: ✭ 38 (-40.62%)
Mutual labels:  robots-txt
arguing-robots
🤖 Watch and hear macOS robots argue live in your terminal 🤖
Stars: ✭ 53 (-17.19%)
Mutual labels:  robots
TikTok
Download public videos on TikTok using Python with Selenium
Stars: ✭ 37 (-42.19%)
Mutual labels:  robots
openroberta-lab
The programming environment »Open Roberta Lab« by Fraunhofer IAIS enables children and adolescents to program robots. A variety of different programming blocks are provided to program motors and sensors of the robot. Open Roberta Lab uses an approach of graphical programming so that beginners can seamlessly start coding. As a cloud-based applica…
Stars: ✭ 98 (+53.13%)
Mutual labels:  robots
ultimate-sitemap-parser
Ultimate Website Sitemap Parser
Stars: ✭ 118 (+84.38%)
Mutual labels:  robots-txt
stoqs
Geospatial database visualization software for oceanographic measurement data
Stars: ✭ 31 (-51.56%)
Mutual labels:  robots
multi robot traj planner
An Efficient Multi-Robot Trajectory Planner for Ground Vehicles.
Stars: ✭ 104 (+62.5%)
Mutual labels:  robots
.NetCorePluginManager
.Net Core Plugin Manager, extend web applications using plugin technology enabling true SOLID and DRY principles when developing applications
Stars: ✭ 17 (-73.44%)
Mutual labels:  robots-txt
Nasa-And-Spacex-Cooperation
Theme Outer Space
Stars: ✭ 41 (-35.94%)
Mutual labels:  robots
RedBot
Design files and firmware files for the RedBot robotics board.
Stars: ✭ 22 (-65.62%)
Mutual labels:  robots
penny
3 servos, 10 dollars hexapod
Stars: ✭ 26 (-59.37%)
Mutual labels:  robots
robotstxt-webpack-plugin
A webpack plugin to generate a robots.txt file
Stars: ✭ 31 (-51.56%)
Mutual labels:  robots-txt
youtube-video-maker
📹 A tool for automatic video creation and uploading on YouTube
Stars: ✭ 134 (+109.38%)
Mutual labels:  robots
community-projects
Webots projects (PROTO files, controllers, simulation worlds, etc.) contributed by the community.
Stars: ✭ 20 (-68.75%)
Mutual labels:  robots
robot hacking manual
Robot Hacking Manual (RHM). From robotics to cybersecurity. Papers, notes and writeups from a journey into robot cybersecurity.
Stars: ✭ 169 (+164.06%)
Mutual labels:  robots
summit xl sim
Packages for the simulation of the Summit XL, Summit XL HL and Summit-X (including X-WAM) robots
Stars: ✭ 32 (-50%)
Mutual labels:  robots
linorobot2
Autonomous mobile robots (2WD, 4WD, Mecanum Drive)
Stars: ✭ 97 (+51.56%)
Mutual labels:  robots
blender-robotics-utils
Set of utilities for exporting/controlling your robot in Blender
Stars: ✭ 26 (-59.37%)
Mutual labels:  robots

robots.js

robots.js is a parser for robots.txt files for node.js.

Installation

It's recommended to install via npm:

$ npm install -g robots

Usage

Here's an example of using robots.js:

var robots = require('robots')
  , parser = new robots.RobotsParser();

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  if(success) {
    parser.canFetch('*', '/doc/dailyjs-nodepad/', function (access) {
      if (access) {
        // parse url
      }
    });
  }
});

The default crawler user-agent is:

Mozilla/5.0 (X11; Linux i686; rv:5.0) Gecko/20100101 Firefox/5.0

Here's an example of using a different user-agent and a more detailed callback:

var robots = require('robots')
  , parser = new robots.RobotsParser(
                'http://nodeguide.ru/robots.txt',
                'Mozilla/5.0 (compatible; RobotTxtBot/1.0)',
                after_parse
            );
            
function after_parse(parser, success) {
  if(success) {
    parser.canFetch('*', '/doc/dailyjs-nodepad/', function (access, url, reason) {
      if (access) {
        console.log(' url: '+url+', access: '+access);
        // parse url ...
      }
    });
  }
}

Here's an example of getting the list of sitemaps:

var robots = require('robots')
  , parser = new robots.RobotsParser();

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  if(success) {
    parser.getSitemaps(function(sitemaps) {
      // sitemaps is an array of sitemap URLs
    });
  }
});

Here's an example of getCrawlDelay usage:

var robots = require('robots')
  , parser = new robots.RobotsParser();

// for example:
//
// $ curl -s http://nodeguide.ru/robots.txt
//
// User-agent: Google-bot
// Disallow: /
// Crawl-delay: 2
//
// User-agent: *
// Disallow: /
// Crawl-delay: 2

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  if(success) {
    var GoogleBotDelay = parser.getCrawlDelay("Google-bot");
    // ...
  }
});

An example of passing options to the HTTP request:

var options = {
  headers: {
    Authorization: "Basic " + Buffer.from("username:password").toString("base64")
  }
};

var robots = require('robots')
  , parser = new robots.RobotsParser(null, options);

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  ...
});

API

RobotsParser is the main class. It provides a set of methods to read, parse, and answer questions about a single robots.txt file.

  • setUrl(url, read) - sets the URL referring to a robots.txt file. By default it invokes the read() method. If read is a function, it is called once the remote file has been downloaded and parsed; it takes two arguments: the parser itself and a boolean that is true if the remote file was parsed successfully.
  • read(after_parse) - reads the robots.txt URL and feeds it to the parser
  • parse(lines) - parses the input lines from a robots.txt file (see the sketch after this list)
  • canFetch(userAgent, url, callback) - using the parsed robots.txt, decides whether userAgent can fetch url. Callback function: function callback(access, url, reason) { ... } where:
    • access - whether this url can be fetched. true/false.
    • url - target url
    • reason - reason for the access decision. Object:
      • type - valid values: 'statusCode', 'entry', 'defaultEntry', 'noRule'
      • entry - an instance of lib/Entry.js. Only for types 'entry' and 'defaultEntry'
      • statusCode - HTTP response status code for url. Only for type 'statusCode'
  • canFetchSync(userAgent, url) - using the parsed robots.txt, decides whether userAgent can fetch url. Returns true/false.
  • getCrawlDelay(userAgent) - returns the Crawl-delay for the given userAgent
  • getSitemaps(callback) - gets the Sitemap URLs from the parsed robots.txt (the callback receives an array)
  • getDisallowedPaths(userAgent) - gets the paths explicitly disallowed for the specified user agent AND *
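
For the methods that have no example earlier in this README (parse(), canFetchSync(), getDisallowedPaths() and the reason argument of canFetch()), here is a minimal sketch. It assumes parse() accepts an array of robots.txt lines, as the method description above suggests, and feeds the parser an in-memory file instead of fetching one over HTTP. The robots.txt content, the agent name MyBot and the expected values in the comments are illustrative assumptions, not output captured from the library:

var robots = require('robots')
  , parser = new robots.RobotsParser();

// Hypothetical robots.txt, passed as an array of lines:
parser.parse([
  'User-agent: MyBot',
  'Disallow: /private/',
  'Crawl-delay: 5',
  '',
  'User-agent: *',
  'Disallow: /tmp/',
  'Sitemap: http://example.com/sitemap.xml'
]);

// Synchronous checks against the MyBot group defined above:
console.log(parser.canFetchSync('MyBot', '/index.html'));   // expected: true
console.log(parser.canFetchSync('MyBot', '/private/page')); // expected: false

// Crawl-delay and disallowed paths for the same agent (plus the '*' group):
console.log(parser.getCrawlDelay('MyBot'));                 // expected: 5
console.log(parser.getDisallowedPaths('MyBot'));            // expected: [ '/private/', '/tmp/' ]

// The asynchronous form also reports why access was granted or denied:
parser.canFetch('MyBot', '/private/page', function (access, url, reason) {
  // reason.type should be 'entry' here, since the MyBot group matched
  console.log(access, url, reason.type);
});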

License

See LICENSE file.
