
fooock / robots.txt

License: GPL-3.0
🤖 robots.txt as a service. Crawls robots.txt files, downloads and parses them to check rules through an API

Programming Languages

Java
68154 projects - #9 most used programming language
Kotlin
9241 projects
Shell
77523 projects
Makefile
30231 projects
ANTLR
299 projects
Dockerfile
14818 projects
Python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to robots.txt

orkid-node
Reliable and modern Redis Streams based task queue for Node.js 🤖
Stars: ✭ 61 (+369.23%)
Mutual labels:  redis-stream, redis-streams
robots-parser
NodeJS robots.txt parser with support for wildcard (*) matching.
Stars: ✭ 117 (+800%)
Mutual labels:  robots-txt, robots-parser
protego
A pure-Python robots.txt parser with support for modern conventions.
Stars: ✭ 36 (+176.92%)
Mutual labels:  robots-txt, robots-parser
jsitemapgenerator
Java sitemap generator. This library generates a web sitemap, can ping Google, and can generate an RSS feed, robots.txt, and more, in a friendly, easy-to-use Java 8 functional style of programming
Stars: ✭ 38 (+192.31%)
Mutual labels:  robots-txt
nuxt-humans-txt
🧑🏻👩🏻 "We are people, not machines" - An initiative to know the creators of a website. Contains the information about humans to the web building - A Nuxt Module to statically integrate and generate a humans.txt author file - Based on the HumansTxt Project.
Stars: ✭ 27 (+107.69%)
Mutual labels:  robots-txt
Java-AgentSpeak
LightJason - AgentSpeak(L++) for Java
Stars: ✭ 21 (+61.54%)
Mutual labels:  antlr4
libra
Java Predicate, supports SQL-like syntax
Stars: ✭ 30 (+130.77%)
Mutual labels:  antlr4
ultimate-sitemap-parser
Ultimate Website Sitemap Parser
Stars: ✭ 118 (+807.69%)
Mutual labels:  robots-txt
Free proxy pool
Crawls free proxy IP websites and aggregates the results into your own proxy pool. The key points are validating proxies for availability and anonymity, and removing duplicates.
Stars: ✭ 66 (+407.69%)
Mutual labels:  spiders
AnimalRecognitionDemo
An example of using Redis Streams, RedisGears and RedisAI for Realtime Video Analytics (i.e. filtering cats)
Stars: ✭ 35 (+169.23%)
Mutual labels:  redis-streams
Gocrawl
Polite, slim and concurrent web crawler.
Stars: ✭ 1,962 (+14992.31%)
Mutual labels:  robots-txt
robotstxt-webpack-plugin
A webpack plugin to generate a robots.txt file
Stars: ✭ 31 (+138.46%)
Mutual labels:  robots-txt
java-ast
Java Parser for JavaScript/TypeScript (based on antlr4ts)
Stars: ✭ 58 (+346.15%)
Mutual labels:  antlr4
.NetCorePluginManager
.Net Core Plugin Manager, extend web applications using plugin technology enabling true SOLID and DRY principles when developing applications
Stars: ✭ 17 (+30.77%)
Mutual labels:  robots-txt
antlr4-tool
A useful Antlr4 tool with full TypeScript support
Stars: ✭ 34 (+161.54%)
Mutual labels:  antlr4
BaiduSpider
The project has moved to https://github.com/BaiduSpider/BaiduSpider !! A crawler for Baidu search results; it currently supports Baidu web search, image search, Zhidao (Q&A) search, video search, news search, Wenku (document) search, Jingyan (experience) search, and Baike (encyclopedia) search.
Stars: ✭ 29 (+123.08%)
Mutual labels:  spiders
yahdl
A programming language for FPGAs.
Stars: ✭ 20 (+53.85%)
Mutual labels:  antlr4
gatsby-plugin-robots-txt
Gatsby plugin that automatically creates robots.txt for your site
Stars: ✭ 105 (+707.69%)
Mutual labels:  robots-txt
parcera
Grammar-based Clojure(script) parser
Stars: ✭ 100 (+669.23%)
Mutual labels:  antlr4
grobotstxt
grobotstxt is a native Go port of Google's robots.txt parser and matcher library.
Stars: ✭ 83 (+538.46%)
Mutual labels:  robots-txt

🤖 robots.txt as a service 🤖


🚧 Project in development

Distributed robots.txt parser and rule checker with API access. If you are working on a distributed web crawler and you want to be polite in your actions, you will find this project very useful. This project can also be integrated into any SEO tool to check whether content is being indexed correctly by robots.

For this first version, we are trying to comply with the specification used by Google to analyze websites. You can see it here. Expect support for other robots specifications soon!

Why this project?

If you are building a distributed web crawler, you know that managing robots.txt rules from websites is a hard task that is complicated to maintain in a scalable way, while you need to focus on your business requirements. robots.txt can help by acting as a service that checks whether a given URL resource can be crawled using a specified user agent (or robot name). It can be easily integrated into existing software through a web API and starts working in less than a second, as sketched below.
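
As a rough illustration of that kind of integration, here is a minimal Java 11+ sketch (not part of this repository) that calls the checker API shown in the Getting started section below, assuming the service is running locally on port 9080:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch: ask the checker API whether a URL may be crawled before fetching it.
// Assumes the stack from the Getting started section is running on localhost:9080.
public class PolitenessCheck {
    public static void main(String[] args) throws Exception {
        String body = "{\"url\": \"https://news.ycombinator.com/newest\", \"agent\": \"AwesomeBot\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9080/v1/allowed"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The API returns a small JSON document with an "allowed" field; a real
        // crawler would parse it with a JSON library instead of this naive check.
        boolean allowed = response.body().contains("\"allowed\":true");
        System.out.println(allowed ? "Safe to crawl" : "Skip this URL");
    }
}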

Requirements

To build and run this project on your machine, you will need Docker, docker-compose, and Make installed on your system.

Getting started

If you want to test this project locally, make sure Docker, docker-compose, and Make are installed on your system. Then execute the following command to compile all projects, build the Docker images, and run them:

👉 Be patient!

$ make start-all

You can execute make logs to see how things are going.

Now you can send some URLs to the crawler system to download the rules found in their robots.txt files and persist them in the database. For example, you can invoke the crawl API with this command:

$ curl -X POST http://localhost:9081/v1/send \
       -d 'url=https://news.ycombinator.com/newcomments' \
       -H 'Content-Type: application/x-www-form-urlencoded'

There is also another method in the API to make a crawl request using a GET request. If you want to check all the methods this application exposes, import this Postman collection.

This command sends the URL to the streaming service; once it is received, the robots.txt file is downloaded, parsed, and saved into the database.
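
For reference, the same crawl request could be sent from Java instead of curl. This is just a sketch using the standard HttpClient (Java 11+), with the same endpoint and url form field as the command above:

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Minimal sketch: submit a URL to the crawl API, mirroring the curl example above.
public class SubmitUrl {
    public static void main(String[] args) throws Exception {
        String form = "url=" + URLEncoder.encode(
                "https://news.ycombinator.com/newcomments", StandardCharsets.UTF_8);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9081/v1/send"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Print only the HTTP status; the exact response body is not documented here.
        System.out.println("HTTP status: " + response.statusCode());
    }
}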

The next step is to check whether you can access a resource on a known host using a given user-agent directive. For this purpose, you will need to use the checker API. Imagine that you need to check whether your crawler can access the newest resources from Hacker News. You would execute:

$ curl -X POST http://localhost:9080/v1/allowed \
       -d '{"url": "https://news.ycombinator.com/newest","agent": "AwesomeBot"}' \
       -H 'Content-Type: application/json'

The response will be:

{
  "url":"https://news.ycombinator.com/newest",
  "agent":"AwesomeBot",
  "allowed":true
}

This is like saying: Hey! You can crawl content from https://news.ycombinator.com/newest

When you finish your tests, execute the following command to stop and remove all Docker containers:

$ make stop-all

🔥 Happy Hacking! 🔥
