
fooock / robots.txt

License: GPL-3.0
🤖 robots.txt as a service. Crawls robots.txt files, downloads and parses them to check rules through an API

Programming Languages

Java
68154 projects - #9 most used programming language
Kotlin
9241 projects
Shell
77523 projects
Makefile
30231 projects
ANTLR
299 projects
Dockerfile
14818 projects
Python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to robots.txt

orkid-node
Reliable and modern Redis Streams based task queue for Node.js 🤖
Stars: ✭ 61 (+369.23%)
Mutual labels:  redis-stream, redis-streams
robots-parser
NodeJS robots.txt parser with support for wildcard (*) matching.
Stars: ✭ 117 (+800%)
Mutual labels:  robots-txt, robots-parser
protego
A pure-Python robots.txt parser with support for modern conventions.
Stars: ✭ 36 (+176.92%)
Mutual labels:  robots-txt, robots-parser
jsitemapgenerator
Java sitemap generator. This library generates a web sitemap, can ping Google, and can generate an RSS feed, robots.txt, and more, in a friendly, easy-to-use Java 8 functional style of programming
Stars: ✭ 38 (+192.31%)
Mutual labels:  robots-txt
nuxt-humans-txt
🧑🏻👩🏻 "We are people, not machines" - An initiative to know the creators of a website. Contains the information about humans to the web building - A Nuxt Module to statically integrate and generate a humans.txt author file - Based on the HumansTxt Project.
Stars: ✭ 27 (+107.69%)
Mutual labels:  robots-txt
Java-AgentSpeak
LightJason - AgentSpeak(L++) for Java
Stars: ✭ 21 (+61.54%)
Mutual labels:  antlr4
libra
Java Predicate, supports SQL-like syntax
Stars: ✭ 30 (+130.77%)
Mutual labels:  antlr4
ultimate-sitemap-parser
Ultimate Website Sitemap Parser
Stars: ✭ 118 (+807.69%)
Mutual labels:  robots-txt
Free proxy pool
Crawls free proxy IP websites and aggregates the results into your own proxy pool. The key points are validating proxies for availability and anonymity, and removing duplicates.
Stars: ✭ 66 (+407.69%)
Mutual labels:  spiders
AnimalRecognitionDemo
An example of using Redis Streams, RedisGears and RedisAI for Realtime Video Analytics (i.e. filtering cats)
Stars: ✭ 35 (+169.23%)
Mutual labels:  redis-streams
Gocrawl
Polite, slim and concurrent web crawler.
Stars: ✭ 1,962 (+14992.31%)
Mutual labels:  robots-txt
robotstxt-webpack-plugin
A webpack plugin to generate a robots.txt file
Stars: ✭ 31 (+138.46%)
Mutual labels:  robots-txt
java-ast
Java Parser for JavaScript/TypeScript (based on antlr4ts)
Stars: ✭ 58 (+346.15%)
Mutual labels:  antlr4
.NetCorePluginManager
.Net Core Plugin Manager, extend web applications using plugin technology enabling true SOLID and DRY principles when developing applications
Stars: ✭ 17 (+30.77%)
Mutual labels:  robots-txt
antlr4-tool
A useful Antlr4 tool with full TypeScript support
Stars: ✭ 34 (+161.54%)
Mutual labels:  antlr4
BaiduSpider
The project has moved to https://github.com/BaiduSpider/BaiduSpider !! A crawler for Baidu search results; it currently supports Baidu web search, image search, Zhidao (Q&A) search, video search, news search, Wenku (document) search, Jingyan (experience) search, and Baike (encyclopedia) search.
Stars: ✭ 29 (+123.08%)
Mutual labels:  spiders
yahdl
A programming language for FPGAs.
Stars: ✭ 20 (+53.85%)
Mutual labels:  antlr4
gatsby-plugin-robots-txt
Gatsby plugin that automatically creates robots.txt for your site
Stars: ✭ 105 (+707.69%)
Mutual labels:  robots-txt
parcera
Grammar-based Clojure(script) parser
Stars: ✭ 100 (+669.23%)
Mutual labels:  antlr4
grobotstxt
grobotstxt is a native Go port of Google's robots.txt parser and matcher library.
Stars: ✭ 83 (+538.46%)
Mutual labels:  robots-txt

🤖 robots.txt as a service 🤖


🚧 Project in development

Distributed robots.txt parser and rule checker with API access. If you are working on a distributed web crawler and you want to be polite in your actions, you will find this project very useful. This project can also be integrated into any SEO tool to check whether content is being indexed correctly by robots.

For this first version, we are trying to comply with the specification used by Google to analyze websites. You can see it here. Expect support for other robots specifications soon!

Why this project?

If you are building a distributed web crawler, you know that managing robots.txt rules from websites is a hard task that is complicated to maintain in a scalable way, while you need to focus on your business requirements. robots.txt can help by acting as a service that checks whether a given URL resource can be crawled using a specified user agent (or robot name). It can be easily integrated into existing software through a web API and starts working in less than a second, as sketched below.
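
As a rough illustration of that kind of integration, here is a minimal Java 11+ sketch (not part of this repository) that calls the checker API shown in the Getting started section below, assuming the service is running locally on port 9080:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch: ask the checker API whether a URL may be crawled before fetching it.
// Assumes the stack from the Getting started section is running on localhost:9080.
public class PolitenessCheck {
    public static void main(String[] args) throws Exception {
        String body = "{\"url\": \"https://news.ycombinator.com/newest\", \"agent\": \"AwesomeBot\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9080/v1/allowed"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The API returns a small JSON document with an "allowed" field; a real
        // crawler would parse it with a JSON library instead of this naive check.
        boolean allowed = response.body().contains("\"allowed\":true");
        System.out.println(allowed ? "Safe to crawl" : "Skip this URL");
    }
}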

Requirements

To build and run this project on your machine, you will need Docker, docker-compose, and Make installed on your system.

Getting started

If you want to test this project locally, make sure Docker, docker-compose, and Make are installed on your system. Then execute the following command to compile all projects, build the Docker images, and run them:

👉 Be patient!

$ make start-all

You can execute make logs to see how things are going.

Now you can send some URLs to the crawler system to download the rules found in their robots.txt files and persist them in the database. For example, you can invoke the crawl API with this command:

$ curl -X POST http://localhost:9081/v1/send \
       -d 'url=https://news.ycombinator.com/newcomments' \
       -H 'Content-Type: application/x-www-form-urlencoded'

There is also another method in the API to make a crawl request using a GET request. If you want to check all the methods this application exposes, import this Postman collection.

This command sends the URL to the streaming service; once it is received, the robots.txt file is downloaded, parsed, and saved into the database.
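
For reference, the same crawl request could be sent from Java instead of curl. This is just a sketch using the standard HttpClient (Java 11+), with the same endpoint and url form field as the command above:

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Minimal sketch: submit a URL to the crawl API, mirroring the curl example above.
public class SubmitUrl {
    public static void main(String[] args) throws Exception {
        String form = "url=" + URLEncoder.encode(
                "https://news.ycombinator.com/newcomments", StandardCharsets.UTF_8);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9081/v1/send"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Print only the HTTP status; the exact response body is not documented here.
        System.out.println("HTTP status: " + response.statusCode());
    }
}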

The next step is to check whether you can access a resource on a known host using a given user-agent directive. For this purpose, you will need to use the checker API. Imagine that you need to check whether your crawler can access the newest resources from Hacker News. You would execute:

$ curl -X POST http://localhost:9080/v1/allowed \
       -d '{"url": "https://news.ycombinator.com/newest","agent": "AwesomeBot"}' \
       -H 'Content-Type: application/json'

The response will be:

{
  "url":"https://news.ycombinator.com/newest",
  "agent":"AwesomeBot",
  "allowed":true
}

This is like saying: Hey! You can crawl content from https://news.ycombinator.com/newest

When you finish your tests, execute the following command to stop and remove all Docker containers:

$ make stop-all

🔥 Happy Hacking! 🔥
