
PuerkitoBio / Fetchbot

License: BSD-3-Clause
A simple and flexible web crawler that follows the robots.txt policies and crawl delays.

Programming Languages

go
31211 projects - #10 most used programming language

Projects that are alternatives of or similar to Fetchbot

Fbcrawl
A Facebook crawler
Stars: ✭ 536 (-28.82%)
Mutual labels:  crawler
Easy Scraping Tutorial
Simple but useful Python web scraping tutorial code.
Stars: ✭ 583 (-22.58%)
Mutual labels:  crawler
Scrapyrt
HTTP API for Scrapy spiders
Stars: ✭ 637 (-15.41%)
Mutual labels:  crawler
Wechatsogou
A crawler API for WeChat official accounts, based on Sogou's WeChat search.
Stars: ✭ 5,220 (+593.23%)
Mutual labels:  crawler
Netdiscovery
NetDiscovery is a general-purpose crawler framework/middleware built on Vert.x, RxJava 2, and other frameworks.
Stars: ✭ 573 (-23.9%)
Mutual labels:  crawler
Baiduimagespider
An ultra-lightweight Baidu Images crawler.
Stars: ✭ 591 (-21.51%)
Mutual labels:  crawler
Xsrfprobe
The Prime Cross Site Request Forgery (CSRF) Audit and Exploitation Toolkit.
Stars: ✭ 532 (-29.35%)
Mutual labels:  crawler
Xalpha
A backtesting engine for fund investment management.
Stars: ✭ 683 (-9.3%)
Mutual labels:  crawler
Douyin
DouYin API for humans, used to crawl popular videos and music.
Stars: ✭ 580 (-22.97%)
Mutual labels:  crawler
Price Monitor
JD.com price monitor: tracks prices of user-specified products and sends email/WeChat alerts on price drops. Tech stack: Python crawler / IP proxy pool / JS API scraping / Selenium page scraping.
Stars: ✭ 634 (-15.8%)
Mutual labels:  crawler
Xxl Crawler
A distributed web crawler framework (XXL-CRAWLER).
Stars: ✭ 561 (-25.5%)
Mutual labels:  crawler
Filemasta
A search application to explore, discover and share online files
Stars: ✭ 571 (-24.17%)
Mutual labels:  crawler
Course Crawler
🎓 Course downloader for Chinese University MOOC, XuetangX, NetEase Cloud Classroom, CNMOOC, and iCourse.
Stars: ✭ 611 (-18.86%)
Mutual labels:  crawler
Headless Chrome Crawler
Distributed crawler powered by Headless Chrome
Stars: ✭ 5,129 (+581.14%)
Mutual labels:  crawler
Spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Stars: ✭ 656 (-12.88%)
Mutual labels:  crawler
Scrapy Redis
Redis-based components for Scrapy.
Stars: ✭ 4,998 (+563.75%)
Mutual labels:  crawler
Newcrawler
Free Web Scraping Tool with Java
Stars: ✭ 589 (-21.78%)
Mutual labels:  crawler
Magnet Dht
✌️ Python3 BitTorrent DHT crawler
Stars: ✭ 692 (-8.1%)
Mutual labels:  crawler
Grab Site
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
Stars: ✭ 680 (-9.69%)
Mutual labels:  crawler
Icrawler
A multi-threaded crawler framework with many built-in image crawlers.
Stars: ✭ 629 (-16.47%)
Mutual labels:  crawler

fetchbot

Package fetchbot provides a simple and flexible web crawler that follows the robots.txt policies and crawl delays.

It is very much a rewrite of gocrawl, with a simpler API, fewer built-in features, but at the same time more flexibility. As with Go itself, sometimes less is more!

Installation

To install, simply run in a terminal:

go get github.com/PuerkitoBio/fetchbot

The package has a single external dependency, robotstxt. It also integrates code from the iq package.

The API documentation is available on godoc.org.

Changes

  • 2019-09-11 (v1.2.0): update the robotstxt dependency (its import path/repo URL has changed, issue #31, thanks to @michael-stevens for raising the issue).
  • 2017-09-04 (v1.1.1): fix a goroutine leak when cancelling a Queue (issue #26, thanks to @ryu-koui for raising the issue).
  • 2017-07-06 (v1.1.0): add Queue.Done to get the done channel on the queue, allowing callers to wait in a select statement (thanks to @DennisDenuto).
  • 2015-07-25 (v1.0.0): add the Cancel method on the Queue, to close and drain without requesting any pending commands, unlike Close, which waits for all pending commands to be processed (thanks to @buro9 for the feature request).
  • 2015-07-24 : add HandlerCmd and call the Command's Handler function if it implements the Handler interface, bypassing the Fetcher's handler. Also support a Custom matcher on the Mux, using a predicate (thanks to @mmcdole for the feature requests).
  • 2015-06-18 : add the Scheme criterion on the muxer (thanks to @buro9).
  • 2015-06-10 : add the DisablePoliteness field on the Fetcher to optionally bypass robots.txt checks (thanks to @oli-g).
  • 2014-07-04 : change the type of Fetcher.HttpClient from *http.Client to the Doer interface. Low chance of breaking existing code, but it's a possibility if someone used the fetcher's client to run other requests (e.g. f.HttpClient.Get(...)). A sketch of a custom Doer follows.
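
Because of that last change, any type with the right Do method can stand in for the HTTP client. A minimal sketch of a logging wrapper, assuming Doer is the usual single-method interface Do(*http.Request) (*http.Response, error); the loggingDoer name is made up for this example, and it needs the log and net/http imports:

type loggingDoer struct {
	client *http.Client
}

// Do logs each request, then delegates to the wrapped client.
func (d loggingDoer) Do(req *http.Request) (*http.Response, error) {
	log.Printf("fetching %s %s", req.Method, req.URL)
	return d.client.Do(req)
}

// Then: f.HttpClient = loggingDoer{client: http.DefaultClient}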

Usage

The following example (taken from /example/short/main.go) shows how to create and start a Fetcher, one way to send commands, and how to stop the fetcher once all commands have been handled.

package main

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/fetchbot"
)

func main() {
	// Create a Fetcher whose responses are processed by handler.
	f := fetchbot.New(fetchbot.HandlerFunc(handler))
	// Start the Fetcher; it returns the Queue used to send commands.
	queue := f.Start()
	// Enqueue HEAD requests for a few URLs.
	queue.SendStringHead("http://google.com", "http://golang.org", "http://golang.org/doc")
	// Close the queue; this blocks until all pending commands are handled.
	queue.Close()
}

func handler(ctx *fetchbot.Context, res *http.Response, err error) {
	if err != nil {
		fmt.Printf("error: %s\n", err)
		return
	}
	fmt.Printf("[%d] %s %s\n", res.StatusCode, ctx.Cmd.Method(), ctx.Cmd.URL())
}

A more complex and complete example can be found in the repository, at /example/full/.

Fetcher

Basically, a Fetcher is an instance of a web crawler, independent of other Fetchers. It receives Commands via the Queue, executes the requests, and calls a Handler to process the responses. A Command is an interface that tells the Fetcher which URL to fetch, and which HTTP method to use (e.g. "GET", "HEAD", ...).

A call to Fetcher.Start() returns the Queue associated with this Fetcher. This is the thread-safe object that can be used to send commands, or to stop the crawler.
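
Since v1.1.0 the Queue also exposes Done (see the Changes section above), so a caller can wait for the crawl to end in a select statement, and since v1.0.0 it can Cancel instead of Close. A minimal sketch, assuming f is a Fetcher built as in the Usage example above; the 30-second deadline is arbitrary and needs the time import:

q := f.Start()
q.SendStringGet("http://golang.org")
// Close waits for all pending commands to be processed before closing
// the queue, so run it in its own goroutine and wait on Done instead.
go q.Close()
select {
case <-q.Done():
	fmt.Println("crawl finished")
case <-time.After(30 * time.Second):
	// Give up: Cancel closes and drains without requesting the
	// pending commands.
	q.Cancel()
}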

Both the Command and the Handler are interfaces, and may be implemented in various ways. They are defined like so:

type Command interface {
	URL() *url.URL
	Method() string
}
type Handler interface {
	Handle(*Context, *http.Response, error)
}

A Context is a struct that holds the Command and the Queue, so that the Handler always knows which Command initiated this call, and has a handle to the Queue.

A Handler is similar to the net/http Handler, and middleware-style combinations can be built on top of it. A HandlerFunc type is provided so that simple functions with the right signature can be used as Handlers (like net/http.HandlerFunc), and there is also a multiplexer Mux that can be used to dispatch calls to different Handlers based on some criteria.
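
For instance, here is a condensed sketch of a Mux, following the full example at /example/full/ (the handler bodies here simply print):

mux := fetchbot.NewMux()

// Handle all errors in one place.
mux.HandleErrors(fetchbot.HandlerFunc(func(ctx *fetchbot.Context, res *http.Response, err error) {
	fmt.Printf("[ERR] %s %s - %s\n", ctx.Cmd.Method(), ctx.Cmd.URL(), err)
}))

// Dispatch GET responses with an HTML content type to a dedicated handler.
mux.Response().Method("GET").ContentType("text/html").Handler(fetchbot.HandlerFunc(
	func(ctx *fetchbot.Context, res *http.Response, err error) {
		fmt.Printf("[%d] %s %s\n", res.StatusCode, ctx.Cmd.Method(), ctx.Cmd.URL())
	}))

// The Mux is itself a Handler, so it is passed directly to New.
f := fetchbot.New(mux)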

Command-related Interfaces

The Fetcher recognizes a number of interfaces that the Command may implement, for more advanced needs.

  • BasicAuthProvider: Implement this interface to specify the basic authentication credentials to set on the request.

  • CookiesProvider: If the Command implements this interface, the provided Cookies will be set on the request.

  • HeaderProvider: Implement this interface to specify the headers to set on the request.

  • ReaderProvider: Implement this interface to set the body of the request, via an io.Reader.

  • ValuesProvider: Implement this interface to set the body of the request, as form-encoded values. If the Content-Type is not specifically set via a HeaderProvider, it is set to "application/x-www-form-urlencoded". ReaderProvider and ValuesProvider should be mutually exclusive as they both set the body of the request. If both are implemented, the ReaderProvider interface is used.

  • Handler: Implement this interface if the Command's response should be handled by a specific callback function. By default, the response is handled by the Fetcher's Handler, but if the Command implements this, this handler function takes precedence and the Fetcher's Handler is ignored.

Since the Command is an interface, it can be a custom struct that holds additional information, such as an ID for the URL (e.g. from a database), or a depth counter so that the crawling stops at a certain depth, etc. For basic commands that don't require additional information, the package provides the Cmd struct that implements the Command interface. This is the Command implementation used by the various Queue.SendString* methods.
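
As an illustration, here is a hypothetical custom Command that carries a depth counter and also implements HeaderProvider to set extra headers on its requests. The DepthCmd name and fields are made up for this sketch, and the HeaderProvider method is assumed to be Header() http.Header:

// DepthCmd is a hypothetical GET Command that records its crawl depth.
type DepthCmd struct {
	U     *url.URL // parsed target URL
	Depth int      // crawl depth of this command
}

func (c *DepthCmd) URL() *url.URL  { return c.U }
func (c *DepthCmd) Method() string { return "GET" }

// Header implements the HeaderProvider interface (assumed signature),
// so the Fetcher sets these headers on the request.
func (c *DepthCmd) Header() http.Header {
	h := http.Header{}
	h.Set("Accept-Language", "en")
	return h
}

Inside a Handler, ctx.Cmd can be type-asserted back to *DepthCmd to read the depth, and newly discovered links can be enqueued at Depth+1 via the Queue held by the Context, stopping once the desired depth is reached.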

There is also a convenience HandlerCmd struct for the commands that should be handled by a specific callback function. It is a Command with a Handler interface implementation.

Fetcher Options

The Fetcher has a number of fields that provide further customization; a configuration sketch follows the list:

  • HttpClient : By default, the Fetcher uses the net/http default Client to make requests. A different client can be set on the Fetcher.HttpClient field.

  • CrawlDelay : That value is used only if there is no delay specified by the robots.txt of a given host.

  • UserAgent : Sets the user agent string to use for the requests and to validate against the robots.txt entries.

  • WorkerIdleTTL : Sets the duration that a worker goroutine can wait without receiving new commands to fetch. If the idle time-to-live is reached, the worker goroutine is stopped and its resources are released. This can be especially useful for long-running crawlers.

  • AutoClose : If true, closes the queue automatically once the number of active hosts reaches 0.

  • DisablePoliteness : If true, ignores the robots.txt policies of the hosts.
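
A sketch of a Fetcher configured through these fields (the values are arbitrary, the delay fields are time.Duration values, and handler is assumed to be defined as in the Usage example):

f := fetchbot.New(fetchbot.HandlerFunc(handler))
f.UserAgent = "MyBot/1.0 (+http://example.com/bot)"
// Fall back to a 5s delay when robots.txt specifies none.
f.CrawlDelay = 5 * time.Second
// Release idle host worker goroutines after 30s.
f.WorkerIdleTTL = 30 * time.Second
// Close the queue automatically once no host remains active.
f.AutoClose = true
// Any Doer works; *http.Client satisfies the interface.
f.HttpClient = &http.Client{Timeout: 10 * time.Second}
q := f.Start()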

What fetchbot doesn't do - especially compared to gocrawl - is that it doesn't keep track of already visited URLs, and it doesn't normalize the URLs. This is outside the scope of this package - all commands sent on the Queue will be fetched. Normalization can easily be done (e.g. using purell) before sending the Command to the Fetcher. How to keep track of visited URLs depends on the use-case of the specific crawler, but for an example, see /example/full/main.go.
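
For example, here is a minimal sketch of a gate in front of the Queue that normalizes with purell (github.com/PuerkitoBio/purell) and skips already-seen URLs. The visited map and the enqueue helper are made up for this sketch, and a concurrent crawler would guard the map with a mutex:

// visited maps normalized URLs that have already been enqueued.
var visited = make(map[string]bool)

// enqueue normalizes rawURL and sends it as a GET command, unless an
// equivalent URL has already been enqueued.
func enqueue(q *fetchbot.Queue, rawURL string) {
	norm, err := purell.NormalizeURLString(rawURL, purell.FlagsUsuallySafeGreedy)
	if err != nil {
		return // not a parseable URL, skip it
	}
	if visited[norm] {
		return
	}
	visited[norm] = true
	q.SendStringGet(norm)
}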

License

The BSD 3-Clause license, the same as the Go language. The iq package source code is under the CDDL-1.0 license (details in the source file).
