fredwu / Crawler

A high performance web crawler in Elixir.

Programming Languages

elixir (2628 projects)

Projects that are alternatives to, or similar to, Crawler

Fbcrawl
A Facebook crawler
Stars: ✭ 536 (-31.37%)
Mutual labels:  crawler, spider, scraper
Querylist
🕷️ The progressive PHP crawler framework! An elegant, progressive PHP scraping framework.
Stars: ✭ 2,392 (+206.27%)
Mutual labels:  crawler, spider, scraper
Not Your Average Web Crawler
A web crawler (for bug hunting) that gathers more than you can imagine.
Stars: ✭ 107 (-86.3%)
Mutual labels:  crawler, spider, scraper
Awesome Crawler
A collection of awesome web crawlers and spiders in different languages
Stars: ✭ 4,793 (+513.7%)
Mutual labels:  crawler, spider, scraper
Freshonions Torscraper
Fresh Onions is an open source TOR spider / hidden service onion crawler hosted at zlal32teyptf4tvi.onion
Stars: ✭ 348 (-55.44%)
Mutual labels:  crawler, spider, scraper
Geziyor
Geziyor, a fast web crawling & scraping framework for Go. Supports JS rendering.
Stars: ✭ 1,246 (+59.54%)
Mutual labels:  crawler, spider, scraper
Goribot
[Crawler/Scraper for Golang] 🕷 A lightweight, distributed-friendly Golang crawler framework.
Stars: ✭ 190 (-75.67%)
Mutual labels:  crawler, spider, scraper
Avbook
Adult video management system: a crawler for avmoo, javbus, and javlibrary; an online Japanese adult video library and magnet-link database.
Stars: ✭ 8,133 (+941.36%)
Mutual labels:  crawler, spider, scraper
Xcrawler
A fast, concise, and powerful PHP crawler framework.
Stars: ✭ 344 (-55.95%)
Mutual labels:  crawler, spider, scraper
arachnod
High performance crawler for Nodejs
Stars: ✭ 17 (-97.82%)
Mutual labels:  crawler, scraper, spider
Scrapit
Scraping scripts for various websites.
Stars: ✭ 25 (-96.8%)
Mutual labels:  crawler, spider, scraper
Spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Stars: ✭ 656 (-16.01%)
Mutual labels:  crawler, spider, scraper
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (-78.1%)
Mutual labels:  crawler, spider, scraper
Colly
Elegant Scraper and Crawler Framework for Golang
Stars: ✭ 15,535 (+1889.12%)
Mutual labels:  crawler, spider, scraper
Gosint
OSINT Swiss Army Knife
Stars: ✭ 401 (-48.66%)
Mutual labels:  crawler, spider, scraper
Crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
Stars: ✭ 440 (-43.66%)
Mutual labels:  crawler, spider, scraper
Haipproxy
💖 Highly available distributed IP proxy pool, powered by Scrapy and Redis
Stars: ✭ 4,993 (+539.31%)
Mutual labels:  crawler, spider
Grab Site
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
Stars: ✭ 680 (-12.93%)
Mutual labels:  crawler, spider
Ferret
Declarative web scraping
Stars: ✭ 4,837 (+519.33%)
Mutual labels:  crawler, scraper
Go jobs
An overview of the Golang job market.
Stars: ✭ 526 (-32.65%)
Mutual labels:  crawler, spider

Crawler

A high performance web crawler in Elixir, with worker pooling and rate limiting via OPQ.

Features

  • Crawl assets (JavaScript, CSS, and images).
  • Save to disk.
  • Hook for scraping content.
  • Restrict crawlable domains, paths or content types.
  • Limit concurrent crawlers.
  • Limit rate of crawling.
  • Set the maximum crawl depth.
  • Set timeouts.
  • Set the retry strategy.
  • Set the crawler's user agent.
  • Manually pause/resume/stop the crawler.

Architecture

[Architecture diagram: a very high-level view of how Crawler works.]

Usage

Crawler.crawl("http://elixir-lang.org", max_depths: 2)

There are several ways to access the crawled page data:

  1. Use Crawler.Store (see the sketch after this list)
  2. Tap into the registry, Crawler.Store.DB
  3. Use your own scraper
  4. If the :save_to option is set, pages will be saved to disk in addition to the above-mentioned places
  5. Provide your own custom parser and manage how data is stored and accessed yourself
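
For option 1, here is a minimal sketch. It assumes Crawler.Store exposes a find/1 lookup keyed by URL that returns a page struct with a body field; both names are assumptions based on the modules listed above, so check the hexdocs for the exact API.

Crawler.crawl("http://elixir-lang.org", max_depths: 2)

# Give the workers a moment to fetch, then look a page up.
# NOTE: find/1 and the body field are assumed for illustration.
case Crawler.Store.find("http://elixir-lang.org") do
  %{body: body} -> IO.puts(String.slice(body, 0, 100))
  nil -> IO.puts("page not crawled yet")
end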

Configurations

Option       Type     Default                    Description
:assets      list     []                         Whether to fetch asset files; available options: "css", "js", "images".
:save_to     string   nil                        When provided, the path for saving crawled pages.
:workers     integer  10                         Maximum number of concurrent workers for crawling.
:interval    integer  0                          Rate-limit control: the number of milliseconds to wait before crawling more pages; 0 means effectively no rate limit.
:max_depths  integer  3                          Maximum nested depth of pages to crawl.
:timeout     integer  5000                       Timeout for fetching a page, in milliseconds. Can also be set to :infinity, useful when combined with Crawler.pause/1.
:user_agent  string   Crawler/x.x.x (...)        User-Agent value sent with fetch requests.
:url_filter  module   Crawler.Fetcher.UrlFilter  Custom URL filter, useful for restricting crawlable domains, paths, or content types.
:retrier     module   Crawler.Fetcher.Retrier    Custom fetch retrier, useful for retrying failed crawls.
:modifier    module   Crawler.Fetcher.Modifier   Custom modifier, useful for adding custom request headers or options.
:scraper     module   Crawler.Scraper            Custom scraper, useful for scraping content as soon as the parser parses it.
:parser      module   Crawler.Parser             Custom parser, useful for handling parsing differently or adding extra functionality.
:encode_uri  boolean  false                      When set to true, applies URI.encode/1 to the URL before crawling.
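
As an illustration, a crawl that saves pages to disk, limits concurrency, and throttles requests could combine the options above like this (the values and the save path are made up):

Crawler.crawl(
  "http://elixir-lang.org",
  save_to: "/tmp/crawls",       # illustrative path
  workers: 5,                   # at most 5 concurrent workers
  interval: 500,                # wait 500ms between fetches
  max_depths: 2,                # follow links two levels deep
  assets: ["css", "images"]     # also fetch CSS and image files
)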

Custom Modules

It is possible to swap in your own logic as shown in the Configurations section. Your custom modules need to conform to their respective behaviours:

Retrier

See Crawler.Fetcher.Retrier.

Crawler uses ElixirRetry's exponential backoff strategy by default.

defmodule CustomRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec
end
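
As a sketch, a retrier that performs each fetch exactly once (i.e. disables retries) might look like the following. The perform/2 callback name and arguments are assumptions here; consult Crawler.Fetcher.Retrier.Spec for the actual contract.

defmodule CustomRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec

  # NOTE: callback name and signature assumed for illustration.
  # Performs the fetch once, with no retry or backoff.
  def perform(fetch_fun, _opts), do: fetch_fun.()
end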

URL Filter

See Crawler.Fetcher.UrlFilter.

defmodule CustomUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec
end
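
For example, a filter restricting the crawl to a single domain might look like this sketch. The filter/2 callback and its {:ok, boolean} return shape are assumptions modelled on the default Crawler.Fetcher.UrlFilter; verify against the Spec module.

defmodule CustomUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec

  # NOTE: callback name and return shape assumed for illustration.
  # {:ok, true} crawls the URL, {:ok, false} skips it.
  def filter(url, _opts) do
    {:ok, String.contains?(url, "elixir-lang.org")}
  end
end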

Scraper

See Crawler.Scraper.

defmodule CustomScraper do
  @behaviour Crawler.Scraper.Spec
end
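
Here is a sketch of a scraper that logs each page and passes it through unchanged; the scrape/1 callback and the page fields are assumptions, so check Crawler.Scraper.Spec for the real signature.

defmodule CustomScraper do
  @behaviour Crawler.Scraper.Spec

  # NOTE: callback and page fields assumed for illustration.
  def scrape(%{url: url, body: body} = page) do
    IO.puts("scraped #{url} (#{byte_size(body)} bytes)")
    {:ok, page}
  end
end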

Parser

See Crawler.Parser.

defmodule CustomParser do
  @behaviour Crawler.Parser.Spec
end
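
A custom parser would typically wrap the default parser so that link following keeps working; the parse/1 callback below is an assumption, see Crawler.Parser.Spec.

defmodule CustomParser do
  @behaviour Crawler.Parser.Spec

  # NOTE: callback name assumed for illustration.
  # Delegate to the default parser, adding custom logic around it.
  def parse(page) do
    # ... custom pre-processing could go here ...
    Crawler.Parser.parse(page)
  end
end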

Modifier

See Crawler.Fetcher.Modifier.

defmodule CustomModifier do
  @behaviour Crawler.Fetcher.Modifier.Spec
end
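
As a sketch, a modifier that injects a custom request header; the modify/2 callback name, arguments, and return value are all assumptions, so consult Crawler.Fetcher.Modifier.Spec for the real contract.

defmodule CustomModifier do
  @behaviour Crawler.Fetcher.Modifier.Spec

  # NOTE: callback name, arguments, and return value assumed
  # for illustration. Adds a custom header to every request.
  def modify(_url, _opts) do
    {:ok, [{"x-custom-header", "demo"}]}
  end
end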

Pause / Resume / Stop Crawler

Crawler provides pause/1, resume/1 and stop/1; see below.

{:ok, opts} = Crawler.crawl("http://elixir-lang.org")

Crawler.pause(opts)

Crawler.resume(opts)

Crawler.stop(opts)

Please note that when pausing Crawler, you need to set a large enough :timeout (or even set it to :infinity), otherwise the parser will time out due to unprocessed links.
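
For example, combining a :timeout of :infinity with pause/1 as suggested above:

{:ok, opts} = Crawler.crawl("http://elixir-lang.org", timeout: :infinity)

Crawler.pause(opts)

# ... the crawl queue is held while paused ...

Crawler.resume(opts)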

API Reference

Please see https://hexdocs.pm/crawler.

Changelog

Please see CHANGELOG.md.

License

Licensed under MIT.
