
watzon / Arachnid

License: MIT
Powerful web scraping framework for Crystal

Programming Languages

crystal (512 projects)

Projects that are alternatives of or similar to Arachnid

Spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Stars: ✭ 656 (+864.71%)
Mutual labels:  crawler, spider, web-scraping, web-scraper
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (+307.35%)
Mutual labels:  crawler, spider, web-scraping, crawling
Colly
Elegant Scraper and Crawler Framework for Golang
Stars: ✭ 15,535 (+22745.59%)
Mutual labels:  crawler, spider, crawling
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (+151.47%)
Mutual labels:  crawler, spider, crawling
Laravel Crawler Detect
A Laravel wrapper for CrawlerDetect - the web crawler detection library
Stars: ✭ 227 (+233.82%)
Mutual labels:  bot, crawler, spider
Skycaiji
SkyCaiji (Blue Sky Collector) is a free data collection and publishing crawler built with PHP + MySQL that can be deployed on cloud servers. It can scrape almost any type of web page, integrates seamlessly with all kinds of CMS platforms, publishes data in real time without login, and runs fully automatically without manual intervention. A completely cross-platform, cloud-based crawler system for web big-data collection.
Stars: ✭ 1,514 (+2126.47%)
Mutual labels:  crawler, spider, crawling
Scrapit
Scraping scripts for various websites.
Stars: ✭ 25 (-63.24%)
Mutual labels:  bot, crawler, spider
Instagram Bot
An Instagram bot developed using the Selenium Framework
Stars: ✭ 138 (+102.94%)
Mutual labels:  bot, crawler, crawling
Webster
a reliable high-level web crawling & scraping framework for Node.js.
Stars: ✭ 364 (+435.29%)
Mutual labels:  crawler, spider, crawling
Crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
Stars: ✭ 440 (+547.06%)
Mutual labels:  crawler, spider, crawling
flink-crawler
Continuous scalable web crawler built on top of Flink and crawler-commons
Stars: ✭ 48 (-29.41%)
Mutual labels:  crawler, spider, crawling
Scrapple
A framework for creating semi-automatic web content extractors
Stars: ✭ 464 (+582.35%)
Mutual labels:  crawler, web-scraping, web-scraper
Awesome Crawler
A collection of awesome web crawlers and spiders in different languages
Stars: ✭ 4,793 (+6948.53%)
Mutual labels:  crawler, spider, web-scraper
Torbot
Dark Web OSINT Tool
Stars: ✭ 821 (+1107.35%)
Mutual labels:  crawler, spider
Zhihu Crawler
zhihu-crawler is a high-performance, Java-based distributed crawler project with support for a free HTTP proxy pool and horizontal scaling
Stars: ✭ 890 (+1208.82%)
Mutual labels:  crawler, spider
Maman
Rust Web Crawler saving pages on Redis
Stars: ✭ 39 (-42.65%)
Mutual labels:  crawler, spider
Lulu
[Unmaintained] A simple and clean video/music/image downloader 👾
Stars: ✭ 789 (+1060.29%)
Mutual labels:  crawler, crawling
Nodespider
[DEPRECATED] Simple, flexible, delightful web crawler/spider package
Stars: ✭ 33 (-51.47%)
Mutual labels:  crawler, spider
Lizard
💐 Full Amazon Automatic Download
Stars: ✭ 41 (-39.71%)
Mutual labels:  crawler, spider
Crawlab
Distributed web crawler admin platform for managing spiders, regardless of language or framework
Stars: ✭ 8,392 (+12241.18%)
Mutual labels:  crawler, spider

Arachnid

Arachnid is a fast web crawler for Crystal, with multi-threading support planned. It recently underwent a full rewrite for Crystal 0.35.1, so see the documentation below for updated usage instructions.

Installation

  1. Add the dependency to your shard.yml:

    dependencies:
      arachnid:
        github: watzon/arachnid
    
  2. Run shards install

Usage

First, of course, you need to require arachnid in your project:

require "arachnid"

The Agent

Agent is the class that does all the heavy lifting and will be the main one you interact with. To create a new Agent, use Agent.new.

agent = Arachnid::Agent.new

The initialize method takes a number of optional parameters, described below (a combined example follows the list):

:client

You can, if you wish, supply your own HTTP::Client instance to the Agent. This can be useful if you want to use a proxy, provided the proxy client extends HTTP::Client.

:user_agent

The user agent to be added to every request header. You can override this with :default_headers, or on a per-host basis with :host_headers.

:default_headers

The default headers to be used in every request.

:host_headers

Headers to be applied on a per-host basis, as a hash of String (host name) => HTTP::Headers.

:queue

The Arachnid::Queue instance to use for storing links waiting to be processed. The default is a MemoryQueue (which is the only one for now), but you can easily implement your own Queue using whatever you want as a backend.

:stop_on_empty

Whether or not to stop running when the queue is empty. This is true by default. If set to false, the loop will continue even after the queue empties, so be sure you have a way to keep adding items to it.

:follow_redirects

Whether or not to follow redirects (add them to the queue).
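
For example, combining several of the options above (a sketch; the header and user-agent values are illustrative, while the parameter names follow the list above):

agent = Arachnid::Agent.new(
  user_agent: "MyCrawler/1.0",
  default_headers: HTTP::Headers{"Accept" => "text/html"},
  host_headers: {"crystal-lang.org" => HTTP::Headers{"X-Example" => "1"}},
  stop_on_empty: true,
  follow_redirects: true
)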

Starting the Agent

There are four ways to start your Agent once it's been created. Here are some examples:

#start_at

#start_at starts the Agent running on a particular URL. It adds a single URL to the queue and starts there.

agent.start_at("https://crystal-lang.org") do
  # ...
end

#site

#site starts the agent running at the given URL and adds a rule that restricts it to the given site, allowing it to scan that domain and any of its subdomains. For instance:

agent.site("https://crystal-lang.org") do
  # ...
end

The above will match crystal-lang.org and forum.crystal-lang.org, but not github.com/crystal-lang or any other site not within the *.crystal-lang.org space.

#host

#host is like #site, but with the added restriction of remaining on the exact host. Subdomains are not included.

agent.host("crystal-lang.org") do
  # ...
end

#start

Provided you already have URIs in the queue ready to be scanned, you can also just use #start to start the Agent running.

agent.enqueue("https://crystal-lang.org")
agent.enqueue("https://kemalcr.com")
agent.start

Filters

URIs can be filtered before being enqueued. There are two kinds of filters: accept and reject. Accept filters ensure a URI matches before it is enqueued; reject filters do the opposite, keeping URIs from being enqueued if they do match.

For instance:

# This will filter out all sites where the host is not "crystal-lang.org"
agent.accept_filter { |uri| uri.host == "crystal-lang.org" }

To skip certain paths that would otherwise pass the above filter:

# This will ignore paths starting with "/api"
agent.reject_filter { |uri| uri.path.to_s.starts_with?("/api") }

The #site and #host methods add a default accept filter in order to keep things in the given site or host.
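
Putting the two together, a crawl restricted to a site that also skips its API paths might look like this (a sketch, assuming filters are registered before the crawl starts):

agent = Arachnid::Agent.new
# #site adds the accept filter keeping us on *.crystal-lang.org;
# this reject filter additionally skips paths under /api
agent.reject_filter { |uri| uri.path.to_s.starts_with?("/api") }
agent.site("https://crystal-lang.org") do
  # ...
end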

Resources

All the above is useless if you can't do anything with the scanned resources, which is why we have the Resource class. Every scanned resource is converted into a Resource (or subclass) based on the content type. For instance, text/html becomes a Resource::HTML which is parsed using kostya/myhtml for extra speed.

Each resource has an associated Agent#on_ method so you can do something when one of those resources is scanned:

agent.on_html do |page|
  puts typeof(page)
  # => Arachnid::Resource::HTML

  puts page.title
  # => The Title of the Page
end

Currently we have:

  • #on_html
  • #on_image
  • #on_script
  • #on_stylesheet
  • #on_xml

There is also #on_resource, which is called for every resource, including ones that don't match the above types. All resources include, at minimum, the URI at which the resource was found and the response (HTTP::Client::Response) instance.
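
For example, a minimal sketch that logs every resource the agent scans, relying only on the uri and response accessors described above:

agent.on_resource do |resource|
  # Every resource carries the URI it was found at and the raw response
  puts "#{resource.uri} (#{resource.response.status_code})"
end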

Contributing

  1. Fork it (https://github.com/watzon/arachnid/fork)
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request
