postmodern / Spidr

License: MIT
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

Programming Languages

ruby
36898 projects - #4 most used programming language

Projects that are alternatives of or similar to Spidr

Awesome Crawler
A collection of awesome web crawlers and spiders in different languages
Stars: ✭ 4,793 (+630.64%)
Mutual labels:  crawler, spider, scraper, web-crawler, web-scraper
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (-57.77%)
Mutual labels:  crawler, spider, web-scraping, web-crawler
OLX Scraper
📻 An OLX scraper using Scrapy + MongoDB. It scrapes recently posted ads for the requested product and dumps them to MongoDB (NoSQL).
Stars: ✭ 15 (-97.71%)
Mutual labels:  scraper, web-crawler, web-scraper, web-scraping
Arachnid
Powerful web scraping framework for Crystal
Stars: ✭ 68 (-89.63%)
Mutual labels:  crawler, spider, web-scraping, web-scraper
papercut
Papercut is a scraping/crawling library for Node.js built on top of JSDOM. It provides basic selector features together with features like Page Caching and Geosearch.
Stars: ✭ 15 (-97.71%)
Mutual labels:  crawler, scraper, web-scraping
Linkedin-Client
Web scraper for grabbing data from LinkedIn profiles or company pages (personal project)
Stars: ✭ 42 (-93.6%)
Mutual labels:  scraper, web-scraper, web-scraping
Scrapple
A framework for creating semi-automatic web content extractors
Stars: ✭ 464 (-29.27%)
Mutual labels:  crawler, web-scraping, web-scraper
Querylist
🕷️ The elegant, progressive PHP crawler framework!
Stars: ✭ 2,392 (+264.63%)
Mutual labels:  crawler, spider, scraper
flink-crawler
Continuous scalable web crawler built on top of Flink and crawler-commons
Stars: ✭ 48 (-92.68%)
Mutual labels:  crawler, spider, web-crawler
arachnod
High-performance crawler for Node.js
Stars: ✭ 17 (-97.41%)
Mutual labels:  crawler, scraper, spider
Freshonions Torscraper
Fresh Onions is an open source TOR spider / hidden service onion crawler hosted at zlal32teyptf4tvi.onion
Stars: ✭ 348 (-46.95%)
Mutual labels:  crawler, spider, scraper
TikTokDownloader PyWebIO
🚀 Douyin_TikTok_Download_API: an out-of-the-box, high-performance asynchronous Douyin/TikTok data scraping tool that supports API calls, online batch parsing, and downloading.
Stars: ✭ 919 (+40.09%)
Mutual labels:  scraper, spider, web-scraping
ant
A web crawler for Go
Stars: ✭ 264 (-59.76%)
Mutual labels:  scraper, spider, web-crawler
Colly
Elegant Scraper and Crawler Framework for Golang
Stars: ✭ 15,535 (+2268.14%)
Mutual labels:  crawler, spider, scraper
Xcrawler
A fast, concise, and powerful PHP crawler framework.
Stars: ✭ 344 (-47.56%)
Mutual labels:  crawler, spider, scraper
Fbcrawl
A Facebook crawler
Stars: ✭ 536 (-18.29%)
Mutual labels:  crawler, spider, scraper
Gosint
OSINT Swiss Army Knife
Stars: ✭ 401 (-38.87%)
Mutual labels:  crawler, spider, scraper
Zhihu Crawler People
A simple distributed crawler for Zhihu, with data analysis
Stars: ✭ 182 (-72.26%)
Mutual labels:  crawler, spider, web-crawler
Goribot
[Crawler/Scraper for Golang] 🕷 A lightweight, distributed-friendly Golang crawler framework.
Stars: ✭ 190 (-71.04%)
Mutual labels:  crawler, spider, scraper
Autoscraper
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
Stars: ✭ 4,077 (+521.49%)
Mutual labels:  crawler, scraper, web-scraping

Spidr


Description

Spidr is a versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

Features

  • Follows:
    • a tags.
    • iframe tags.
    • frame tags.
    • Cookie protected links.
    • HTTP 300, 301, 302, 303 and 307 Redirects.
    • Meta-Refresh Redirects.
    • HTTP Basic Auth protected links.
  • Black-list or white-list URLs based upon:
    • URL scheme.
    • Host name.
    • Port number.
    • Full link.
    • URL extension.
  • Optional /robots.txt support.
  • Provides callbacks for:
    • Every visited Page.
    • Every visited URL.
    • Every visited URL that matches a specified pattern.
    • Every origin and destination URI of a link.
    • Every URL that failed to be visited.
  • Provides action methods to:
    • Pause spidering.
    • Skip processing of pages.
    • Skip processing of links.
  • Restore the spidering queue and history from a previous session.
  • Custom User-Agent strings.
  • Custom proxy settings.
  • HTTPS support.

Examples

Start spidering from a URL:

Spidr.start_at('http://tenderlovemaking.com/')

Spider a host:

Spidr.host('solnic.eu')

Spider a site:

Spidr.site('http://www.rubyflow.com/')

Spider multiple hosts:

Spidr.start_at(
  'http://company.com/',
  hosts: [
    'company.com',
    /host[\d]+\.company\.com/
  ]
)

Do not spider certain links:

Spidr.site('http://company.com/', ignore_links: [%r{^/blog/}])

Do not spider links on certain ports:

Spidr.site('http://company.com/', ignore_ports: [8000, 8010, 8080])

Do not spider links blacklisted in robots.txt:

Spidr.site(
  'http://company.com/',
  robots: true
)

Print out visited URLs:

Spidr.site('http://www.rubyinside.com/') do |spider|
  spider.every_url { |url| puts url }
end

Build a URL map of a site:

url_map = Hash.new { |hash,key| hash[key] = [] }

Spidr.site('http://intranet.com/') do |spider|
  spider.every_link do |origin,dest|
    url_map[dest] << origin
  end
end

Print out the URLs that could not be requested:

Spidr.site('http://company.com/') do |spider|
  spider.every_failed_url { |url| puts url }
end

Find all pages that have broken links:

url_map = Hash.new { |hash,key| hash[key] = [] }

spider = Spidr.site('http://intranet.com/') do |spider|
  spider.every_link do |origin,dest|
    url_map[dest] << origin
  end
end

spider.failures.each do |url|
  puts "Broken link #{url} found in:"

  url_map[url].each { |page| puts "  #{page}" }
end

Search HTML and XML pages:

Spidr.site('http://company.com/') do |spider|
  spider.every_page do |page|
    puts ">>> #{page.url}"

    page.search('//meta').each do |meta|
      name = (meta.attributes['name'] || meta.attributes['http-equiv'])
      value = meta.attributes['content']

      puts "  #{name} = #{value}"
    end
  end
end

Print out the titles from every page:

Spidr.site('https://www.ruby-lang.org/') do |spider|
  spider.every_html_page do |page|
    puts page.title
  end
end

Find what kinds of web servers a host is using, by accessing the headers:

require 'set'

servers = Set[]

Spidr.host('company.com') do |spider|
  spider.all_headers do |headers|
    servers << headers['server']
  end
end

Pause the spider on a forbidden page:

Spidr.host('company.com') do |spider|
  spider.every_forbidden_page do |page|
    spider.pause!
  end
end
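
A paused spider can be resumed later. Below is a minimal sketch, assuming that Spidr.host returns the underlying Spidr::Agent and that the agent responds to continue!; check the Spidr::Agent API before relying on either assumption:

agent = Spidr.host('company.com') do |spider|
  spider.every_forbidden_page do |page|
    spider.pause!
  end
end

# Inspect what was collected so far, e.g. agent.failures.

agent.continue!  # assumption: resumes spidering from where it was paused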

Skip the processing of a page:

Spidr.host('company.com') do |spider|
  spider.every_missing_page do |page|
    spider.skip_page!
  end
end

Skip the processing of links:

Spidr.host('company.com') do |spider|
  spider.every_url do |url|
    if url.path.split('/').find { |dir| dir.to_i > 1000 }
      spider.skip_link!
    end
  end
end
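
Set a custom User-Agent and proxy (a minimal sketch of the User-Agent and proxy features listed above; the user_agent: and proxy: option names and the proxy.example.com host are assumptions for illustration, so confirm them against the Spidr::Agent documentation):

Spidr.site(
  'http://company.com/',
  user_agent: 'MyCrawler/1.0',                           # assumed option name
  proxy:      { host: 'proxy.example.com', port: 8080 }  # assumed option format
)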

Requirements

  • nokogiri (HTML/XML parsing)

Install

$ gem install spidr
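
Or add it to a project's Gemfile and install it with Bundler:

gem 'spidr'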

License

Copyright (c) 2008-2016 Hal Brodigan

See {file:LICENSE.txt} for license information.
