TurnerSoftware / Infinitycrawler

Licence: MIT
A simple but powerful web crawler library for .NET

Infinity Crawler

A simple but powerful web crawler library in C#

Features

  • Obeys robots.txt (crawl delay & allow/disallow)
  • Obeys in-page robots rules (X-Robots-Tag header and <meta name="robots" /> tag)
  • Uses sitemap.xml to seed the initial crawl of the site
  • Built around a parallel task async/await system
  • Swappable request and content processors, allowing greater customisation
  • Auto-throttling (see below)
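
To make the in-page robots rules more concrete, here is a minimal sketch of how the value of an X-Robots-Tag header, or the content attribute of a <meta name="robots" /> tag, might be interpreted. The RobotsDirectives type and its members are hypothetical names chosen for illustration and are not part of InfinityCrawler's API; only the directive names (noindex, nofollow, none) are standard. The sketch ignores user-agent-prefixed values such as "googlebot: noindex".

```csharp
using System;
using System.Linq;

// Usage: interpret the value of an X-Robots-Tag header or of a
// <meta name="robots" content="..."> attribute.
var directives = RobotsDirectives.Parse("noindex, nofollow");
Console.WriteLine($"CanIndex={directives.CanIndex}, CanFollow={directives.CanFollow}");

// Hypothetical helper type for illustration; not part of InfinityCrawler.
public sealed class RobotsDirectives
{
    // Both actions are allowed unless a directive says otherwise.
    public bool CanIndex { get; private set; } = true;
    public bool CanFollow { get; private set; } = true;

    public static RobotsDirectives Parse(string value)
    {
        var result = new RobotsDirectives();

        // Directives are comma-separated and case-insensitive.
        var tokens = (value ?? string.Empty)
            .Split(',')
            .Select(token => token.Trim().ToLowerInvariant());

        foreach (var token in tokens)
        {
            switch (token)
            {
                case "noindex":
                    result.CanIndex = false;
                    break;
                case "nofollow":
                    result.CanFollow = false;
                    break;
                case "none": // shorthand for "noindex, nofollow"
                    result.CanIndex = false;
                    result.CanFollow = false;
                    break;
            }
        }

        return result;
    }
}
```

Unknown directives are ignored here, so a page is treated as crawlable unless it explicitly opts out.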

Polite Crawling

The crawler is built around fast but "polite" crawling of websites. This is accomplished through a number of settings that adjust request delays and throttling.

You can control:

  • Number of simultaneous requests
  • The delay between requests starting (Note: If a crawl-delay is defined for the User-agent, that will be the minimum)
  • Artificial "jitter" in request delays (requests seem less "robotic")
  • Request timeout: how long a request may take before throttling is applied to new requests
  • Throttling request backoff: The amount of time added to the delay to throttle requests (this is cumulative)
  • Minimum number of requests under the throttle timeout before the throttle is gradually removed
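
The settings above interact, so a rough sketch of one way they could combine into a per-request delay may help. This is not InfinityCrawler's implementation: the PoliteThrottle class, its parameter names, and the rule of removing one backoff step after a run of fast requests are all assumptions made for illustration.

```csharp
using System;

var throttle = new PoliteThrottle(
    delayBetweenRequests: TimeSpan.FromSeconds(1.0),
    crawlDelay: TimeSpan.FromSeconds(2.0),      // from robots.txt, acts as a minimum
    maxJitter: TimeSpan.FromSeconds(0.5),
    throttleTimeout: TimeSpan.FromSeconds(5.0),
    backoffStep: TimeSpan.FromSeconds(1.0),
    fastRequestsBeforeRelaxing: 3);

throttle.RecordRequest(TimeSpan.FromSeconds(6.0)); // slower than the timeout
Console.WriteLine(throttle.NextDelay(new Random()));

// Hypothetical type for illustration; not part of InfinityCrawler.
public sealed class PoliteThrottle
{
    private readonly TimeSpan _baseDelay;
    private readonly TimeSpan _maxJitter;
    private readonly TimeSpan _throttleTimeout;
    private readonly TimeSpan _backoffStep;
    private readonly int _fastRequestsBeforeRelaxing;
    private int _fastRequestsInARow;

    public PoliteThrottle(TimeSpan delayBetweenRequests, TimeSpan crawlDelay,
        TimeSpan maxJitter, TimeSpan throttleTimeout, TimeSpan backoffStep,
        int fastRequestsBeforeRelaxing)
    {
        // A crawl-delay from robots.txt is the floor for the configured delay.
        _baseDelay = crawlDelay > delayBetweenRequests ? crawlDelay : delayBetweenRequests;
        _maxJitter = maxJitter;
        _throttleTimeout = throttleTimeout;
        _backoffStep = backoffStep;
        _fastRequestsBeforeRelaxing = fastRequestsBeforeRelaxing;
    }

    // Extra delay currently added by throttling.
    public TimeSpan CurrentBackoff { get; private set; } = TimeSpan.Zero;

    // Call once per completed request with how long it took.
    public void RecordRequest(TimeSpan elapsed)
    {
        if (elapsed > _throttleTimeout)
        {
            // Slow response: backoff is cumulative, one step per slow request.
            CurrentBackoff += _backoffStep;
            _fastRequestsInARow = 0;
        }
        else if (CurrentBackoff > TimeSpan.Zero
            && ++_fastRequestsInARow >= _fastRequestsBeforeRelaxing)
        {
            // Enough fast responses in a row: remove one step (assumed rule).
            CurrentBackoff -= _backoffStep;
            _fastRequestsInARow = 0;
        }
    }

    // Delay to wait before starting the next request.
    public TimeSpan NextDelay(Random random)
    {
        var jitter = TimeSpan.FromMilliseconds(
            random.NextDouble() * _maxJitter.TotalMilliseconds);
        return _baseDelay + CurrentBackoff + jitter;
    }
}
```

With these example values, the delay after one slow request falls between 3 and 3.5 seconds: the 2 second crawl-delay floor, plus one 1 second backoff step, plus up to 0.5 seconds of jitter.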

Other Settings

  • Control the UserAgent used in the crawling process
  • Set additional host aliases you want the crawling process to follow (for example, subdomains)
  • The max number of retries for a specific URI
  • The max number of redirects to follow
  • The max number of pages to crawl
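
As an illustration of what host aliases mean in practice, the sketch below shows one way a crawler could decide whether a discovered link is in scope. The CrawlScope type is a hypothetical name used only for this example and is not part of InfinityCrawler's API; it assumes an alias is matched against the exact host name, case-insensitively.

```csharp
using System;
using System.Collections.Generic;

var scope = new CrawlScope(
    new Uri("http://example.org/"),
    hostAliases: new[] { "www.example.org", "blog.example.org" });

// A subdomain listed as an alias is followed; an unrelated host is not.
Console.WriteLine(scope.IsInScope(new Uri("http://blog.example.org/post/1")));
Console.WriteLine(scope.IsInScope(new Uri("http://other.example.com/")));

// Hypothetical type for illustration; not part of InfinityCrawler.
public sealed class CrawlScope
{
    private readonly HashSet<string> _allowedHosts;

    public CrawlScope(Uri baseUri, IEnumerable<string> hostAliases)
    {
        // The base host is always allowed, alongside any configured aliases.
        _allowedHosts = new HashSet<string>(hostAliases, StringComparer.OrdinalIgnoreCase)
        {
            baseUri.Host
        };
    }

    // True when the link points at the base host or one of its aliases.
    public bool IsInScope(Uri candidate) =>
        candidate.IsAbsoluteUri && _allowedHosts.Contains(candidate.Host);
}
```

Exact matching keeps the scope explicit: a subdomain is only crawled when it has been listed as an alias.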

Example Usage

using System;
using InfinityCrawler;

// Crawl the site with a custom user agent and at most 5 simultaneous requests
var crawler = new Crawler();
var result = await crawler.Crawl(new Uri("http://example.org/"), new CrawlSettings
{
	UserAgent = "MyVeryOwnWebCrawler/1.0",
	RequestProcessorOptions = new RequestProcessorOptions
	{
		MaxNumberOfSimultaneousRequests = 5
	}
});