TurnerSoftware / Infinitycrawler

Licence: MIT

A simple but powerful web crawler library for .NET

Labels

hacktoberfest crawler web-crawler

Projects that are alternatives to or similar to Infinitycrawler

flink-crawler

Continuous scalable web crawler built on top of Flink and crawler-commons

Stars: ✭ 48 (-50.52%)

Mutual labels: crawler, web-crawler

Supercrawler

A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.

Stars: ✭ 306 (+215.46%)

Mutual labels: crawler, web-crawler

CrawlBox

An easy way to brute-force web directories.

Stars: ✭ 118 (+21.65%)

Mutual labels: crawler, web-crawler

Crawler Detect

🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

Stars: ✭ 1,549 (+1496.91%)

Mutual labels: hacktoberfest, crawler

Spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

Stars: ✭ 656 (+576.29%)

Mutual labels: crawler, web-crawler

Scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.

Stars: ✭ 42,343 (+43552.58%)

Mutual labels: hacktoberfest, crawler

Gopa

[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn

Stars: ✭ 277 (+185.57%)

Mutual labels: crawler, web-crawler

Abot

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.

Stars: ✭ 1,961 (+1921.65%)

Mutual labels: crawler, web-crawler

Awesome Crawler

A collection of awesome web crawlers and spiders in different languages

Stars: ✭ 4,793 (+4841.24%)

Mutual labels: crawler, web-crawler

Ferret

Declarative web scraping

Stars: ✭ 4,837 (+4886.6%)

Mutual labels: hacktoberfest, crawler

Strong Web Crawler

An advanced web crawler built on C#.NET, PhantomJS and Selenium. It can execute JavaScript code, trigger various events, and manipulate the page DOM structure.

Stars: ✭ 238 (+145.36%)

Mutual labels: crawler, web-crawler

Crawlab

Distributed web crawler admin platform for managing spiders regardless of language or framework.

Stars: ✭ 8,392 (+8551.55%)

Mutual labels: crawler, web-crawler

Antch

Antch, a fast, powerful and extensible web crawling & scraping framework for Go

Stars: ✭ 198 (+104.12%)

Mutual labels: crawler, web-crawler

Skrape.it

A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.

Stars: ✭ 231 (+138.14%)

Mutual labels: hacktoberfest, crawler

Zhihu Crawler People

A simple distributed crawler for zhihu && data analysis

Stars: ✭ 182 (+87.63%)

Mutual labels: crawler, web-crawler

Spidy

The simple, easy to use command line web crawler.

Stars: ✭ 257 (+164.95%)

Mutual labels: crawler, web-crawler

Pspider

A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

Stars: ✭ 1,611 (+1560.82%)

Mutual labels: crawler, web-crawler

Crawlab Lite

Lite version of Crawlab, a lightweight crawler management platform.

Stars: ✭ 122 (+25.77%)

Mutual labels: crawler, web-crawler

Spider Flow

A next-generation crawler platform that defines crawler flows graphically, so crawlers can be built without writing code.

Stars: ✭ 365 (+276.29%)

Mutual labels: crawler, web-crawler

Maman

Rust Web Crawler saving pages on Redis

Stars: ✭ 39 (-59.79%)

Mutual labels: crawler, web-crawler

Infinity Crawler

A simple but powerful web crawler library in C#

Features

Obeys robots.txt (crawl delay & allow/disallow)
Obeys in-page robots rules (X-Robots-Tag header and <meta name="robots" /> tag)
Uses sitemap.xml to seed the initial crawl of the site
Built around a parallel task async/await system
Swappable request and content processors, allowing greater customisation
Auto-throttling (see below)

Polite Crawling

The crawler is built around fast but "polite" crawling of websites. This is accomplished through a number of settings that adjust delays and throttles.

You can control:

Number of simultaneous requests
The delay between requests starting (Note: If a crawl-delay is defined for the User-agent, that will be the minimum)
Artificial "jitter" in request delays (requests seem less "robotic")
How long a request may take before throttling applies to new requests
Throttling request backoff: The amount of time added to the delay to throttle requests (this is cumulative)
Minimum number of requests under the throttle timeout before the throttle is gradually removed
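
As a rough illustration of how these settings could interact, here is a minimal, self-contained sketch. It does not use InfinityCrawler's own types: every name in it (baseDelay, crawlDelay, maxJitter, throttleTimeout, backoff, successesBeforeEasing, RecordRequest, NextDelay) is hypothetical, and easing the throttle by one backoff step at a time is an assumption about what "gradually removed" means.

```csharp
using System;

// Hypothetical settings; these are NOT InfinityCrawler's property names.
var baseDelay = TimeSpan.FromMilliseconds(1000);   // delay between requests starting
var crawlDelay = TimeSpan.FromMilliseconds(2000);  // crawl-delay from robots.txt, if any
var maxJitter = TimeSpan.FromMilliseconds(500);    // upper bound for random jitter
var throttleTimeout = TimeSpan.FromSeconds(5);     // slower requests trigger throttling
var backoff = TimeSpan.FromSeconds(1);             // added to the delay per slow request
var successesBeforeEasing = 3;                     // fast requests needed to ease off

var throttle = TimeSpan.Zero;  // cumulative extra delay
var fastStreak = 0;            // consecutive requests under the throttle timeout

// Update the throttle once a request has completed.
void RecordRequest(TimeSpan elapsed)
{
	if (elapsed > throttleTimeout)
	{
		throttle += backoff;   // cumulative backoff
		fastStreak = 0;
	}
	else if (throttle > TimeSpan.Zero && ++fastStreak >= successesBeforeEasing)
	{
		throttle -= backoff;   // assumed: ease off one step at a time
		fastStreak = 0;
	}
}

// Delay to wait before starting the next request.
TimeSpan NextDelay(Random rng)
{
	// The crawl-delay acts as the minimum delay.
	var minimum = baseDelay > crawlDelay ? baseDelay : crawlDelay;
	var jitter = TimeSpan.FromMilliseconds(rng.NextDouble() * maxJitter.TotalMilliseconds);
	return minimum + jitter + throttle;
}

RecordRequest(TimeSpan.FromSeconds(6));  // slow: throttle is now 1s
RecordRequest(TimeSpan.FromSeconds(7));  // slow: throttle is now 2s
var delay = NextDelay(new Random());
Console.WriteLine(delay.TotalMilliseconds);  // at least 4000 and below 4500
```

With these example values, three consecutive fast requests would bring the throttle back down from 2s to 1s, and three more would remove it entirely.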

Other Settings

Control the UserAgent used in the crawling process
Set additional host aliases you want the crawling process to follow (for example, subdomains)
The max number of retries for a specific URI
The max number of redirects to follow
The max number of pages to crawl
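
To make the host alias setting more concrete, the following sketch shows one way a crawler might decide whether a discovered link is in scope. IsInScope is a hypothetical helper written for this example, not part of InfinityCrawler, and comparing hosts case-insensitively is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: a link is followed if its host matches the
// start URI's host or one of the configured aliases.
static bool IsInScope(Uri candidate, Uri baseUri, ISet<string> hostAliases)
{
	return candidate.Host.Equals(baseUri.Host, StringComparison.OrdinalIgnoreCase)
		|| hostAliases.Contains(candidate.Host);
}

var baseUri = new Uri("http://example.org/");
var aliases = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "www.example.org" };

Console.WriteLine(IsInScope(new Uri("http://www.example.org/about"), baseUri, aliases)); // True
Console.WriteLine(IsInScope(new Uri("http://other.example.com/"), baseUri, aliases));    // False
```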

Example Usage

using System;
using InfinityCrawler;

// Crawl example.org with a custom user agent and at most 5 simultaneous requests.
var crawler = new Crawler();
var result = await crawler.Crawl(new Uri("http://example.org/"), new CrawlSettings
{
	UserAgent = "MyVeryOwnWebCrawler/1.0",
	RequestProcessorOptions = new RequestProcessorOptions
	{
		MaxNumberOfSimultaneousRequests = 5
	}
});

Stars: ✭ 97
