sjdirect / Abot

Licence: apache-2.0
Cross-platform C# web crawler framework built for speed and flexibility. Please star this project! +1.

Programming Languages

C#
18002 projects

Projects that are alternatives of or similar to Abot

Abotx
Cross-platform C# web crawler framework, headless browser, parallel crawler. Please star this project! +1.
Stars: ✭ 63 (-96.79%)
Mutual labels:  spider, web-crawler, cross-platform, netcore
SockNet
The easiest and fastest way to work with sockets in C#
Stars: ✭ 42 (-97.86%)
Mutual labels:  netcore, netcore2, netcore3
Pspider
A simple and easy-to-use Python crawler framework (QQ group: 597510560)
Stars: ✭ 1,611 (-17.85%)
Mutual labels:  crawler, spider, web-crawler
Crawlab Lite
Lite version of Crawlab, a lightweight crawler management platform
Stars: ✭ 122 (-93.78%)
Mutual labels:  crawler, spider, web-crawler
Zhihu Crawler People
A simple distributed crawler for Zhihu, with data analysis
Stars: ✭ 182 (-90.72%)
Mutual labels:  crawler, spider, web-crawler
flink-crawler
Continuous scalable web crawler built on top of Flink and crawler-commons
Stars: ✭ 48 (-97.55%)
Mutual labels:  crawler, spider, web-crawler
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (-85.87%)
Mutual labels:  crawler, spider, web-crawler
Creeper
🐾 Creeper - The Next Generation Crawler Framework (Go)
Stars: ✭ 762 (-61.14%)
Mutual labels:  crawler, spider, cross-platform
Spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Stars: ✭ 656 (-66.55%)
Mutual labels:  crawler, spider, web-crawler
Awesome Crawler
A collection of awesome web crawlers and spiders in different languages
Stars: ✭ 4,793 (+144.42%)
Mutual labels:  crawler, spider, web-crawler
Spider Flow
A new-generation crawler platform that defines crawl flows graphically, so crawlers can be built without writing code.
Stars: ✭ 365 (-81.39%)
Mutual labels:  crawler, spider, web-crawler
Maman
Rust Web Crawler saving pages on Redis
Stars: ✭ 39 (-98.01%)
Mutual labels:  crawler, spider, web-crawler
Crawlab
Distributed web crawler admin platform for managing spiders, regardless of language or framework
Stars: ✭ 8,392 (+327.94%)
Mutual labels:  crawler, spider, web-crawler
Genvue
GenVue is a hostable web application that lets confidential users upload and share private files. Tech stack: .NET Core 2.0 + Vue.js + Vuex + OpenIddict + Vuetifyjs + EF + SQL Server/Postgres
Stars: ✭ 116 (-94.08%)
Mutual labels:  netcore, netcore2
Bilibili member crawler
Bilibili user crawler. Yay, it's a crawler!
Stars: ✭ 115 (-94.14%)
Mutual labels:  crawler, spider
Examples Of Web Crawlers
Some interesting Python crawler examples that are beginner-friendly, mainly crawling Taobao, Tmall, WeChat, Douban, QQ, and other sites.
Stars: ✭ 10,724 (+446.86%)
Mutual labels:  crawler, spider
Douban Movie
A Golang crawler that scrapes the Douban Movie Top 250
Stars: ✭ 114 (-94.19%)
Mutual labels:  crawler, spider
Decryptlogin
APIs for logging in to some websites using requests.
Stars: ✭ 1,861 (-5.1%)
Mutual labels:  crawler, spider
Free proxy website
A collection of websites offering free socks/https/http proxies
Stars: ✭ 119 (-93.93%)
Mutual labels:  crawler, spider
Expressive
Expressive is a cross-platform expression parsing and evaluation framework. The cross-platform nature is achieved through compiling for .NET Standard so it will run on practically any platform.
Stars: ✭ 113 (-94.24%)
Mutual labels:  parsing, cross-platform

Abot

Please star this project!!

C# web crawler built for speed and flexibility.

Abot is an open source C# web crawler framework built for speed and flexibility. It takes care of the low-level plumbing (multithreading, HTTP requests, scheduling, link parsing, etc.). You just register for events to process the page data, and you can also plug in your own implementations of core interfaces to take complete control over the crawl process. Abot NuGet package versions >= 2.0 target .NET Standard 2.0, and versions < 2.0 target .NET Framework 4.0, which makes Abot highly compatible with many .NET Framework/Core implementations.

What's So Great About It?
  • Open Source (Free for commercial and personal use)
  • It's fast, really fast!!
  • Easily customizable (Pluggable architecture allows you to decide what gets crawled and how)
  • Heavily unit tested (High code coverage)
  • Very lightweight (not over engineered)
  • No out of process dependencies (no databases, no installed services, etc...)
Links of Interest
Use AbotX for more powerful extensions/wrappers




Quick Start

Installing Abot
  • Install Abot using Nuget
PM> Install-Package Abot
Using Abot
using System;
using System.Threading.Tasks;
using Abot2.Core;
using Abot2.Crawler;
using Abot2.Poco;
using Serilog;

namespace TestAbotUse
{
    class Program
    {
        static async Task Main(string[] args)
        {
            Log.Logger = new LoggerConfiguration()
                .MinimumLevel.Information()
                .WriteTo.Console()
                .CreateLogger();

            Log.Logger.Information("Demo starting up!");

            await DemoSimpleCrawler();
            await DemoSinglePageRequest();
        }

        private static async Task DemoSimpleCrawler()
        {
            var config = new CrawlConfiguration
            {
                MaxPagesToCrawl = 10, //Only crawl 10 pages
                MinCrawlDelayPerDomainMilliSeconds = 3000 //Wait this many millisecs between requests
            };
            var crawler = new PoliteWebCrawler(config);

            crawler.PageCrawlCompleted += PageCrawlCompleted;//Several events available...

            var crawlResult = await crawler.CrawlAsync(new Uri("http://!!!!!!!!YOURSITEHERE!!!!!!!!!.com"));
        }

        private static async Task DemoSinglePageRequest()
        {
            var pageRequester = new PageRequester(new CrawlConfiguration(), new WebContentExtractor());

            var crawledPage = await pageRequester.MakeRequestAsync(new Uri("http://google.com"));
            Log.Logger.Information("{result}", new
            {
                url = crawledPage.Uri,
                status = Convert.ToInt32(crawledPage.HttpResponseMessage.StatusCode)
            });
        }

        private static void PageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
        {
            var httpStatus = e.CrawledPage.HttpResponseMessage.StatusCode;
            var rawPageText = e.CrawledPage.Content.Text;
        }
    }
}

Abot Configuration

Abot's Abot2.Poco.CrawlConfiguration class has a ton of configuration options. You can see what effect each config value has on the crawl by looking at the code comments.

var crawlConfig = new CrawlConfiguration();
crawlConfig.CrawlTimeoutSeconds = 100;
crawlConfig.MaxConcurrentThreads = 10;
crawlConfig.MaxPagesToCrawl = 1000;
crawlConfig.UserAgentString = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36";
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue1", "1111");
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue2", "2222");
//etc...

Abot Events

Register for events and create processing methods

crawler.PageCrawlStarting += crawler_ProcessPageCrawlStarting;
crawler.PageCrawlCompleted += crawler_ProcessPageCrawlCompleted;
crawler.PageCrawlDisallowed += crawler_PageCrawlDisallowed;
crawler.PageLinksCrawlDisallowed += crawler_PageLinksCrawlDisallowed;
void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
	PageToCrawl pageToCrawl = e.PageToCrawl;
	Console.WriteLine($"About to crawl link {pageToCrawl.Uri.AbsoluteUri} which was found on page {pageToCrawl.ParentUri.AbsoluteUri}");
}

void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
	CrawledPage crawledPage = e.CrawledPage;	
	if (crawledPage.HttpRequestException != null || crawledPage.HttpResponseMessage.StatusCode != HttpStatusCode.OK)
		Console.WriteLine($"Crawl of page failed {crawledPage.Uri.AbsoluteUri}");
	else
		Console.WriteLine($"Crawl of page succeeded {crawledPage.Uri.AbsoluteUri}");

	if (string.IsNullOrEmpty(crawledPage.Content.Text))
		Console.WriteLine($"Page had no content {crawledPage.Uri.AbsoluteUri}");

	var angleSharpHtmlDocument = crawledPage.AngleSharpHtmlDocument; //AngleSharp parser
}

void crawler_PageLinksCrawlDisallowed(object sender, PageLinksCrawlDisallowedArgs e)
{
	CrawledPage crawledPage = e.CrawledPage;
	Console.WriteLine($"Did not crawl the links on page {crawledPage.Uri.AbsoluteUri} due to {e.DisallowedReason}");
}

void crawler_PageCrawlDisallowed(object sender, PageCrawlDisallowedArgs e)
{
	PageToCrawl pageToCrawl = e.PageToCrawl;
	Console.WriteLine($"Did not crawl page {pageToCrawl.Uri.AbsoluteUri} due to {e.DisallowedReason}");
}

Custom objects and the dynamic crawl bag

Add any number of custom objects to the dynamic crawl bag or page bag. These objects will be available in the CrawlContext.CrawlBag object, PageToCrawl.PageBag object or CrawledPage.PageBag object.

var crawler = new PoliteWebCrawler();
crawler.CrawlBag.MyFoo1 = new Foo();
crawler.CrawlBag.MyFoo2 = new Foo();
crawler.PageCrawlStarting += crawler_ProcessPageCrawlStarting;
...
void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
    //Get your Foo instances from the CrawlContext object
    var foo1 = e.CrawlContext.CrawlBag.MyFoo1;
    var foo2 = e.CrawlContext.CrawlBag.MyFoo2;

    //Also add a dynamic value to the PageToCrawl or CrawledPage
    e.PageToCrawl.PageBag.Bar = new Bar();
}

Cancellation

CancellationTokenSource cancellationTokenSource = new CancellationTokenSource();

var crawler = new PoliteWebCrawler();
var result = await crawler.CrawlAsync(new Uri("addurihere"), cancellationTokenSource);
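
Because this is a standard CancellationTokenSource, the usual .NET patterns apply. For example, a crawl can be stopped automatically after a fixed amount of time (a minimal sketch; the URL is a placeholder as above):

var cancellationTokenSource = new CancellationTokenSource();
cancellationTokenSource.CancelAfter(TimeSpan.FromMinutes(5)); //Request cancellation after 5 minutes

var crawler = new PoliteWebCrawler();
var result = await crawler.CrawlAsync(new Uri("addurihere"), cancellationTokenSource);
//Inspect the returned CrawlResult to see whether the crawl finished or was cut short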





Customizing Crawl Behavior

Abot was designed to be as pluggable as possible. This allows you to easily alter the way it works to suit your needs.

The easiest way to change Abot's behavior for common features is to change the config values that control them. See the Quick Start page for examples on the different ways Abot can be configured.

CrawlDecision Callbacks/Delegates

Sometimes you don't want to create a class and go through the ceremony of extending a base class or implementing an interface directly. For all you lazy developers out there, Abot provides a shorthand way to add your custom crawl decision logic. NOTE: The ICrawlDecisionMaker's corresponding method is called first; if it does not allow the action, these callbacks will not be called.

var crawler = new PoliteWebCrawler();

crawler.ShouldCrawlPageDecisionMaker = (pageToCrawl, crawlContext) => 
{
	var decision = new CrawlDecision{ Allow = true };
	if(pageToCrawl.Uri.Authority == "google.com")
		return new CrawlDecision{ Allow = false, Reason = "Dont want to crawl google pages" };
	
	return decision;
};

crawler.ShouldDownloadPageContentDecisionMaker = (crawledPage, crawlContext) =>
{
	var decision = new CrawlDecision{ Allow = true };
	if (!crawledPage.Uri.AbsoluteUri.Contains(".com"))
		return new CrawlDecision { Allow = false, Reason = "Only download raw page content for .com tlds" };

	return decision;
};

crawler.ShouldCrawlPageLinksDecisionMaker = (crawledPage, crawlContext) =>
{
	var decision = new CrawlDecision{ Allow = true };
	if (crawledPage.Content.Bytes.Length < 100)
		return new CrawlDecision { Allow = false, Reason = "Just crawl links in pages that have at least 100 bytes" };

	return decision;
};

Custom Implementations

PoliteWebCrawler is the master of orchestrating the crawl. Its job is to coordinate all the utility classes to "crawl" a site. PoliteWebCrawler accepts an alternate implementation for all its dependencies through its constructor.

var crawler = new PoliteWebCrawler(
	new CrawlConfiguration(),
	new YourCrawlDecisionMaker(),
	new YourThreadMgr(),
	new YourScheduler(),
	new YourPageRequester(),
	new YourHyperLinkParser(),
	new YourMemoryManager(),
	new YourDomainRateLimiter(),
	new YourRobotsDotTextFinder());

Passing null for any implementation will use the default. The example below will use your custom implementation for the IPageRequester and IHyperLinkParser but will use the default for all others.

var crawler = new PoliteWebCrawler(
	null,
	null,
	null,
	null,
	new YourPageRequester(),
	new YourHyperLinkParser(),
	null,
	null,
	null);

The following are explanations of each interface that PoliteWebCrawler relies on to do the real work.

ICrawlDecisionMaker

The callback/delegate shortcuts are great for adding a small amount of logic, but if you are doing anything heavier you will want to pass in your own implementation of ICrawlDecisionMaker. The crawler calls this implementation to see whether a page should be crawled, whether the page's content should be downloaded and whether a crawled page's links should be crawled.

CrawlDecisionMaker.cs is the default ICrawlDecisionMaker used by Abot. This class takes care of common checks like making sure the config value MaxPagesToCrawl is not exceeded. Most users will only need to create a class that extends CrawlDecisionMaker and add their custom logic. However, you are completely free to create a class that implements ICrawlDecisionMaker and pass it into PoliteWebCrawler's constructor; a minimal example follows the interface below.

/// <summary>
/// Determines what pages should be crawled, whether the raw content should be downloaded and if the links on a page should be crawled
/// </summary>
public interface ICrawlDecisionMaker
{
	/// <summary>
	/// Decides whether the page should be crawled
	/// </summary>
	CrawlDecision ShouldCrawlPage(PageToCrawl pageToCrawl, CrawlContext crawlContext);

	/// <summary>
	/// Decides whether the page's links should be crawled
	/// </summary>
	CrawlDecision ShouldCrawlPageLinks(CrawledPage crawledPage, CrawlContext crawlContext);

	/// <summary>
	/// Decides whether the page's content should be downloaded
	/// </summary>
	CrawlDecision ShouldDownloadPageContent(CrawledPage crawledPage, CrawlContext crawlContext);
}
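
For example, a subclass that keeps all of the default checks but refuses to crawl a specific host might look like the sketch below. The class and host names are made up, and it assumes the default CrawlDecisionMaker methods can be overridden as the text above implies:

using Abot2.Core;
using Abot2.Crawler;
using Abot2.Poco;

public class MyCrawlDecisionMaker : CrawlDecisionMaker
{
	public override CrawlDecision ShouldCrawlPage(PageToCrawl pageToCrawl, CrawlContext crawlContext)
	{
		//Custom check first...
		if (pageToCrawl.Uri.Authority == "donotcrawl.example.com")
			return new CrawlDecision { Allow = false, Reason = "Host is blocked" };

		//...then fall back to the default checks (MaxPagesToCrawl, crawl depth, etc.)
		return base.ShouldCrawlPage(pageToCrawl, crawlContext);
	}
}

//Pass it in through the constructor, nulls use the defaults for everything else
var crawler = new PoliteWebCrawler(
	new CrawlConfiguration(),
	new MyCrawlDecisionMaker(),
	null, null, null, null, null, null, null);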
IThreadManager

The IThreadManager interface deals with the multithreading details. It is used by the crawler to manage concurrent http requests.

TaskThreadManager.cs is the default IThreadManager used by Abot.

/// <summary>
/// Handles the multithreading implementation details
/// </summary>
public interface IThreadManager : IDisposable
{
	/// <summary>
	/// Max number of threads to use.
	/// </summary>
	int MaxThreads { get; }

	/// <summary>
	/// Will perform the action asynchronously on a separate thread
	/// </summary>
	/// <param name="action">The action to perform</param>
	void DoWork(Action action);

	/// <summary>
	/// Whether there are running threads
	/// </summary>
	bool HasRunningThreads();

	/// <summary>
	/// Abort all running threads
	/// </summary>
	void AbortAll();
}
IScheduler

The IScheduler interface deals with managing which pages need to be crawled. The crawler hands the links it finds to, and gets the next pages to crawl from, the IScheduler implementation. A common use case for writing your own implementation is distributing crawls across multiple machines, for example with a DistributedScheduler backed by a shared store; a minimal in-memory sketch follows the interface below.

Scheduler.cs is the default IScheduler used by the crawler and by default is constructed with in-memory collections to track which pages have been crawled and which still need to be crawled.

/// <summary>
/// Handles managing the priority of what pages need to be crawled
/// </summary>
public interface IScheduler
{
	/// <summary>
	/// Count of remaining items that are currently scheduled
	/// </summary>
	int Count { get; }

	/// <summary>
	/// Schedules the param to be crawled
	/// </summary>
	void Add(PageToCrawl page);

	/// <summary>
	/// Schedules the param to be crawled
	/// </summary>
	void Add(IEnumerable<PageToCrawl> pages);

	/// <summary>
	/// Gets the next page to crawl
	/// </summary>
	PageToCrawl GetNext();

	/// <summary>
	/// Clear all currently scheduled pages
	/// </summary>
	void Clear();
}
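
As a sketch of the shape such an implementation could take, here is a simple in-memory FIFO scheduler that de-duplicates pages by absolute URI. It only covers the members listed above; the IScheduler in your Abot version may declare additional members, so treat it as a starting point. A distributed version would back the queue and the seen-set with a shared store such as Redis instead of in-process collections:

using System.Collections.Generic;
using Abot2.Poco;

public class SimpleFifoScheduler //: IScheduler (implement any remaining members for your Abot version)
{
	private readonly Queue<PageToCrawl> _queue = new Queue<PageToCrawl>();
	private readonly HashSet<string> _seen = new HashSet<string>();

	public int Count => _queue.Count;

	public void Add(PageToCrawl page)
	{
		//Only schedule each absolute uri once
		if (_seen.Add(page.Uri.AbsoluteUri))
			_queue.Enqueue(page);
	}

	public void Add(IEnumerable<PageToCrawl> pages)
	{
		foreach (var page in pages)
			Add(page);
	}

	public PageToCrawl GetNext()
	{
		return _queue.Count > 0 ? _queue.Dequeue() : null;
	}

	public void Clear()
	{
		_queue.Clear();
		_seen.Clear();
	}
}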
IPageRequester

The IPageRequester interface deals with making the raw http requests.

PageRequester.cs is the default IPageRequester used by the crawler.

public interface IPageRequester : IDisposable
{
	/// <summary>
	/// Make an http web request to the url and download its content
	/// </summary>
	Task<CrawledPage> MakeRequestAsync(Uri uri);

	/// <summary>
	/// Make an http web request to the url and download its content based on the param func decision
	/// </summary>
	Task<CrawledPage> MakeRequestAsync(Uri uri, Func<CrawledPage, CrawlDecision> shouldDownloadContent);
}
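
Based on the interface shown above, a custom implementation can also just wrap the default one. The sketch below is a hypothetical decorator that adds request timing around any inner IPageRequester, such as the PageRequester used in the Quick Start example:

using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Abot2.Core;
using Abot2.Poco;
using Serilog;

public class TimedPageRequester : IPageRequester
{
	private readonly IPageRequester _inner;

	public TimedPageRequester(IPageRequester inner)
	{
		_inner = inner;
	}

	public Task<CrawledPage> MakeRequestAsync(Uri uri)
	{
		return MakeRequestAsync(uri, null);
	}

	public async Task<CrawledPage> MakeRequestAsync(Uri uri, Func<CrawledPage, CrawlDecision> shouldDownloadContent)
	{
		var stopwatch = Stopwatch.StartNew();
		var page = shouldDownloadContent == null
			? await _inner.MakeRequestAsync(uri)
			: await _inner.MakeRequestAsync(uri, shouldDownloadContent);
		Log.Logger.Information("Requested {uri} in {elapsed}ms", uri, stopwatch.ElapsedMilliseconds);
		return page;
	}

	public void Dispose()
	{
		_inner.Dispose();
	}
}

Pass new TimedPageRequester(new PageRequester(config, new WebContentExtractor())) as the IPageRequester argument of the PoliteWebCrawler constructor shown earlier.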
IHyperLinkParser

The IHyperLinkParser interface deals with parsing the links out of raw html.

AngleSharpHyperLinkParser.cs is the default IHyperLinkParser used by the crawler. It uses the well-known AngleSharp library to do the HTML parsing. AngleSharp supports CSS-style selectors, much like jQuery, but entirely in C#.

/// <summary>
/// Handles parsing hyperlinks out of the raw html
/// </summary>
public interface IHyperLinkParser
{
	/// <summary>
	/// Parses html to extract hyperlinks, converts each into an absolute url
	/// </summary>
	IEnumerable<Uri> GetLinks(CrawledPage crawledPage);
}
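
The parsed AngleSharp document is also exposed on each CrawledPage (the AngleSharpHtmlDocument property used in the events example above), so the same CSS-selector API can be used directly in your own event handlers without writing a custom parser. For example, pulling image URLs out of a completed page:

void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
	var document = e.CrawledPage.AngleSharpHtmlDocument;

	//QuerySelectorAll takes a css selector, much like jquery
	foreach (var img in document.QuerySelectorAll("img[src]"))
		Console.WriteLine(img.GetAttribute("src"));
}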
IMemoryManager

The IMemoryManager handles memory monitoring. This feature is still experimental and could be removed in a future release if found to be unreliable.

MemoryManager.cs is the default implementation used by the crawler.

/// <summary>
/// Handles memory monitoring/usage
/// </summary>
public interface IMemoryManager : IMemoryMonitor, IDisposable
{
	/// <summary>
	/// Whether the current process that is hosting this instance is allocated/using above the param value of memory in mb
	/// </summary>
	bool IsCurrentUsageAbove(int sizeInMb);

	/// <summary>
	/// Whether there is at least the param value of available memory in mb
	/// </summary>
	bool IsSpaceAvailable(int sizeInMb);
}
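
Memory monitoring is normally driven by configuration rather than a custom implementation. The snippet below is a sketch only; the property names (MaxMemoryUsageInMb, MaxMemoryUsageCacheTimeInSeconds) are assumptions that should be checked against the CrawlConfiguration code comments:

var config = new CrawlConfiguration
{
	MaxMemoryUsageInMb = 512, //Assumed property: abort the crawl if the process uses more than this
	MaxMemoryUsageCacheTimeInSeconds = 5 //Assumed property: how long a memory reading is cached
};
var crawler = new PoliteWebCrawler(config);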
IDomainRateLimiter

The IDomainRateLimiter handles domain rate limiting. It will handle determining how much time needs to elapse before it is ok to make another http request to the domain.

DomainRateLimiter.cs is the default implementation used by the crawler.

/// <summary>
/// Rate limits or throttles on a per domain basis
/// </summary>
public interface IDomainRateLimiter
{
	/// <summary>
	/// If the domain of the param has been flagged for rate limiting, it will be rate limited according to the configured minimum crawl delay
	/// </summary>
	void RateLimit(Uri uri);

	/// <summary>
	/// Add a domain entry so that domain may be rate limited according to the param minimum crawl delay
	/// </summary>
	void AddDomain(Uri uri, long minCrawlDelayInMillisecs);
}
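
For example, to slow requests to one specific domain while keeping the default pace everywhere else, a rate limiter can be configured up front and passed into the constructor. This sketch assumes DomainRateLimiter exposes a constructor taking a default minimum delay in milliseconds (check DomainRateLimiter.cs for the exact signature), and the domain name is a placeholder:

//1 second default delay per domain, 10 seconds for one particularly fragile domain
var rateLimiter = new DomainRateLimiter(1000); //Assumed constructor
rateLimiter.AddDomain(new Uri("http://fragilesite.example.com"), 10000);

var crawler = new PoliteWebCrawler(
	new CrawlConfiguration(),
	null, null, null, null, null, null,
	rateLimiter,
	null);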
IRobotsDotTextFinder

The IRobotsDotTextFinder is responsible for retrieving the robots.txt file for every domain (if the IsRespectRobotsDotTextEnabled config value is true) and building the robots.txt abstraction, which implements the IRobotsDotText interface.

RobotsDotTextFinder.cs is the default implementation used by the crawler.

/// <summary>
/// Finds and builds the robots.txt file abstraction
/// </summary>
public interface IRobotsDotTextFinder
{
	/// <summary>
	/// Finds the robots.txt file using the rootUri
	/// </summary>
	IRobotsDotText Find(Uri rootUri);
}
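
Robots.txt handling is switched on through configuration rather than code. A minimal sketch follows; IsRespectRobotsDotTextEnabled and RobotsDotTextUserAgentString are the expected CrawlConfiguration property names, but verify them against the code comments:

var config = new CrawlConfiguration
{
	IsRespectRobotsDotTextEnabled = true, //Fetch and honor robots.txt for each domain
	RobotsDotTextUserAgentString = "abot", //User agent name matched against robots.txt rules
	MinCrawlDelayPerDomainMilliSeconds = 1000 //Minimum delay between requests to the same domain
};
var crawler = new PoliteWebCrawler(config);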




