All Projects → dotnetcore → Dotnetspider

dotnetcore / Dotnetspider

Licence: mit
DotnetSpider, a .NET standard web crawling library. It is lightweight, efficient and fast high-level web crawling & scraping framework

Programming Languages

C#
18002 projects
HTML
75241 projects
CSS
56736 projects
javascript
184084 projects - #8 most used programming language
TSQL
950 projects
shell
77523 projects

Projects that are alternatives of or similar to Dotnetspider

Chromely
Build HTML Desktop Apps on .NET/.NET Core/.NET 5 using native GUI, HTML5, JavaScript, CSS
Stars: ✭ 2,728 (-15.62%)
Mutual labels:  cross-platform, dotnetcore
Creeper
🐾 Creeper - The Next Generation Crawler Framework (Go)
Stars: ✭ 762 (-76.43%)
Mutual labels:  crawler, cross-platform
Haipproxy
💖 High available distributed ip proxy pool, powerd by Scrapy and Redis
Stars: ✭ 4,993 (+54.44%)
Mutual labels:  crawler, distributed
Hpx
The C++ Standard Library for Parallelism and Concurrency
Stars: ✭ 1,805 (-44.17%)
Mutual labels:  cross-platform, distributed
Dotnetcrawler
DotnetCrawler is a straightforward, lightweight web crawling/scrapying library for Entity Framework Core output based on dotnet core. This library designed like other strong crawler libraries like WebMagic and Scrapy but for enabling extandable your custom requirements. Medium link : https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
Stars: ✭ 100 (-96.91%)
Mutual labels:  crawler, dotnetcore
Avalonia
A cross platform XAML framework for .NET
Stars: ✭ 12,588 (+289.36%)
Mutual labels:  cross-platform, dotnetcore
Xxl Crawler
A distributed web crawler framework.(分布式爬虫框架XXL-CRAWLER)
Stars: ✭ 561 (-82.65%)
Mutual labels:  crawler, distributed
Smartsql
SmartSql = MyBatis in C# + .NET Core+ Cache(Memory | Redis) + R/W Splitting + PropertyChangedTrack +Dynamic Repository + InvokeSync + Diagnostics
Stars: ✭ 775 (-76.03%)
Mutual labels:  cross-platform, dotnetcore
Scaleable Crawler With Docker Cluster
a scaleable and efficient crawelr with docker cluster , crawl million pages in 2 hours with a single machine
Stars: ✭ 96 (-97.03%)
Mutual labels:  crawler, distributed
Lizard
💐 Full Amazon Automatic Download
Stars: ✭ 41 (-98.73%)
Mutual labels:  crawler, distributed
Steamtools
🛠「Steam++」是一个开源跨平台的多功能Steam工具箱。
Stars: ✭ 4,458 (+37.89%)
Mutual labels:  cross-platform, dotnetcore
Abot
Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Stars: ✭ 1,961 (-39.34%)
Mutual labels:  crawler, cross-platform
Module Shop Mini Program
一个基于 .NET Core构建的简单、跨平台、模块化的商城系统
Stars: ✭ 89 (-97.25%)
Mutual labels:  cross-platform, dotnetcore
Jaya
Cross platform file manager application for Windows, Mac and Linux operating systems. (planned mobile support)
Stars: ✭ 219 (-93.23%)
Mutual labels:  cross-platform, dotnetcore
Meissa
Cross-platform Distributed Test Runner. Executes tests in parallel, time balanced on multiple machines.
Stars: ✭ 66 (-97.96%)
Mutual labels:  cross-platform, distributed
Scrapy Redis
Redis-based components for Scrapy.
Stars: ✭ 4,998 (+54.59%)
Mutual labels:  crawler, distributed
Module Shop
一个基于 .NET Core构建的简单、跨平台、模块化的商城系统
Stars: ✭ 398 (-87.69%)
Mutual labels:  cross-platform, dotnetcore
Schedulemastercore
This is a distributed task management system base on .Net Core platform .
Stars: ✭ 549 (-83.02%)
Mutual labels:  cross-platform, distributed
Disec
Distributed Image Search Engine Crawler
Stars: ✭ 11 (-99.66%)
Mutual labels:  crawler, distributed
Jlitespider
A lite distributed Java spider framework :-)
Stars: ✭ 151 (-95.33%)
Mutual labels:  crawler, distributed

DotnetSpider

免责申明:本框架是为了帮助开发人员简化开发流程、提高开发效率,请勿使用此框架做任何违法国家法律的事情,使用者所做任何事情也与本框架的作者无关。

Build Status NuGet Member project of .NET Core Community GitHub license

DotnetSpider, a .NET Standard web crawling library. It is lightweight, efficient and fast high-level web crawling & scraping framework.

If you want get latest beta packages, you should add the myget feed:

<add key="myget.org" value="https://www.myget.org/F/zlzforever/api/v3/index.json" protocolVersion="3" />

DESIGN

DESIGN IMAGE

DEVELOP ENVIROMENT

  1. Visual Studio 2017 (15.3 or later) or Jetbrains Rider

  2. .NET Core 2.2 or later

  3. Docker

  4. MySql

     docker run --name mysql -d -p 3306:3306 --restart always -e MYSQL_ROOT_PASSWORD=1qazZAQ! mysql:5.7
    
  5. Redis (option)

     docker run --name redis -d -p 6379:6379 --restart always redis
    
  6. SqlServer

     docker run --name sqlserver -d -p 1433:1433 --restart always  -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=1qazZAQ!' mcr.microsoft.com/mssql/server:2017-latest
    
  7. PostgreSQL (option)

     docker run --name postgres -d  -p 5432:5432 --restart always -e POSTGRES_PASSWORD=1qazZAQ! postgres
    
  8. MongoDb (option)

     docker run --name mongo -d -p 27017:27017 --restart always mongo
    
  9. RabbitMQ

    docker run -d --restart always --name rabbimq -p 4369:4369 -p 5671-5672:5671-5672 -p 25672:25672 -p 15671-15672:15671-15672 \
           -e RABBITMQ_DEFAULT_USER=user -e RABBITMQ_DEFAULT_PASS=password \
           rabbitmq:3-management
    
  10. Docker remote api for mac

    docker run -d  --restart always --name socat -v /var/run/docker.sock:/var/run/docker.sock -p 2376:2375 bobrik/socat TCP4-LISTEN:2375,fork,reuseaddr UNIX-CONNECT:/var/run/docker.sock
    
  11. HBase

    docker run -d --restart always --name hbase -p 20550:8080 -p 8085:8085 -p 9090:9090 -p 9095:9095 -p 16010:16010 dajobe/hbase
    

MORE DOCUMENTS

https://github.com/dotnetcore/DotnetSpider/wiki

SAMPLES

Please see the Project DotnetSpider.Sample in the solution.

BASE USAGE

Base usage Codes

ADDITIONAL USAGE: Configurable Entity Spider

View complete Codes

	public class EntitySpider : Spider
	{
		public static async Task RunAsync()
		{
			var builder = Builder.CreateDefaultBuilder<EntitySpider>();
			builder.UseSerilog();
			builder.UseQueueDistinctBfsScheduler<HashSetDuplicateRemover>();
			await builder.Build().RunAsync();
		}

		public EntitySpider(IOptions<SpiderOptions> options, SpiderServices services, ILogger<Spider> logger) : base(
			options, services, logger)
		{
		}

		protected override async Task InitializeAsync(CancellationToken stoppingToken)
		{
			AddDataFlow(new DataParser<CnblogsEntry>());
			AddDataFlow(GetDefaultStorage());
			await AddRequestsAsync(
				new Request("https://news.cnblogs.com/n/page/1/", new Dictionary<string, string> {{"网站", "博客园"}}),
				new Request("https://news.cnblogs.com/n/page/2/", new Dictionary<string, string> {{"网站", "博客园"}}));
		}

		protected override (string Id, string Name) GetIdAndName()
		{
			return (ObjectId.NewId.ToString(), "博客园");
		}

		[Schema("cnblogs", "news")]
		[EntitySelector(Expression = ".//div[@class='news_block']", Type = SelectorType.XPath)]
		[GlobalValueSelector(Expression = ".//a[@class='current']", Name = "类别", Type = SelectorType.XPath)]
		[FollowRequestSelector(XPaths = new[] {"//div[@class='pager']"})]
		public class CnblogsEntry : EntityBase<CnblogsEntry>
		{
			protected override void Configure()
			{
				HasIndex(x => x.Title);
				HasIndex(x => new {x.WebSite, x.Guid}, true);
			}

			public int Id { get; set; }

			[Required]
			[StringLength(200)]
			[ValueSelector(Expression = "类别", Type = SelectorType.Environment)]
			public string Category { get; set; }

			[Required]
			[StringLength(200)]
			[ValueSelector(Expression = "网站", Type = SelectorType.Environment)]
			public string WebSite { get; set; }

			[StringLength(200)]
			[ValueSelector(Expression = "//title")]
			[ReplaceFormatter(NewValue = "", OldValue = " - 博客园")]
			public string Title { get; set; }

			[StringLength(40)]
			[ValueSelector(Expression = "GUID", Type = SelectorType.Environment)]
			public string Guid { get; set; }

			[ValueSelector(Expression = ".//h2[@class='news_entry']/a")]
			public string News { get; set; }

			[ValueSelector(Expression = ".//h2[@class='news_entry']/a/@href")]
			public string Url { get; set; }

			[ValueSelector(Expression = ".//div[@class='entry_summary']")]
			public string PlainText { get; set; }

			[ValueSelector(Expression = "DATETIME", Type = SelectorType.Environment)]
			public DateTime CreationTime { get; set; }
		}
	}

Distributed spider

Read this document

Puppeteer downloader

Coming soon

NOTICE

when you use redis scheduler, please update your redis config:

timeout 0
tcp-keepalive 60

Dependencies

Package License
Bert.RateLimiters Apache 2.0
MessagePack MIT
Newtonsoft.Json MIT
Dapper Apache 2.0
HtmlAgilityPack MIT
ZCJ.HashedWheelTimer MIT
murmurhash Apache 2.0
Serilog.AspNetCore Apache 2.0
Serilog.Sinks.Console Apache 2.0
Serilog.Sinks.RollingFile Apache 2.0
Serilog.Sinks.PeriodicBatching Apache 2.0
MongoDB.Driver Apache 2.0
MySqlConnector MIT
AutoMapper.Extensions.Microsoft.DependencyInjection MIT
Docker.DotNet MIT
BuildBundlerMinifier Apache 2.0
Pomelo.EntityFrameworkCore.MySql MIT
Quartz.AspNetCore Apache 2.0
Quartz.AspNetCore.MySqlConnector Apache 2.0
Npgsql PostgreSQL License
RabbitMQ.Client Apache 2.0
Polly BSD 3-C

AREAS FOR IMPROVEMENTS

QQ Group: 477731655 Email: [email protected]

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].