
xboxeer / NScrapy

License: Apache-2.0
NScrapy is a .NET Core cross-platform distributed spider framework that provides an easy way to write your own spiders

Programming Languages

C#

Projects that are alternatives of or similar to NScrapy

Gerapy
Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js
Stars: ✭ 2,601 (+2855.68%)
Mutual labels:  spider, distributed, scrapy
Haipproxy
💖 Highly available distributed IP proxy pool, powered by Scrapy and Redis
Stars: ✭ 4,993 (+5573.86%)
Mutual labels:  spider, distributed, scrapy
Marmot
💐Marmot | Web Crawler/HTTP protocol Download Package 🐭
Stars: ✭ 186 (+111.36%)
Mutual labels:  spider, scrapy
Scrapydweb
Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI. DEMO 👉
Stars: ✭ 2,385 (+2610.23%)
Mutual labels:  spider, scrapy
Py Elasticsearch Django
A search engine at the scale of tens of millions of records, developed in Python
Stars: ✭ 207 (+135.23%)
Mutual labels:  spider, scrapy
Fp Server
Free proxy server, continuously crawling and providing proxies, based on Tornado and Scrapy; build your own proxy pool locally
Stars: ✭ 154 (+75%)
Mutual labels:  spider, scrapy
Scrapingoutsourcing
ScrapingOutsourcing focuses on sharing crawler code, aiming to publish one example per week
Stars: ✭ 164 (+86.36%)
Mutual labels:  spider, scrapy
Zi5book
Crawls all Kindle e-books from book.zi5.me, organized by author and title; each book comes in both mobi and epub formats, and the whole site is crawled in a distributed fashion
Stars: ✭ 191 (+117.05%)
Mutual labels:  spider, distributed
Taobaoscrapy
😩 Tool for Taobao/Tmall | a childhood toy that is now outdated
Stars: ✭ 146 (+65.91%)
Mutual labels:  spider, scrapy
Spider job
A crawler for recruitment-site data
Stars: ✭ 234 (+165.91%)
Mutual labels:  spider, scrapy
Spiderkeeper
Admin UI for Scrapy / open-source Scrapinghub
Stars: ✭ 2,562 (+2811.36%)
Mutual labels:  spider, scrapy
scrapy helper
Dynamically configurable crawler
Stars: ✭ 84 (-4.55%)
Mutual labels:  spider, scrapy
Python3 Spider
Practical Python crawlers - simulated login to major sites, including but not limited to: slider CAPTCHAs, Pinduoduo, Meituan, Baidu, bilibili, Dianping, and Taobao. If you like it, please star ❤️
Stars: ✭ 2,129 (+2319.32%)
Mutual labels:  spider, scrapy
Jlitespider
A lite distributed Java spider framework :-)
Stars: ✭ 151 (+71.59%)
Mutual labels:  spider, distributed
Spoon
🥄 A package for building specific Proxy Pool for different Sites.
Stars: ✭ 173 (+96.59%)
Mutual labels:  spider, distributed
Awesome Web Scraper
A collection of awesome web scrapers and crawlers.
Stars: ✭ 147 (+67.05%)
Mutual labels:  spider, scrapy
Goribot
[Crawler/Scraper for Golang] 🕷 A lightweight, distribution-friendly Golang crawler framework.
Stars: ✭ 190 (+115.91%)
Mutual labels:  spider, scrapy
Feapder
feapder is a Python crawler framework with support for distributed crawling, batch collection, task-loss prevention, and rich alerting
Stars: ✭ 110 (+25%)
Mutual labels:  spider, scrapy
Scrapy demo
All kinds of Scrapy demos
Stars: ✭ 128 (+45.45%)
Mutual labels:  spider, scrapy
scrapy-kafka-redis
Distributed crawling/scraping, Kafka And Redis based components for Scrapy
Stars: ✭ 45 (-48.86%)
Mutual labels:  distributed, scrapy

NScrapy


NScrapy is a distributed spider framework based on .NET Core and Redis. The idea behind NScrapy comes from Scrapy, so you can write spiders in a way very similar to Scrapy.


NScrapy Sample code

Below is an NScrapy sample. The spider crawls Liepin, a Chinese recruitment site, and scrapes every PHP position it finds. Starting from the seed URL defined in the [URL] attribute, NScrapy visits the detail page of each position (the ParseItem method) and follows the next page automatically (the VisitPage method). A spider author does not need to know how spiders distributed across machines or processes communicate with each other, how the downloader process obtains the URLs it needs to download, or how the downloader pool is maintained. Just give NScrapy a seed URL, inherit from the Spider.Spider class, and write a few callbacks; NScrapy takes care of the rest. NScrapy supports several kinds of extension through appsetting.json, including adding your own DownloaderMiddleware, configuring HTTP headers, and building a user-agent pool.

Usage:

using NScrapy.Infra;
using NScrapy.Infra.Attributes.SpiderAttributes;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;

namespace NScrapy.Project
{
    class Program
    {
        static void Main(string[] args)
        {
            //Init the NScrapy shell, which initializes the NScrapy context
            var shell = NScrapy.Shell.NScrapy.GetInstance();
            //Specify the Spider to start
            shell.Crawl("JobSpider");
        }
    }
[Name(Name = "JobSpider")]
[URL("https://www.liepin.com/zhaopin/?industries=&dqs=&salary=&jobKind=&pubTime=30&compkind=&compscale=&industryType=&searchType=1&clean_condition=&isAnalysis=&init=1&sortFlag=15&flushckid=0&fromSearchBtn=1&headckid=bb314f611fde073c&d_headId=4b294eff4ad202db83d4ed085fcbf94b&d_ckId=01fb643c53d14dd44d7991e27c98c51b&d_sfrom=search_prime&d_curPage=0&d_pageSize=40&siTag=k_cloHQj_hyIn0SLM9IfRg~UoKQA1_uiNxxEb8RglVcHg&key=php")]
public class JobSpider : Spider.Spider
{
    private string startingTime = DateTime.Now.ToString("yyyyMMddhhmm");
    public JobSpider()
    {
    }
    //Handle the response of the seed URL
    public override void ResponseHandler(IResponse response)
    {
        var pages = response.CssSelector(".pagerbar a::attr(href)").Extract();
        foreach (var page in pages)
        {
            if (!page.Contains("javascript"))
            {
                NScrapy.Shell.NScrapy.GetInstance().Follow(response, page, VisitPage);
            }
        }
        VisitPage(response);
    }
    //Visit a listing page: queue each position's detail page and follow the pagination links
    private void VisitPage(IResponse returnValue)
    {
        var hrefs = returnValue.CssSelector(".job-info h3 a::attr(href)").Extract();
        foreach (var href in hrefs)
        {
            //Use ItemLoader
            NScrapy.Shell.NScrapy.GetInstance().Follow(returnValue, href, ParseItem);
        }
        var pages = returnValue.CssSelector(".pagerbar a::attr(href)").Extract();
        foreach (var page in pages)
        {
            if (!page.Contains("javascript"))
            {
                NScrapy.Shell.NScrapy.GetInstance().Follow(returnValue, page, VisitPage);
            }
        }
    }
    //Extract the position details from a job's detail page
    public void ParseItem(IResponse response)
    {
        //Add Field Mapping to the HTML Dom element
        var itemLoader = new ItemLoader<JobItem>(response);
        itemLoader.AddFieldMapping("Title", "css:.title-info h1::attr(text)");
        itemLoader.AddFieldMapping("Title","css:.job-title h1::attr(text)");

        itemLoader.AddFieldMapping("Firm","css:.title-info h3 a::attr(text)");
        itemLoader.AddFieldMapping("Firm", "css:.title-info h3::attr(text)");
        itemLoader.AddFieldMapping("Firm","css:.title-info h3");
        itemLoader.AddFieldMapping("Firm","css:.job-title h2::attr(text)");

        itemLoader.AddFieldMapping("Salary", "css:.job-main-title p::attr(text)");
        itemLoader.AddFieldMapping("Salary", "css:.job-main-title strong::attr(text)");
        itemLoader.AddFieldMapping("Salary", "css:.job-item-title p::attr(text)");
        itemLoader.AddFieldMapping("Salary", "css:.job-item-title");

        itemLoader.AddFieldMapping("Time","css:.job-title-left time::attr(title)");
        itemLoader.AddFieldMapping("Time","css:.job-title-left time::attr(text)");
        var item = itemLoader.LoadItem();
        //In this example we simply write the firm information to the console; you could write it anywhere else
        Console.WriteLine(item.Firm);
    }
    
}

public class JobItem
{
    public string Firm { get; set; }
    public string Title { get; set; }
    public string Salary { get; set; }
    public string Time { get; set; }
}
}

Distributed NScrapy, supported by Redis

Modify the appsetting.json in your NScrapy project, adding the sections below:

"Scheduler": {
  "SchedulerType": "NScrapy.Scheduler.RedisExt.RedisScheduler"
},
"Scheduler.RedisExt": {
  "RedisServer": "192.168.0.106",//Redis server address
  "RedisPort": "6379",//Redis server port
  "ReceiverQueue": "NScrapy.Downloader",//queue the Downloader listens on
  "ResponseQueue": "NScrapy.ResponseQueue"//queue the Spider listens on
}, 
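With this configuration, requests and responses travel between the Spider and Downloader processes through two Redis lists. The sketch below models that flow with in-memory queues (the queue names come from the config above; the actual Redis client and serialization are omitted, so this illustrates the message flow rather than NScrapy's implementation):

```csharp
using System;
using System.Collections.Generic;

class QueueFlowSketch
{
    //Two named queues, standing in for the two Redis lists
    static readonly Dictionary<string, Queue<string>> Redis = new Dictionary<string, Queue<string>>
    {
        ["NScrapy.Downloader"] = new Queue<string>(),      //ReceiverQueue: requests for the Downloader
        ["NScrapy.ResponseQueue"] = new Queue<string>()    //ResponseQueue: responses for the Spider
    };

    static void Main()
    {
        //1. The scheduler pushes a request URL onto the Downloader's queue
        Redis["NScrapy.Downloader"].Enqueue("https://www.liepin.com/zhaopin/?key=php");

        //2. A Downloader process pops the request, fetches the page,
        //   and pushes the response onto the Spider's queue
        var url = Redis["NScrapy.Downloader"].Dequeue();
        Redis["NScrapy.ResponseQueue"].Enqueue($"<html for {url}>");

        //3. The Spider pops the response and runs its callback on it
        var response = Redis["NScrapy.ResponseQueue"].Dequeue();
        Console.WriteLine(response);
    }
}
```

Because both queues live in Redis rather than in process memory, any number of Downloader and Spider processes on any machine can share the same workload.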

Modify the appsetting.json in the directory that contains NScrapy.DownloaderShell.dll in the same way:

"Scheduler": {
  "SchedulerType": "NScrapy.Scheduler.RedisExt.RedisScheduler"
},
"Scheduler.RedisExt": {
  "RedisServer": "192.168.0.106",//Redis server address
  "RedisPort": "6379",//Redis server port
  "ReceiverQueue": "NScrapy.Downloader",//queue the Downloader listens on
  "ResponseQueue": "NScrapy.ResponseQueue"//queue the Spider listens on
}, 

Run the DownloaderShell on its own:

dotnet %DownloaderShellPath%/NScrapy.DownloaderShell.dll

If you want the Downloader to publish its own status to Redis, add the middleware below to the DownloaderShell's appsetting.json (NScrapyWebConsole, currently under development, reads the Downloader status from Redis):

"DownloaderMiddlewares": [
  { "Middleware": "NScrapy.DownloaderShell.StatusUpdaterMiddleware" }
],
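NScrapy's actual DownloaderMiddleware interface is not shown in this README, so the shape below (the interface name and its PreDownload/PostDownload pair) is an assumption made purely for illustration. A custom middleware such as the bundled StatusUpdaterMiddleware would hook into the request/response lifecycle roughly like this:

```csharp
using System;

//Assumed interface shape for illustration only; check NScrapy's source for the real one
public interface IDownloaderMiddleware
{
    void PreDownload(string url);
    void PostDownload(string url, string responseBody);
}

//A hypothetical middleware that counts completed downloads; a status updater
//would write similar counters to Redis instead of keeping them in memory
public class CountingMiddleware : IDownloaderMiddleware
{
    public int Downloaded { get; private set; }

    public void PreDownload(string url)
    {
        Console.WriteLine($"fetching {url}");
    }

    public void PostDownload(string url, string responseBody)
    {
        Downloaded++;
    }
}

class MiddlewareDemo
{
    static void Main()
    {
        var mw = new CountingMiddleware();
        mw.PreDownload("https://example.com");
        mw.PostDownload("https://example.com", "<html/>");
        Console.WriteLine(mw.Downloaded);
    }
}
```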

If you want to store the scraped items in MongoDB, you can create a pipeline like the following:

public class MongoItemPipeline : IPipeline<JobItem>
{
    private MongoClient client = new MongoClient("mongodb://localhost:27017");

    //Note: ProcessItem returns void, so this insert is fire-and-forget;
    //exceptions from InsertOneAsync will not propagate to the caller
    public async void ProcessItem(JobItem item, ISpider spider)
    {
        var db = client.GetDatabase("Lianjia");
        var collection = db.GetCollection<JobItem>("JobItem");
        await collection.InsertOneAsync(item);
    }
}

Then add the pipeline to your project's appsetting.json:

"Pipelines": [
  { "Pipeline": "NScrapy.Project.MongoItemPipeline" }
],

Correspondingly, you can save the data as CSV by adding a CSV pipeline:

public class CSVItemPipeline : IPipeline<JobItem>
{
    private string startTime = DateTime.Now.ToString("yyyyMMddhhmm");

    public void ProcessItem(JobItem item, ISpider spider)
    {
        //JobItem in this sample only defines Title, Firm, Salary and Time
        var info = $"{item.Title},{item.Firm},{item.Salary},{item.Time}{System.Environment.NewLine}";
        Console.WriteLine(info);
        File.AppendAllText($"output-{startTime}.csv", info, Encoding.UTF8);
    }
}
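One caveat with building CSV lines by plain string interpolation: scraped titles and firm names frequently contain commas or quotes, which would break the column layout. A small escaping helper (not part of NScrapy; just standard CSV quoting in the style of RFC 4180) avoids that:

```csharp
using System;

static class Csv
{
    //Quote a field if it contains a comma, quote, or newline; double any embedded quotes
    public static string Escape(string field)
    {
        if (field == null) return "";
        if (field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) < 0) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }
}

class CsvDemo
{
    static void Main()
    {
        var title = "PHP Engineer, Senior";   //contains a comma
        var firm = "Acme \"Cloud\" Ltd";      //contains quotes
        Console.WriteLine($"{Csv.Escape(title)},{Csv.Escape(firm)}");
    }
}
```

In the pipeline above you would wrap each interpolated property in Csv.Escape before concatenating.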

And add the pipeline item to appsetting.json:

"Pipelines": [
  { "Pipeline": "NScrapy.Project.MongoItemPipeline" },
  { "Pipeline": "NScrapy.Project.CSVItemPipeline" }
],