zhupingqi / Ruiji.net

Licence: lgpl-3.0
crawler framework, distributed crawler extractor

Projects that are alternatives of or similar to Ruiji.net

Headless Chrome Crawler
Distributed crawler powered by Headless Chrome
Stars: ✭ 5,129 (+2231.36%)
Mutual labels:  crawler, scraper, headless-chrome
Goribot
[Crawler/Scraper for Golang]🕷A lightweight, distributed-friendly Golang crawler framework.
Stars: ✭ 190 (-13.64%)
Mutual labels:  crawler, scraper, scrapy
Scrapyrt
HTTP API for Scrapy spiders
Stars: ✭ 637 (+189.55%)
Mutual labels:  crawler, scraper, scrapy
Fbcrawl
A Facebook crawler
Stars: ✭ 536 (+143.64%)
Mutual labels:  crawler, scraper, scrapy
Scrapoxy
Scrapoxy hides your scraper behind a cloud. It starts a pool of proxies to send your requests. Now, you can crawl without thinking about blacklisting!
Stars: ✭ 1,322 (+500.91%)
Mutual labels:  crawler, scraper, scrapy
Python3 Spider
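Python crawlers in practice: simulated login for major websites, including but not limited to slider verification, Pinduoduo, Meituan, Baidu, bilibili, Dianping, and Taobao. If you like it, please star ❤️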
Stars: ✭ 2,129 (+867.73%)
Mutual labels:  crawler, scrapy
Instagram Scraper
scrapes medias, likes, followers, tags and all metadata. Inspired by instagram-php-scraper,bot
Stars: ✭ 2,209 (+904.09%)
Mutual labels:  crawler, scraper
Tianyancha
A pip-installable Tianyancha scraper API that saves the business registration data of one or more specified companies to Excel/JSON in one step. A battery-included scraper API for Tianyancha, the best Chinese business data and investigation platform.
Stars: ✭ 206 (-6.36%)
Mutual labels:  crawler, scraper
Colly
Elegant Scraper and Crawler Framework for Golang
Stars: ✭ 15,535 (+6961.36%)
Mutual labels:  crawler, scraper
Onegram
This repository is no longer maintained.
Stars: ✭ 137 (-37.73%)
Mutual labels:  crawler, scraper
Datmusic Api
Alternative for VK Audio API
Stars: ✭ 160 (-27.27%)
Mutual labels:  crawler, scraper
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (-22.27%)
Mutual labels:  crawler, scraper
Media Scraper
Scrapes all photos and videos in a web page / Instagram / Twitter / Tumblr / Reddit / pixiv / TikTok
Stars: ✭ 206 (-6.36%)
Mutual labels:  crawler, scraper
Youtube Projects
This repository contains all the code I use in my YouTube tutorials.
Stars: ✭ 144 (-34.55%)
Mutual labels:  crawler, scraper
Abot
Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Stars: ✭ 1,961 (+791.36%)
Mutual labels:  crawler, netcore
Google Play Scraper
Google play scraper for Python inspired by <facundoolano/google-play-scraper>
Stars: ✭ 143 (-35%)
Mutual labels:  crawler, scraper
Scrapingoutsourcing
ScrapingOutsourcing focuses on sharing crawler code, aiming for one update per week
Stars: ✭ 164 (-25.45%)
Mutual labels:  crawler, scrapy
Marmot
💐Marmot | Web Crawler/HTTP protocol Download Package 🐭
Stars: ✭ 186 (-15.45%)
Mutual labels:  crawler, scrapy
Github Spider
A crawler for analyzing GitHub repositories and users
Stars: ✭ 190 (-13.64%)
Mutual labels:  crawler, scrapy
Querylist
🕷️ The progressive PHP crawler framework! An elegant, progressive PHP scraping framework.
Stars: ✭ 2,392 (+987.27%)
Mutual labels:  crawler, scraper

About RuiJi Scraper

RuiJi Scraper is a browser plug-in built on RuiJi expressions. It lets you edit extraction rules visually and generates RuiJi expressions for RuiJi.Net. It is available for Firefox and Chrome.

Contributors

This project exists thanks to all the people who contribute.

Backers

Thank you to all our backers! 🙏 [Become a backer]

Sponsors

Support this project by becoming a sponsor. Your logo will show up here with a link to your website. [Become a sponsor]

Support us

https://www.scraperapi.com/?fp_ref=ruijinet

https://promotion.aliyun.com/ntms/yunparter/invite.html?userCode=r1bio67c&utm_source=r1bio67c

About RuiJi.Net

RuiJi.Net is a distributed crawl framework written in .NET Core.

RuiJi.Net is a self-hosted web API built with Microsoft.AspNetCore.Owin. Major features include a distributed crawler, a distributed extractor, and managed cookies.

RuiJi.Net supports IP polling, rotating requests across the server's public network addresses and proxy servers.

Documentation

Under construction: http://doc.ruijihg.com/

Demo

http://118.31.61.230:36000/

Features

Crawler

Feature | Support
webheader | custom
method | get/post
auto redirection | supported
cookie | managed/custom
service point ip | auto/custom bind
encoding | auto detect/specified
response | raw/string
proxy | http

Selectors

Type
CSS
REGEX
REGEXSPLIT
TEXTRANGE
EXCLUDE
REGEXREPLACE
JPATH
XPATH
CLEAR
EXPRESSION
SELECTORPROCESSOR

Extract structure

[Figure: RuiJi.Net extract structure]

Examples

crawl using the local ip automatically

var crawler = new RuiJiCrawler();
var request = new Request("https://www.baidu.com");
var response = crawler.Request(request);

crawl with a specific ip

var crawler = new RuiJiCrawler();
var request = new Request("https://www.baidu.com");
request.Ip = "192.168.31.196";
var response = crawler.Request(request);

crawl with proxy

var crawler = new RuiJiCrawler();
var request = new Request("https://www.baidu.com");
request.Proxy = new RequestProxy("223.93.172.248", 3128);

var response = crawler.Request(request);

extract url

var crawler = new RuiJiCrawler();
var request = new Request("https://www.oschina.net/blog");

var response = crawler.Request(request);
var content = response.Data.ToString();

var parser = new RuiJiParser();
var eb = parser.ParseExtract("css a.blog-title-link[href]\nexp https://my.oschina.net/*/blog/*");
var result = RuiJiExtractor.Extract(content, eb.Block);
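
The exp rule above appears to keep only the URLs that match a wildcard pattern. The sketch below imitates that idea with plain .NET regular expressions so the matching behaviour is easy to see. It does not call RuiJi.Net: the helper name WildcardMatch, the second sample URL, and the assumption that * matches any run of characters (including /) are all illustrative, and the real exp selector may behave differently. It uses top-level statements, so it needs C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of RuiJi.Net): returns true when the whole
// value matches a pattern in which * stands for any run of characters.
static bool WildcardMatch(string value, string pattern)
{
    // Escape regex metacharacters, then turn each escaped * back into ".*".
    string regex = "^" + Regex.Escape(pattern).Replace(@"\*", ".*") + "$";
    return Regex.IsMatch(value, regex);
}

// Pattern taken from the example above; the second URL is made up.
string rule = "https://my.oschina.net/*/blog/*";
Console.WriteLine(WildcardMatch("https://my.oschina.net/zhupingqi/blog/1826317", rule)); // True
Console.WriteLine(WildcardMatch("https://www.oschina.net/news/12345", rule));            // False
```

Anchoring the pattern with ^ and $ means a URL must match in full, which is one reasonable reading of a link filter.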

extract tile

var crawler = new RuiJiCrawler();
var request = new Request("http://www.ruijihg.com/archives/category/tech/bigdata");

var response = crawler.Request(request);
var content = response.Data.ToString();

var parser = new RuiJiParser();
var eb = parser.ParseExtract(@"[tile]
css article:html

[meta]
#title
css .entry-header:text

#summary
css .entry-header + p:text
ex /Read more »/ -e");

var result = RuiJiExtractor.Extract(content, eb.Block);

extract meta

var crawler = new RuiJiCrawler();
var request = new Request("https://my.oschina.net/zhupingqi/blog/1826317");

var response = crawler.Request(request);
var content = response.Data.ToString();

var parser = new RuiJiParser();
var eb = parser.ParseExtract(@"[meta]
#title
css h1.header:text

#author
css div.blog-meta .avatar + span:text

#date
css div.blog-meta > div.item:first:text
regS /发布于/ 1

#words_i
css div.blog-meta > div.item:eq(1):text
regS / / 1

#content
css #articleContent:html");

var result = RuiJiExtractor.Extract(content, eb.Block);
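
The regS rules in the #date and #words_i metas appear to split the selected text on a regular expression and keep the piece at the given index. The sketch below mimics that behaviour with the standard Regex.Split method. It does not call RuiJi.Net: the helper name RegexSplitPick, the trimming of the result, and the sample text are assumptions made for illustration. It uses top-level statements, so it needs C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of RuiJi.Net): split the input on a regex
// pattern and return the trimmed piece at the given index, or an empty
// string when that index does not exist.
static string RegexSplitPick(string input, string pattern, int index)
{
    string[] parts = Regex.Split(input, pattern);
    return index >= 0 && index < parts.Length ? parts[index].Trim() : string.Empty;
}

// Made-up sample text imitating a "published on <date>" label.
string text = "发布于 2018/06/08 10:00";
Console.WriteLine(RegexSplitPick(text, "发布于", 1)); // prints "2018/06/08 10:00"
```

Index 0 is the text before the first match (empty here), which is why the example rules ask for index 1.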

detect mime

var crawler = new RuiJiCrawler();
var request = new Request("http://img10.jiuxian.com/2018/0111/cd51bb851410404388155b3ec2c505cf4.jpg");
var response = crawler.Request(request);

var ex = response.Extensions;

RuiJi.Net Cluster

  1. Download ZooKeeper from an Apache mirror: http://mirrors.hust.edu.cn/apache/zookeeper/zookeeper-3.4.12/

  2. In the conf folder, copy zoo_sample.cfg, rename the copy to zoo.cfg, and set dataDir to a directory of your choice.

  3. Confirm that a Java runtime environment is installed.

  4. Start ZooKeeper by running bin/zkServer.cmd in your ZooKeeper folder.

  5. Compile RuiJi.Net.Cmd and run RuiJi.Net.Cmd.exe.
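
As a rough illustration of step 2, a standalone zoo.cfg might look like the following. The keys and default values are the ones shipped in ZooKeeper's zoo_sample.cfg; the dataDir path is a placeholder assumption and should point at a writable directory on your machine. The client port 2181 is the ZooKeeper default and matches the port in the startup output below.

```properties
# Minimal standalone ZooKeeper configuration (sketch)
tickTime=2000
initLimit=10
syncLimit=5
# Placeholder path (assumption): replace with your own data directory
dataDir=C:/zookeeper/data
clientPort=2181
```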

If you see the following output,

Server Start At http://x.x.x.x:x
proxy x.x.x.x:x ready to startup!
try connect to zookeeper server : x.x.x.x:2181
zookeeper server connected!

the service has started successfully.

RuiJi.Net.Cmd.exe must be run as an administrator.

The following example requests a page and extracts tiles from it using a Block built in code:

        var request = new Request("http://www.ruijihg.com/%e5%bc%80%e5%8f%91/");

        var response = Crawler.Request(request);

        if (response.StatusCode != System.Net.HttpStatusCode.OK)
            return;

        var content = response.Data.ToString();

        var block = new ExtractBlock();
        block.Selectors = new List<ISelector>
        {
            new CssSelector(".entry-content",CssTypeEnum.InnerHtml)
        };

        block.TileSelector = new ExtractTile
        {
            Selectors = new List<ISelector>
            {
                new CssSelector(".pt-cv-content-item",CssTypeEnum.InnerHtml)
            }
        };

        block.TileSelector.Metas.AddMeta("title", new List<ISelector> {
            new CssSelector(".pt-cv-title")
        });

        block.TileSelector.Metas.AddMeta("url", new List<ISelector> {
            new CssSelector(".pt-cv-readmore","href")
        });

        var r = Extractor.Extract(new ExtractRequest {
            Block = block,
            Content = content
        });

RuiJi Expression

A RuiJi expression is a quick way to write page extraction rules. RuiJi expressions are designed to be as simple and readable as possible. Before we start, we should first understand the rule model of RuiJi.Net.

A RuiJi expression follows the structure shown in the figure above. The unit of extraction is the Block, shown in the following figure.

Selectors: a list of selectors.
Tiles: regions that need to be extracted repeatedly.
Metas: the metadata that needs to be extracted.
Blocks: sub-Blocks that need to be extracted within the Block.

[Figure: structure of a Block]

To extract http://www.ruijihg.com/开发, first examine the structure of the page. You can press F12 to open the browser developer tools and inspect it.

[Figure: page structure in the browser developer tools]

First, make sure that the result of the Block selector is unique.

[Figure: Block selector matching a single element]

The definition of Block can be as follows

#content
css .pt-cv-view:ohtml

Next, add the tile

[tile]
    #tiles
    css .pt-cv-content-item:ohtml

    [meta]
    #title
    css .pt-cv-title:text

    #content
    css .pt-cv-content:html
    ex 阅读更多... -e

Note the tab indentation (\t). Because both a block and a tile can contain meta, the tile's selectors and the tile's meta are indented with \t to mark them as belonging to the current tile.

The complete Block description structure is as follows

[Block]
#blockname
selector

[blocks]
    @subblockname1
    @subblockname2

[tile]
    #tilename
    tile selector

    [meta]
    #meta1
    selector

    #meta2
    selector

[meta]
    #blockmeta1
    selector

    #blockmeta2
    selector

Admin Ui

Contact

Please contact me with any suggestions.

[email protected]

My website: www.ruijihg.com

QQ discussion group: 545931923

https://github.com/zhupingqi/RuiJi.Net

https://gitee.com/zhupingqi/RuiJi.Net
