2young2simple / Yispider

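A distributed crawler platform that helps you manage and develop crawlers more effectively. It ships with a built-in set of crawler definition rules (templates): you can use a template to define a crawler quickly, or use the project as a framework and write crawlers by hand. (A hobby project; it gets updated whenever something about it annoys me.)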

YiSpider

A distributed spider platform

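Introduction

A distributed crawler platform that helps you manage and develop crawlers more effectively. It ships with a built-in set of crawler definition rules (templates): you can use a template to define a crawler quickly, or use the project as a framework and write crawlers by hand.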

Roadmap

[x] Added more examples.
[x] Implemented a built-in Redis-based scheduler.
[ ] The web management UI is in preparation; stay tuned.

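Architecture

The framework currently has two parts:

1. Crawler part (spider node):

The internal structure is modeled on Python's Scrapy framework and consists mainly of four parts: core, schedule, page process, and pipline. Each crawler has its own scheduler and its own context management. Two pipline implementations are currently built in: console and file. Node information is registered in etcd so that manage nodes can discover it.

core: manages the crawler lifecycle and context, and runs the crawler.
schedule: schedules crawler requests (a channel-based or Redis-based scheduler).
process: processes the results of requests.
pipline: outputs results to different destinations, such as the console, files, message queues, and databases.
register: handles service registration (currently only etcd is supported).
http: provides some HTTP endpoints.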
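
The way these components hand work to each other can be sketched in a few lines of Go. This is only an illustration of the schedule → process → pipline flow described above, not YiSpider's actual code: `ChannelScheduler`, `Process`, `Pipline`, `MemoryPipline`, and `Run` are all hypothetical names, and the fetch step is faked so that the example needs no network.

```go
package main

import (
	"fmt"
	"strings"
)

// Request is a hypothetical, simplified stand-in for a crawl request.
type Request struct{ URL string }

// ChannelScheduler is a hypothetical scheduler backed by a buffered channel.
type ChannelScheduler struct{ queue chan Request }

func NewChannelScheduler(size int) *ChannelScheduler {
	return &ChannelScheduler{queue: make(chan Request, size)}
}

// Push enqueues a request (it blocks if the buffer is full).
func (s *ChannelScheduler) Push(r Request) { s.queue <- r }

// Pop returns the next request, or false when the queue is empty.
func (s *ChannelScheduler) Pop() (Request, bool) {
	select {
	case r := <-s.queue:
		return r, true
	default:
		return Request{}, false
	}
}

// Process turns a fetched body into result items.
type Process func(body string) []string

// Pipline receives every result item and writes it somewhere.
type Pipline interface{ Write(item string) }

// MemoryPipline collects items in memory; a real one would write to the
// console or a file.
type MemoryPipline struct{ Items []string }

func (p *MemoryPipline) Write(item string) { p.Items = append(p.Items, item) }

// Run drains the scheduler: fetch each request, process the body, and send
// every resulting item to the pipline.
func Run(s *ChannelScheduler, fetch func(Request) string, proc Process, out Pipline) {
	for {
		r, ok := s.Pop()
		if !ok {
			return
		}
		for _, item := range proc(fetch(r)) {
			out.Write(item)
		}
	}
}

func main() {
	s := NewChannelScheduler(8)
	s.Push(Request{URL: "http://example.com/a"})
	s.Push(Request{URL: "http://example.com/b"})

	fetch := func(r Request) string { return "body of " + r.URL } // fake fetch, no network
	proc := func(body string) []string { return []string{strings.ToUpper(body)} }
	out := &MemoryPipline{}

	Run(s, fetch, proc, out)
	fmt.Println(out.Items)
}
```

Keeping the scheduler behind `Push` and `Pop` is what would let a Redis-backed queue replace the channel without touching the process or pipline steps.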

2. Management part (manage node):

Manages spider nodes. It uses etcd to discover them and communicates with them over HTTP.

Getting started

Examples

The example-spider package contains many examples:

Bilibili
Dilidili
Douban Movie
Qdaily
JD.com
Qyer
Qiushibaike
Tuicool
NetEase Cloud Music

Requests

The URL of an initial request (Request) supports two forms of syntactic sugar for convenience; a third form is available only in AddQueue:

1. http://xxx/xxx/{begin-end,offset}

start = 0 20 40 ... 10000
url = https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0-10000,20}

2. http://xxx/xxx/{aa|bb|cc}

start = 0 20 40 60
url = https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0|20|40|60}

3. http://www.dilidili.wang{$href} (AddQueue only)

If href = "/abc" (href is a parameter parsed out by the process), then
url = http://www.dilidili.wang{$href}
resolves to
url = http://www.dilidili.wang/abc
A parsed parameter can also appear inside a range, for example:
url = https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0-$count,20}
and so on.
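
The three URL forms above can be sketched as two small Go helpers. This is an illustration of the syntax as described here, not the parser YiSpider actually uses: `Substitute`, `Expand`, and `groupValues` are hypothetical names, the range is assumed to include its end value (matching the `0 20 40 ... 10000` example), and the set of characters allowed in a `$name` is a guess.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

var (
	braceGroup  = regexp.MustCompile(`\{[^{}]*\}`)
	variableRef = regexp.MustCompile(`\$([A-Za-z0-9_-]+)`)
	rangeGroup  = regexp.MustCompile(`^\{(\d+)-(\d+),(\d+)\}$`)
)

// Substitute replaces {$name} with params[name]. A $name embedded in a larger
// group, such as {0-$count,20}, is replaced in place and the braces are kept.
// Unknown names are left untouched.
func Substitute(url string, params map[string]string) string {
	lookup := func(token string) string {
		if v, ok := params[token[1:]]; ok {
			return v
		}
		return token
	}
	return braceGroup.ReplaceAllStringFunc(url, func(group string) string {
		inner := group[1 : len(group)-1]
		if loc := variableRef.FindStringIndex(inner); loc != nil && loc[0] == 0 && loc[1] == len(inner) {
			if v, ok := params[inner[1:]]; ok {
				return v // whole group was {$name}
			}
			return group
		}
		return "{" + variableRef.ReplaceAllStringFunc(inner, lookup) + "}"
	})
}

// groupValues returns the values one brace group stands for. Groups that are
// neither a range nor an alternation are returned literally.
func groupValues(group string) []string {
	if m := rangeGroup.FindStringSubmatch(group); m != nil {
		begin, err1 := strconv.Atoi(m[1])
		end, err2 := strconv.Atoi(m[2])
		step, err3 := strconv.Atoi(m[3])
		if err1 != nil || err2 != nil || err3 != nil || step <= 0 {
			return []string{group}
		}
		var values []string
		for v := begin; v <= end; v += step { // an empty range yields no values
			values = append(values, strconv.Itoa(v))
		}
		return values
	}
	inner := group[1 : len(group)-1]
	if strings.Contains(inner, "|") {
		return strings.Split(inner, "|")
	}
	return []string{group}
}

// Expand turns a URL containing {begin-end,offset} or {a|b|c} groups into the
// list of concrete URLs, handling the groups from left to right.
func Expand(url string) []string {
	loc := braceGroup.FindStringIndex(url)
	if loc == nil {
		return []string{url}
	}
	prefix, group := url[:loc[0]], url[loc[0]:loc[1]]
	tails := Expand(url[loc[1]:])
	var urls []string
	for _, v := range groupValues(group) {
		for _, tail := range tails {
			urls = append(urls, prefix+v+tail)
		}
	}
	return urls
}

func main() {
	fmt.Println(Expand("http://xxx/xxx/{aa|bb|cc}"))
	fmt.Println(Substitute("http://www.dilidili.wang{$href}", map[string]string{"href": "/abc"}))
	fmt.Println(Expand(Substitute("http://xxx/?start={0-$count,20}", map[string]string{"count": "40"})))
}
```

Running the substitution before the expansion is a design assumption: it lets a parsed value such as `$count` set the upper bound of a range before that range is unrolled.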

Worked examples

1. JSON template

Calling the HTTP API
curl -d '{"id":"douban-movie","Name":"douban-movie","request":[{"url":"https://movie.douban.com/j/new_search_subjects?sort=T\u0026range=0,10\u0026tags=\u0026start={0-100,20}","method":"get","type":"","data":null,"header":null,"cookies":{"url":"","data":""},"process_name":"movie"}],"process":[{"name":"movie","reg_url":null,"type":"json","template_rule":{"Rule":null},"json_rule":{"Rule":{"casts":"casts","cover":"cover","id":"id","node":"array|data","rate":"rate","star":"star","title":"title","url":"url"}},"add_queue":null}],"pipline":"file","depth":0,"end_count":0}' "http://127.0.0.1:7774/task/addAndRun"
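
The same call can be made from Go. The sketch below posts a template to the `/task/addAndRun` path used by the curl command. `SubmitTask` and `startFakeNode` are hypothetical helpers, and sending `application/json` as the content type is an assumption (curl's `-d` sends `application/x-www-form-urlencoded` by default, and the source does not say whether the node checks the header). A local stand-in server is used so that the example runs without a real spider node.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
)

// SubmitTask posts a JSON task template to a spider node's /task/addAndRun
// endpoint and returns the HTTP status code of the reply.
func SubmitTask(baseURL string, template []byte) (int, error) {
	resp, err := http.Post(baseURL+"/task/addAndRun", "application/json", bytes.NewReader(template))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	// Drain the body so the underlying connection can be reused.
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return resp.StatusCode, err
	}
	return resp.StatusCode, nil
}

// startFakeNode starts a local stand-in for a spider node. It accepts POSTs
// to /task/addAndRun and remembers the last body it received.
func startFakeNode() (baseURL string, lastBody func() string, stop func()) {
	var mu sync.Mutex
	var body string
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/task/addAndRun" || r.Method != http.MethodPost {
			http.NotFound(w, r)
			return
		}
		data, _ := io.ReadAll(r.Body)
		mu.Lock()
		body = string(data)
		mu.Unlock()
		w.WriteHeader(http.StatusOK)
	}))
	read := func() string {
		mu.Lock()
		defer mu.Unlock()
		return body
	}
	return srv.URL, read, srv.Close
}

func main() {
	baseURL, lastBody, stop := startFakeNode()
	defer stop()

	status, err := SubmitTask(baseURL, []byte(`{"id":"douban-movie","pipline":"file"}`))
	if err != nil {
		fmt.Println("submit failed:", err)
		return
	}
	fmt.Println(status, lastBody())
}
```

Against a real node, `baseURL` would be the node's address, for example `http://127.0.0.1:7774` as in the curl command.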

Douban Movie template

{
    "id": "douban-movie",
    "Name": "douban-movie",
    "request": [
        {
            "url": "https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0-10,20}",
            "method": "get",
            "process_name": "movie"
        }
    ],
    "process": [
        {
            "name": "movie",
            "type": "json",
            "json_rule": {
                "Rule": {
                    "casts": "casts",
                    "cover": "cover",
                    "id": "id",
                    "node": "array|data",
                    "rate": "rate",
                    "star": "star",
                    "title": "title",
                    "url": "url"
                }
            },
            "add_queue": null
        }
    ],
    "pipline": "file",
    "depth": 0,
    "end_count": 0
}
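
Before a template like this is submitted, it can be useful to check that every `process_name` refers to a process the template actually defines. The sketch below does that with hypothetical Go types that mirror only the JSON keys visible in the template; they are not YiSpider's `model` types, and `CheckTemplate` is an illustrative helper rather than part of the project.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Request mirrors the request objects of the template (hypothetical type).
type Request struct {
	URL         string `json:"url"`
	Method      string `json:"method"`
	ProcessName string `json:"process_name"`
}

// Process mirrors the process objects; the rule maps are ignored here.
type Process struct {
	Name     string    `json:"name"`
	Type     string    `json:"type"`
	AddQueue []Request `json:"add_queue"`
}

// Template mirrors the top-level keys of the template.
type Template struct {
	ID      string    `json:"id"`
	Name    string    `json:"Name"`
	Request []Request `json:"request"`
	Process []Process `json:"process"`
	Pipline string    `json:"pipline"`
}

// CheckTemplate parses a template and verifies that every process_name, in
// both the initial requests and the add_queue entries, names a defined process.
func CheckTemplate(data []byte) error {
	var t Template
	if err := json.Unmarshal(data, &t); err != nil {
		return fmt.Errorf("invalid JSON: %w", err)
	}
	known := make(map[string]bool)
	for _, p := range t.Process {
		known[p.Name] = true
	}
	check := func(reqs []Request) error {
		for _, r := range reqs {
			if !known[r.ProcessName] {
				return fmt.Errorf("request %q refers to unknown process %q", r.URL, r.ProcessName)
			}
		}
		return nil
	}
	if err := check(t.Request); err != nil {
		return err
	}
	for _, p := range t.Process {
		if err := check(p.AddQueue); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	tpl := []byte(`{
		"id": "douban-movie",
		"request": [{"url": "https://movie.douban.com/j/new_search_subjects", "method": "get", "process_name": "movie"}],
		"process": [{"name": "movie", "type": "json", "add_queue": null}],
		"pipline": "file"
	}`)
	fmt.Println(CheckTemplate(tpl))
}
```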

dilidili template

{
    "id": "dilidili",
    "Name": "dilidili",
    "request": [
        {
            "url": "http://www.dilidili.wang/{gaoxiao|kehuan|yundong|danmei|zhiyuxi|luoli|zhenren|zhuangbi|youxi|tuili|qingchun|kongbu|jizhan|rexue|qingxiaoshuo|maoxian|hougong|qihuan|tongnian|lianai|meishaonv|lizhi|baihe|paomianfan|yinv}/",
            "method": "get",
            "process_name": "animelist"
        }
    ],
    "process": [
        {
            "name": "animelist",
            "type": "template",
            "template_rule": {
                "Rule": {
                    "content": "text|dd div",
                    "desc": "text|dd p",
                    "href": "attr.href|dt a",
                    "img": "attr.src|dt a img",
                    "node": "array|.anime_list dl",
                    "title": "text|dd h3 a"
                }
            },
            "add_queue": [
                {
                    "url": "http://www.dilidili.wang{$href}",
                    "method": "get",
                    "process_name": "animeinfo"
                }
            ]
        },
        {
            "name": "animeinfo",
            "type": "template",
            "template_rule": {
                "Rule": {
                    "episode": "texts|.time_con .swiper-slide .clear li a em",
                    "episode-link": "attrs.href|.time_con .swiper-slide .clear li a",
                    "title": "text|.detail dl dd h1"
                }
            },
            "add_queue": [
                {
                    "url": "{$episode-link}",
                    "method": "get",
                    "process_name": "episodeinfo"
                }
            ]
        },
        {
            "name": "episodeinfo",
            "reg_url": null,
            "type": "template",
            "template_rule": {
                "Rule": {
                    "player": "attr.src|.player_main iframe",
                    "title": "text|#intro2 h1",
                    "url": "attr.href|link[rel=\"canonical\"]"
                }
            },
            "add_queue": null
        }
    ],
    "pipline": "file",
    "depth": 0,
    "end_count": 0
}
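
The rule strings in this template appear to follow a `kind|selector` shape, where the kind may carry an attribute name after a dot (`attr.href`, `attrs.href`). The sketch below splits a rule string into those parts. It is only a reading of the examples shown here: `Rule` and `ParseRule` are hypothetical, and what each kind means (for instance that `texts` and `attrs` collect several matches while `array` marks the repeating node) is inferred from the names rather than stated by the source.

```go
package main

import (
	"fmt"
	"strings"
)

// Rule is a hypothetical parsed form of a template rule string.
type Rule struct {
	Kind     string // e.g. "text", "texts", "attr", "attrs", "array"
	Attr     string // attribute name for attr/attrs rules, otherwise empty
	Selector string // everything after the first "|"
}

// ParseRule splits "kind|selector" or "kind.attr|selector". It returns false
// for strings without a "|", such as the plain key names used in json_rule.
func ParseRule(s string) (Rule, bool) {
	parts := strings.SplitN(s, "|", 2)
	if len(parts) != 2 {
		return Rule{}, false
	}
	kind, attr := parts[0], ""
	if i := strings.Index(kind, "."); i >= 0 {
		kind, attr = kind[:i], kind[i+1:]
	}
	return Rule{Kind: kind, Attr: attr, Selector: parts[1]}, true
}

func main() {
	for _, s := range []string{
		"array|.anime_list dl",
		"attr.href|dt a",
		"text|dd h3 a",
		`attr.href|link[rel="canonical"]`,
	} {
		r, ok := ParseRule(s)
		fmt.Printf("%q -> %+v (ok=%v)\n", s, r, ok)
	}
}
```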

2. Writing a template in code

Douban Movie

package main

import (
	"YiSpider/spider"
	"YiSpider/spider/model"
	spider2 "YiSpider/spider/spider"
)

func main() {
	// Describe the crawl as a task: the initial request plus the rule that
	// parses each JSON response.
	task := &model.Task{
		Id:   "douban-movie",
		Name: "douban-movie",
		Request: []*model.Request{
			{
				Method:      "get",
				Url:         "https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0-10000,20}",
				ProcessName: "movie",
			},
		},
		Process: []model.Process{
			{
				Name: "movie",
				Type: "json",
				JsonRule: model.JsonRule{
					Rule: map[string]string{
						"node":  "array|data",
						"rate":  "rate",
						"star":  "star",
						"id":    "id",
						"url":   "url",
						"title": "title",
						"cover": "cover",
						"casts": "casts",
					},
				},
			},
		},
		Pipline: "file",
	}

	// Build a spider from the task and run it.
	app := spider.New()
	app.AddSpider(spider2.InitWithTask(task))
	app.Run()
}

dilidili anime series

package main

import (
	"YiSpider/spider"
	"YiSpider/spider/model"
	spider2 "YiSpider/spider/spider"
)

func main() {
	task := &model.Task{
		Id:   "dilidili",
		Name: "dilidili",
		// One initial request per category, via the {a|b|c} URL syntax.
		Request: []*model.Request{
			{
				Method:      "get",
				Url:         "http://www.dilidili.wang/{gaoxiao|kehuan|yundong|danmei|zhiyuxi|luoli|zhenren|zhuangbi|youxi|tuili|qingchun|kongbu|jizhan|rexue|qingxiaoshuo|maoxian|hougong|qihuan|tongnian|lianai|meishaonv|lizhi|baihe|paomianfan|yinv}/",
				ProcessName: "animelist",
			},
		},
		Process: []model.Process{
			{
				// Category page: list the series and queue each detail page.
				Name: "animelist",
				Type: "template",
				TemplateRule: model.TemplateRule{
					Rule: map[string]string{
						"node":    "array|.anime_list dl",
						"img":     "attr.src|dt a img",
						"title":   "text|dd h3 a",
						"href":    "attr.href|dt a",
						"content": "text|dd div",
						"desc":    "text|dd p",
					},
				},
				AddQueue: []*model.Request{
					{
						Method:      "get",
						Url:         "http://www.dilidili.wang{$href}",
						ProcessName: "animeinfo",
					},
				},
			},
			{
				// Series page: collect the episodes and queue each episode link.
				Name: "animeinfo",
				Type: "template",
				TemplateRule: model.TemplateRule{
					Rule: map[string]string{
						"episode":      "texts|.time_con .swiper-slide .clear li a em",
						"title":        "text|.detail dl dd h1",
						"episode-link": "attrs.href|.time_con .swiper-slide .clear li a",
					},
				},
				AddQueue: []*model.Request{
					{
						Method:      "get",
						Url:         "{$episode-link}",
						ProcessName: "episodeinfo",
					},
				},
			},
			{
				// Episode page: extract the player URL; nothing further is queued.
				Name: "episodeinfo",
				Type: "template",
				TemplateRule: model.TemplateRule{
					Rule: map[string]string{
						"url":    "attr.href|link[rel=\"canonical\"]",
						"title":  "text|#intro2 h1",
						"player": "attr.src|.player_main iframe",
					},
				},
			},
		},
		Pipline: "file",
	}

	app := spider.New()
	app.AddSpider(spider2.InitWithTask(task))
	app.Run()
}

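Writing pure code

package main

import (
	"encoding/json"

	"YiSpider/spider"
	"YiSpider/spider/model"
	spider2 "YiSpider/spider/spider"
	// Also import the package that provides file.NewFilePipline; its import
	// path is not shown in the original.
)

// Movies is the top-level shape of the Douban response.
type Movies struct {
	Datas []Movie `json:"data"`
}

// Movie is one entry of the "data" array.
type Movie struct {
	Rate  string   `json:"rate"`
	Star  string   `json:"star"` // matches the "star" key used in the templates above
	Id    string   `json:"id"`
	Url   string   `json:"url"`
	Title string   `json:"title"`
	Cover string   `json:"cover"`
	Casts []string `json:"casts"`
}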

type PageProcess struct{}

func (p *PageProcess) Process(context model.Context) (*model.Page, error) {
	movies := Movies{}
	if err := json.Unmarshal(context.Body, &movies); err != nil {
		return nil, err
	}
	page := &model.Page{}
	for _, movie := range movies.Datas {
		page.AddResult(movie)
	}
	return page, nil
}

func main() {
	sp := &spider2.Spider{}
	sp.Name = "douban-movie-code"
	sp.Id = "douban-movie-code"
	sp.Requests = []*model.Request{
		{
			Method:      "get",
			Url:         "https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0-10000,20}",
			ProcessName: "movie",
		},
	}
	sp.AddProcess("movie", &PageProcess{})
	sp.Pipline = file.NewFilePipline("./")

	app := spider.New()
	app.AddSpider(sp)
	app.Run()
}
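
The decoding step inside `Process` can be tried on its own, without the framework. The sketch below keeps only the `json.Unmarshal` part and runs it on a made-up response body; the sample values are invented for illustration and are not real Douban data, `ParseMovies` is a hypothetical helper, and the struct is trimmed to three fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Movie keeps a subset of the fields from the example above.
type Movie struct {
	Rate  string   `json:"rate"`
	Title string   `json:"title"`
	Casts []string `json:"casts"`
}

// Movies matches a response whose entries sit under the "data" key.
type Movies struct {
	Datas []Movie `json:"data"`
}

// ParseMovies decodes a response body into its list of movies.
func ParseMovies(body []byte) ([]Movie, error) {
	var movies Movies
	if err := json.Unmarshal(body, &movies); err != nil {
		return nil, err
	}
	return movies.Datas, nil
}

func main() {
	// Invented sample body with the same shape as the struct tags expect.
	body := []byte(`{"data":[{"rate":"9.0","title":"Example Movie","casts":["Actor A","Actor B"]}]}`)

	movies, err := ParseMovies(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, m := range movies {
		fmt.Println(m.Title, m.Rate, m.Casts)
	}
}
```

In the full example, each decoded movie is handed to `page.AddResult` so that the configured pipline can write it out.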
