2young2simple / Yispider
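A distributed crawler platform that helps you manage and develop crawlers more effectively. It ships with a built-in set of crawler definition rules (templates): you can use a template to define a crawler quickly, or use the project as a framework and write crawlers by hand. (A hobby project; it gets updated whenever something about it annoys me.)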
Stars: ✭ 158
Projects that are alternatives of or similar to Yispider
Python3 Spider
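Hands-on Python crawlers: simulated login for major websites, including but not limited to slider CAPTCHA verification, Pinduoduo, Meituan, Baidu, bilibili, Dianping, and Taobao. If you like it, please star ❤️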
Stars: ✭ 2,129 (+1247.47%)
Mutual labels: crawler, spider
Weibo Topic Spider
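Weibo super-topic crawler with Weibo word-frequency statistics, sentiment analysis, and simple classification; adds crawled data for the pneumonia super topic.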
Stars: ✭ 128 (-18.99%)
Mutual labels: crawler, spider
Go spider
[Crawler framework (golang)] An awesome Go concurrent crawler (spider) framework. The crawler is flexible and modular. It can easily be extended into a customized crawler, or you can use only the default crawl components.
Stars: ✭ 1,745 (+1004.43%)
Mutual labels: crawler, spider
Examples Of Web Crawlers
Some interesting examples of Python crawlers that are friendly to beginners, mainly crawling sites such as Taobao, Tmall, WeChat, Douban, and QQ.
Stars: ✭ 10,724 (+6687.34%)
Mutual labels: crawler, spider
Crawlab Lite
Lite version of Crawlab, a crawler management platform.
Stars: ✭ 122 (-22.78%)
Mutual labels: crawler, spider
Not Your Average Web Crawler
A web crawler (for bug hunting) that gathers more than you can imagine.
Stars: ✭ 107 (-32.28%)
Mutual labels: crawler, spider
Jlitespider
A lite distributed Java spider framework :-)
Stars: ✭ 151 (-4.43%)
Mutual labels: crawler, spider
Decryptlogin
APIs for logging in to some websites using requests.
Stars: ✭ 1,861 (+1077.85%)
Mutual labels: crawler, spider
Baiduspider
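BaiduSpider, a crawler for Baidu search results. It currently supports Baidu web search, image search, Zhidao search, video search, news search, Wenku search, Jingyan search, and Baike search.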
Stars: ✭ 105 (-33.54%)
Mutual labels: crawler, spider
Skycaiji
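Skycaiji (蓝天采集器) is a free crawler for collecting and publishing data, built with PHP and MySQL. It can be deployed on a cloud server, can collect almost any type of web page, integrates seamlessly with various CMS site builders, publishes data in real time without login, and runs fully automatically without manual intervention. It is a fully cross-platform, cloud-based crawler system for large-scale web data collection.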
Stars: ✭ 1,514 (+858.23%)
Mutual labels: crawler, spider
Crawler Detect
🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent
Stars: ✭ 1,549 (+880.38%)
Mutual labels: crawler, spider
Digger
Digger is a powerful and flexible web crawler implemented by pure golang
Stars: ✭ 130 (-17.72%)
Mutual labels: crawler, spider
Amazonbigspider
😱 Fully automatic Amazon distributed spider | distributed product-selection collection across four international Amazon sites | account: admin, password: adminadmin
Stars: ✭ 140 (-11.39%)
Mutual labels: crawler, spider
YiSpider
A distributed spider platform
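Introduction
A distributed crawler platform that helps you manage and develop crawlers more effectively. It ships with a built-in set of crawler definition rules (templates): you can use a template to define a crawler quickly, or use the project as a framework and write crawlers by hand.
Roadmap
- [x] Added more examples.
- [x] Implemented a built-in redis-based scheduler.
- [ ] A web-based management UI is in preparation; stay tuned.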
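Architecture
The framework currently consists of two parts:
1. Spider part (spider node):
The internal structure is modeled on the Python scrapy framework and is built mainly from the schedule, page process, and pipline components. Each spider has its own scheduler and its own context management. Two pipline types are currently built in: console and file. Node information is registered in etcd so that manage nodes can discover it.
- core: manages the spider lifecycle and context, and runs the spider.
- schedule: schedules spider requests (channel-based or redis-based scheduler).
- process: handles the results of requests.
- pipline: outputs results to different channels, such as the console, files, message queues, or databases.
- register: handles service registration (currently only etcd is supported).
- http: provides some HTTP endpoints.
2. Management part (manage node):
Manages the spider nodes, uses etcd to discover them, and communicates with them over HTTP.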
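As an illustration of the channel-based scheduling mentioned for the schedule component, here is a minimal sketch in Go. The `Request`, `Scheduler`, and `ChannelScheduler` names and their methods are assumptions made for this example; they are not taken from YiSpider's source, whose actual interfaces may differ.

```go
package main

import "fmt"

// Request is a hypothetical stand-in for the framework's request type.
type Request struct {
	Method string
	URL    string
}

// Scheduler is an assumed interface for illustration, not YiSpider's actual one.
type Scheduler interface {
	Push(r *Request)
	Pop() (*Request, bool) // ok is false once the queue is closed and drained
	Close()
}

// ChannelScheduler queues requests in a buffered channel (FIFO order).
type ChannelScheduler struct {
	queue chan *Request
}

// NewChannelScheduler creates a scheduler that buffers up to size requests.
func NewChannelScheduler(size int) *ChannelScheduler {
	return &ChannelScheduler{queue: make(chan *Request, size)}
}

// Push enqueues a request; it blocks while the buffer is full.
func (s *ChannelScheduler) Push(r *Request) { s.queue <- r }

// Pop dequeues the next request; it blocks while the queue is empty and open.
func (s *ChannelScheduler) Pop() (*Request, bool) {
	r, ok := <-s.queue
	return r, ok
}

// Close signals that no more requests will be pushed.
func (s *ChannelScheduler) Close() { close(s.queue) }

func main() {
	var s Scheduler = NewChannelScheduler(8)
	s.Push(&Request{Method: "get", URL: "http://example.com/a"})
	s.Push(&Request{Method: "get", URL: "http://example.com/b"})
	s.Close()
	for {
		r, ok := s.Pop()
		if !ok {
			break
		}
		fmt.Println(r.Method, r.URL)
	}
}
```

A redis-based variant could presumably satisfy the same interface by storing requests in a redis list, which would let several spider nodes share one queue.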
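Getting started
Examples
The example-spider package contains many examples:
- Bilibili (哔哩哔哩)
- Dilidili (嘀哩嘀哩)
- Douban Movie (豆瓣电影)
- Qdaily (好奇心日报)
- JD.com (京东)
- Qyer (穷游)
- Qiushibaike (糗百)
- Tuiku (推库)
- NetEase Cloud Music (网易云音乐)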
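Requests
The URL of an initial request (Request) supports two forms of syntactic sugar for convenience; a third form is available only in AddQueue.
1. Range: http://xxx/xxx/{begin-end,offset}, where offset is the step between values and end is included.
start = 0 20 40 ... 10000
url = https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0-10000,20}
2. Alternation: http://xxx/xxx/{aa|bb|cc}
start = 0 20 40 60
url = https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0|20|40|60}
3. Variable substitution (AddQueue only): http://www.dilidili.wang{$href}
If href = "/abc" (href is a parameter parsed by the process), then
url = http://www.dilidili.wang{$href}
becomes
url = http://www.dilidili.wang/abc
A variable can also appear inside a range:
url = https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0-$count,20}
and so on.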
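The range and alternation forms above can be sketched as a small expansion function. This is only an illustration of the syntax as described here, not YiSpider's actual implementation; the `ExpandURL` name, the regular expressions, and the example.com URLs are assumptions made for this example.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

var (
	// Matches {begin-end,step}, for example {0-100,20}.
	rangeRe = regexp.MustCompile(`\{(\d+)-(\d+),(\d+)\}`)
	// Matches {a|b|c} with at least two alternatives.
	altRe = regexp.MustCompile(`\{([^{}$|]+(?:\|[^{}$|]+)+)\}`)
)

// ExpandURL expands every {begin-end,step} and {a|b|c} placeholder in url
// and returns all resulting URLs. The end of a range is inclusive.
// A URL without such placeholders is returned unchanged.
func ExpandURL(url string) []string {
	if m := rangeRe.FindStringSubmatchIndex(url); m != nil {
		begin, _ := strconv.Atoi(url[m[2]:m[3]])
		end, _ := strconv.Atoi(url[m[4]:m[5]])
		step, _ := strconv.Atoi(url[m[6]:m[7]])
		if step <= 0 {
			return []string{url} // a zero step would never terminate
		}
		var out []string
		for v := begin; v <= end; v += step {
			expanded := url[:m[0]] + strconv.Itoa(v) + url[m[1]:]
			out = append(out, ExpandURL(expanded)...)
		}
		return out
	}
	if m := altRe.FindStringSubmatchIndex(url); m != nil {
		var out []string
		for _, alt := range strings.Split(url[m[2]:m[3]], "|") {
			expanded := url[:m[0]] + alt + url[m[1]:]
			out = append(out, ExpandURL(expanded)...)
		}
		return out
	}
	return []string{url}
}

func main() {
	for _, u := range ExpandURL("http://example.com/list?start={0-60,20}") {
		fmt.Println(u)
	}
	for _, u := range ExpandURL("http://example.com/{aa|bb|cc}/") {
		fmt.Println(u)
	}
}
```

Under these assumptions, `{0-60,20}` yields four URLs (start = 0, 20, 40, 60) and `{aa|bb|cc}` yields three. The `{$name}` form is deliberately left untouched, since it depends on values parsed at crawl time.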
Examples
1. JSON template
Calling the HTTP API
curl -d '{"id":"douban-movie","Name":"douban-movie","request":[{"url":"https://movie.douban.com/j/new_search_subjects?sort=T\u0026range=0,10\u0026tags=\u0026start={0-100,20}","method":"get","type":"","data":null,"header":null,"cookies":{"url":"","data":""},"process_name":"movie"}],"process":[{"name":"movie","reg_url":null,"type":"json","template_rule":{"Rule":null},"json_rule":{"Rule":{"casts":"casts","cover":"cover","id":"id","node":"array|data","rate":"rate","star":"star","title":"title","url":"url"}},"add_queue":null}],"pipline":"file","depth":0,"end_count":0}' "http://127.0.0.1:7774/task/addAndRun"
Douban Movie template
{
  "id": "douban-movie",
  "Name": "douban-movie",
  "request": [
    {
      "url": "https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0-10,20}",
      "method": "get",
      "process_name": "movie"
    }
  ],
  "process": [
    {
      "name": "movie",
      "type": "json",
      "json_rule": {
        "Rule": {
          "casts": "casts",
          "cover": "cover",
          "id": "id",
          "node": "array|data",
          "rate": "rate",
          "star": "star",
          "title": "title",
          "url": "url"
        }
      },
      "add_queue": null
    }
  ],
  "pipline": "file",
  "depth": 0,
  "end_count": 0
}
dilidili template
{
  "id": "dilidili",
  "Name": "dilidili",
  "request": [
    {
      "url": "http://www.dilidili.wang/{gaoxiao|kehuan|yundong|danmei|zhiyuxi|luoli|zhenren|zhuangbi|youxi|tuili|qingchun|kongbu|jizhan|rexue|qingxiaoshuo|maoxian|hougong|qihuan|tongnian|lianai|meishaonv|lizhi|baihe|paomianfan|yinv}/",
      "method": "get",
      "process_name": "animelist"
    }
  ],
  "process": [
    {
      "name": "animelist",
      "type": "template",
      "template_rule": {
        "Rule": {
          "content": "text|dd div",
          "desc": "text|dd p",
          "href": "attr.href|dt a",
          "img": "attr.src|dt a img",
          "node": "array|.anime_list dl",
          "title": "text|dd h3 a"
        }
      },
      "add_queue": [
        {
          "url": "http://www.dilidili.wang{href}",
          "method": "get",
          "process_name": "animeinfo"
        }
      ]
    },
    {
      "name": "animeinfo",
      "type": "template",
      "template_rule": {
        "Rule": {
          "episode": "texts|.time_con .swiper-slide .clear li a em",
          "episode-link": "attrs.href|.time_con .swiper-slide .clear li a",
          "title": "text|.detail dl dd h1"
        }
      },
      "add_queue": [
        {
          "url": "{episode-link}",
          "method": "get",
          "process_name": "episodeinfo"
        }
      ]
    },
    {
      "name": "episodeinfo",
      "reg_url": null,
      "type": "template",
      "template_rule": {
        "Rule": {
          "player": "attr.src|.player_main iframe",
          "title": "text|#intro2 h1",
          "url": "attr.href|link[rel=\"canonical\"]"
        }
      },
      "add_queue": null
    }
  ],
  "pipline": "file",
  "depth": 0,
  "end_count": 0
}
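The source does not document the rule syntax used in the templates above. Reading the examples, each rule value appears to have the form `kind|selector`, where kind is one of `text`, `texts`, `attr.<name>`, `attrs.<name>`, or `array`, and a value without `|` (as in the JSON rules) is a plain key. The sketch below parses that shape. The `Rule` type, the `ParseRule` function, and this interpretation of the syntax are all assumptions inferred from the examples, not YiSpider's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// Rule is a hypothetical parsed form of one template rule value.
type Rule struct {
	Kind     string // "text", "texts", "attr", "attrs", "array", or "" if no kind is given
	Attr     string // attribute name for "attr.<name>" and "attrs.<name>"
	Selector string // CSS selector (template rules) or key (JSON rules)
}

// ParseRule splits a rule value at the first "|" into kind and selector,
// then splits the kind at the first "." into a kind name and attribute name.
func ParseRule(s string) Rule {
	parts := strings.SplitN(s, "|", 2)
	if len(parts) == 1 {
		return Rule{Selector: s}
	}
	kindParts := strings.SplitN(parts[0], ".", 2)
	r := Rule{Kind: kindParts[0], Selector: parts[1]}
	if len(kindParts) == 2 {
		r.Attr = kindParts[1]
	}
	return r
}

func main() {
	for _, s := range []string{
		"array|.anime_list dl",
		"attr.href|dt a",
		"texts|.time_con .swiper-slide .clear li a em",
		"rate",
	} {
		fmt.Printf("%+v\n", ParseRule(s))
	}
}
```

On this reading, the plural kinds (`texts`, `attrs`) would collect every match and `array` would mark the repeating node, but that is a guess from the field names and should be checked against the project's source.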
2. Writing a template in code
Douban Movie
package main

import (
	"YiSpider/spider"
	"YiSpider/spider/model"
	spider2 "YiSpider/spider/spider"
)

func main() {
	task := &model.Task{
		Id:   "douban-movie",
		Name: "douban-movie",
		Request: []*model.Request{
			{
				Method:      "get",
				Url:         "https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0-10000,20}",
				ProcessName: "movie",
			},
		},
		Process: []model.Process{
			{
				Name: "movie",
				Type: "json",
				JsonRule: model.JsonRule{
					Rule: map[string]string{
						"node":  "array|data",
						"rate":  "rate",
						"star":  "star",
						"id":    "id",
						"url":   "url",
						"title": "title",
						"cover": "cover",
						"casts": "casts",
					},
				},
			},
		},
		Pipline: "file",
	}

	app := spider.New()
	app.AddSpider(spider2.InitWithTask(task))
	app.Run()
}
dilidili anime series
package main

import (
	"YiSpider/spider"
	"YiSpider/spider/model"
	spider2 "YiSpider/spider/spider"
)

func main() {
	task := &model.Task{
		Id:   "dilidili",
		Name: "dilidili",
		Request: []*model.Request{
			{
				Method:      "get",
				Url:         "http://www.dilidili.wang/{gaoxiao|kehuan|yundong|danmei|zhiyuxi|luoli|zhenren|zhuangbi|youxi|tuili|qingchun|kongbu|jizhan|rexue|qingxiaoshuo|maoxian|hougong|qihuan|tongnian|lianai|meishaonv|lizhi|baihe|paomianfan|yinv}/",
				ProcessName: "animelist",
			},
		},
		Process: []model.Process{
			{
				Name: "animelist",
				Type: "template",
				TemplateRule: model.TemplateRule{
					Rule: map[string]string{
						"node":    "array|.anime_list dl",
						"img":     "attr.src|dt a img",
						"title":   "text|dd h3 a",
						"href":    "attr.href|dt a",
						"content": "text|dd div",
						"desc":    "text|dd p",
					},
				},
				AddQueue: []*model.Request{
					{
						Method:      "get",
						Url:         "http://www.dilidili.wang{$href}",
						ProcessName: "animeinfo",
					},
				},
			},
			{
				Name: "animeinfo",
				Type: "template",
				TemplateRule: model.TemplateRule{
					Rule: map[string]string{
						"episode":      "texts|.time_con .swiper-slide .clear li a em",
						"title":        "text|.detail dl dd h1",
						"episode-link": "attrs.href|.time_con .swiper-slide .clear li a",
					},
				},
				AddQueue: []*model.Request{
					{
						Method:      "get",
						Url:         "{$episode-link}",
						ProcessName: "episodeinfo",
					},
				},
			},
			{
				Name: "episodeinfo",
				Type: "template",
				TemplateRule: model.TemplateRule{
					Rule: map[string]string{
						"url":    "attr.href|link[rel=\"canonical\"]",
						"title":  "text|#intro2 h1",
						"player": "attr.src|.player_main iframe",
					},
				},
			},
		},
		Pipline: "file",
	}

	app := spider.New()
	app.AddSpider(spider2.InitWithTask(task))
	app.Run()
}
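The AddQueue URLs in the task above contain `{$href}` and `{$episode-link}` placeholders that are filled from values parsed by the preceding process. A minimal sketch of that substitution follows. The `FillURL` function and the single-valued `map[string]string` input are assumptions for illustration; how YiSpider handles multi-valued fields (such as those produced by `attrs.href`) or the `{0-$count,20}` form is not covered here.

```go
package main

import (
	"fmt"
	"regexp"
)

// Matches {$name}, where name may contain letters, digits, "_" and "-".
var varRe = regexp.MustCompile(`\{\$([A-Za-z0-9_-]+)\}`)

// FillURL replaces each {$name} placeholder in tmpl with fields[name].
// Placeholders whose name is not present in fields are left unchanged.
func FillURL(tmpl string, fields map[string]string) string {
	return varRe.ReplaceAllStringFunc(tmpl, func(m string) string {
		name := m[2 : len(m)-1] // strip the leading "{$" and trailing "}"
		if v, ok := fields[name]; ok {
			return v
		}
		return m
	})
}

func main() {
	fields := map[string]string{"href": "/abc"}
	fmt.Println(FillURL("http://www.dilidili.wang{$href}", fields))
}
```

With `href = "/abc"`, the template `http://www.dilidili.wang{$href}` becomes `http://www.dilidili.wang/abc`, matching the substitution described in the request syntax section.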
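- Writing purely in code
package main

import (
	"encoding/json"

	"YiSpider/spider"
	"YiSpider/spider/model"
	// Assumed import path for the file pipline; adjust it to the package's actual location.
	"YiSpider/spider/pipline/file"
	spider2 "YiSpider/spider/spider"
)

// Movies mirrors the top-level JSON response, whose "data" key holds the list.
type Movies struct {
	Datas []Movie `json:"data"`
}

// Movie mirrors one entry of the "data" array.
type Movie struct {
	Rate  string   `json:"rate"`
	Star  string   `json:"star"` // the response key is "star", as in the templates above
	Id    string   `json:"id"`
	Url   string   `json:"url"`
	Title string   `json:"title"`
	Cover string   `json:"cover"`
	Casts []string `json:"casts"`
}

// PageProcess decodes the response body and emits one result per movie.
type PageProcess struct{}

func (p *PageProcess) Process(context model.Context) (*model.Page, error) {
	movies := Movies{}
	if err := json.Unmarshal(context.Body, &movies); err != nil {
		return nil, err
	}

	page := &model.Page{}
	for _, movie := range movies.Datas {
		page.AddResult(movie)
	}
	return page, nil
}

func main() {
	sp := &spider2.Spider{}
	sp.Name = "douban-movie-code"
	sp.Id = "douban-movie-code"
	sp.Requests = []*model.Request{
		{
			Method:      "get",
			Url:         "https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0-10000,20}",
			ProcessName: "movie",
		},
	}
	// Register the custom process under the name referenced by ProcessName.
	sp.AddProcess("movie", &PageProcess{})
	sp.Pipline = file.NewFilePipline("./")

	app := spider.New()
	app.AddSpider(sp)
	app.Run()
}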