2young2simple / Yispider

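A distributed crawler platform that helps you manage and develop crawlers more easily. It ships with a built-in set of crawler definition rules (templates): you can use a template to define a crawler quickly, or use YiSpider as a framework and write the crawler by hand. (A hobby project; it gets updated whenever using it becomes annoying.)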

YiSpider

A distributed spider platform

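Introduction

A distributed crawler platform that helps you manage and develop crawlers more easily. It ships with a built-in set of crawler definition rules (templates): you can use a template to define a crawler quickly, or use YiSpider as a framework and write the crawler by hand.

Roadmap

  • [x] Added more examples.
  • [x] Implemented a built-in Redis-based scheduler.
  • [ ] The web management UI is in preparation; stay tuned.

Architecture

The framework currently has two parts:

1. Crawler part (spider node):

The internal structure is modeled on Python's Scrapy framework and is built mainly from the schedule, page process, and pipline components. Each spider has its own scheduler and its own context management. Two pipline outputs are built in at present: console and file. Node information is registered in etcd so that the manage node can discover it.

  • core: manages the spider life cycle and context, and runs the spider.
  • schedule: schedules spider requests (a channel-based or Redis-based scheduler).
  • process: processes the results of requests.
  • pipline: outputs results to different destinations, such as the console, files, message queues, or databases.
  • register: handles service registration (only etcd is supported at present).
  • http: provides some HTTP interfaces.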
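
To make the role of the schedule component more concrete, here is a minimal sketch of a channel-based scheduler in Go. The `Request` type, the `Scheduler` interface, and `ChannelScheduler` are hypothetical names written for this illustration; they are not taken from YiSpider's source, whose actual interfaces may differ.

```go
package main

import "fmt"

// Request is a simplified stand-in for the framework's request model.
type Request struct {
	Url         string
	ProcessName string
}

// Scheduler is a hypothetical interface: it queues requests and hands them
// out again. A Redis-backed scheduler could implement the same two methods.
type Scheduler interface {
	Push(r *Request)
	Pop() (*Request, bool)
}

// ChannelScheduler keeps pending requests in a buffered channel.
type ChannelScheduler struct {
	queue chan *Request
}

// NewChannelScheduler creates a scheduler that buffers up to size requests.
func NewChannelScheduler(size int) *ChannelScheduler {
	return &ChannelScheduler{queue: make(chan *Request, size)}
}

// Push enqueues a request; it blocks while the buffer is full.
func (s *ChannelScheduler) Push(r *Request) { s.queue <- r }

// Pop returns the next request without blocking; the boolean is false
// when the queue is empty.
func (s *ChannelScheduler) Pop() (*Request, bool) {
	select {
	case r := <-s.queue:
		return r, true
	default:
		return nil, false
	}
}

func main() {
	var s Scheduler = NewChannelScheduler(8)
	s.Push(&Request{Url: "http://xxx/xxx/1", ProcessName: "movie"})
	s.Push(&Request{Url: "http://xxx/xxx/2", ProcessName: "movie"})
	for {
		r, ok := s.Pop()
		if !ok {
			break
		}
		fmt.Println(r.Url, r.ProcessName)
	}
}
```

Hiding the queue behind an interface is one way a spider could switch between an in-process channel and a shared Redis queue without changing the code that consumes requests.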

2. Management part (manage node):

Manages the spider nodes, using etcd to discover them, and communicates with the spider nodes over HTTP.

Getting started

Examples

The example-spider package contains many examples:

  • Bilibili
  • Dilidili
  • Douban Movie
  • Qdaily
  • JD.com
  • Qyer
  • Qiushibaike
  • Tuicool
  • NetEase Cloud Music

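Request syntax

The Url of an initial request (Request) supports two forms of syntactic sugar for convenience; a third form is available only in AddQueue: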

1. http://xxx/xxx/{begin-end,offset}

start = 0 20 40 ... 10000
url = https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0-10000,20}
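
As a rough illustration of how the `{begin-end,offset}` form could expand into concrete URLs, here is a minimal Go sketch. The helper name `expandRange` and its regular expression are assumptions made for this example and are not taken from YiSpider's source; the end value is treated as inclusive to match the `0 20 40 ... 10000` listing above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// rangePattern matches a placeholder of the form {begin-end,offset},
// for example {0-100,20}.
var rangePattern = regexp.MustCompile(`\{(\d+)-(\d+),(\d+)\}`)

// expandRange is a hypothetical helper: it replaces the first
// {begin-end,offset} placeholder in url with every value begin,
// begin+offset, ... up to and including end. A URL without such a
// placeholder (or with a zero offset) is returned unchanged.
func expandRange(url string) []string {
	loc := rangePattern.FindStringSubmatchIndex(url)
	if loc == nil {
		return []string{url}
	}
	// Conversion errors are ignored because the pattern only admits digits.
	begin, _ := strconv.Atoi(url[loc[2]:loc[3]])
	end, _ := strconv.Atoi(url[loc[4]:loc[5]])
	step, _ := strconv.Atoi(url[loc[6]:loc[7]])
	if step <= 0 {
		return []string{url}
	}
	var urls []string
	for v := begin; v <= end; v += step {
		urls = append(urls, url[:loc[0]]+strconv.Itoa(v)+url[loc[1]:])
	}
	return urls
}

func main() {
	for _, u := range expandRange("http://xxx/xxx/?start={0-60,20}") {
		fmt.Println(u)
	}
}
```

With the placeholder `{0-60,20}` the sketch yields four URLs, one each for start values 0, 20, 40, and 60.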

2. http://xxx/xxx/{aa|bb|cc}

start = 0 20 40 60
url = https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0|20|40|60}
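
The `{aa|bb|cc}` form can be pictured the same way. The following Go sketch uses a hypothetical helper, `expandAlternatives`, written only for this illustration; it handles the first placeholder that contains a `|` and leaves every other URL untouched.

```go
package main

import (
	"fmt"
	"strings"
)

// expandAlternatives is a hypothetical helper: it finds the first {...}
// placeholder in url and, if its content contains "|", returns one URL
// per alternative. Any other URL is returned unchanged.
func expandAlternatives(url string) []string {
	open := strings.Index(url, "{")
	if open < 0 {
		return []string{url}
	}
	end := strings.Index(url[open:], "}")
	if end < 0 {
		return []string{url}
	}
	end += open // convert the relative offset to an index into url
	inner := url[open+1 : end]
	if !strings.Contains(inner, "|") {
		return []string{url}
	}
	options := strings.Split(inner, "|")
	urls := make([]string, 0, len(options))
	for _, opt := range options {
		urls = append(urls, url[:open]+opt+url[end+1:])
	}
	return urls
}

func main() {
	for _, u := range expandAlternatives("http://www.dilidili.wang/{gaoxiao|kehuan|yundong}/") {
		fmt.Println(u)
	}
}
```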

3. http://www.dilidili.wang{$href} (AddQueue only)

If href = "/abc" (href is a field parsed by the process), then
url = http://www.dilidili.wang{$href}
becomes
url = http://www.dilidili.wang/abc

Variables can also be combined with the other forms, for example:
url = https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0-$count,20}
and so on.
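
The `{$name}` substitution can be sketched in a few lines of Go. The helper `fillVars` and its pattern are assumptions for illustration only, not YiSpider's implementation; the sketch covers just the standalone `{$name}` form and does not handle a variable nested inside a range such as `{0-$count,20}`.

```go
package main

import (
	"fmt"
	"regexp"
)

// varPattern matches a standalone variable placeholder such as {$href}
// or {$episode-link}.
var varPattern = regexp.MustCompile(`\{\$([A-Za-z0-9_-]+)\}`)

// fillVars is a hypothetical helper: it replaces every {$name} in url with
// fields[name]. Placeholders whose name is not in fields are left as-is.
func fillVars(url string, fields map[string]string) string {
	return varPattern.ReplaceAllStringFunc(url, func(m string) string {
		name := m[2 : len(m)-1] // strip the leading "{$" and the trailing "}"
		if v, ok := fields[name]; ok {
			return v
		}
		return m
	})
}

func main() {
	fields := map[string]string{"href": "/abc"}
	fmt.Println(fillVars("http://www.dilidili.wang{$href}", fields))
}
```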

Usage examples

1. JSON template

Calling the HTTP interface:
curl -d '{"id":"douban-movie","Name":"douban-movie","request":[{"url":"https://movie.douban.com/j/new_search_subjects?sort=T\u0026range=0,10\u0026tags=\u0026start={0-100,20}","method":"get","type":"","data":null,"header":null,"cookies":{"url":"","data":""},"process_name":"movie"}],"process":[{"name":"movie","reg_url":null,"type":"json","template_rule":{"Rule":null},"json_rule":{"Rule":{"casts":"casts","cover":"cover","id":"id","node":"array|data","rate":"rate","star":"star","title":"title","url":"url"}},"add_queue":null}],"pipline":"file","depth":0,"end_count":0}' "http://127.0.0.1:7774/task/addAndRun"

Douban Movie template

 {
    "id": "douban-movie",
    "Name": "douban-movie",
    "request": [
        {
            "url": "https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0-10,20}",
            "method": "get",
            "process_name": "movie"
        }
    ],
    "process": [
        {
            "name": "movie",
            "type": "json",
            "json_rule": {
                "Rule": {
                    "casts": "casts",
                    "cover": "cover",
                    "id": "id",
                    "node": "array|data",
                    "rate": "rate",
                    "star": "star",
                    "title": "title",
                    "url": "url"
                }
            },
            "add_queue": null
        }
    ],
    "pipline": "file",
    "depth": 0,
    "end_count": 0
}
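
To show how such a template maps onto typed data, the following Go sketch decodes a shortened copy of the Douban template with `encoding/json`. The struct types are simplified stand-ins that mirror only the fields used here; they are assumptions for illustration and are not YiSpider's `model` package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Request, RuleSet, Process, and Task are simplified stand-ins for the
// template structure; fields not listed here (depth, end_count, ...) are
// silently ignored by json.Unmarshal.
type Request struct {
	Url         string `json:"url"`
	Method      string `json:"method"`
	ProcessName string `json:"process_name"`
}

type RuleSet struct {
	Rule map[string]string `json:"Rule"`
}

type Process struct {
	Name     string  `json:"name"`
	Type     string  `json:"type"`
	JsonRule RuleSet `json:"json_rule"`
}

type Task struct {
	Id      string    `json:"id"`
	Name    string    `json:"Name"`
	Request []Request `json:"request"`
	Process []Process `json:"process"`
	Pipline string    `json:"pipline"`
}

// parseTask decodes a JSON template into a Task.
func parseTask(data []byte) (*Task, error) {
	task := &Task{}
	if err := json.Unmarshal(data, task); err != nil {
		return nil, err
	}
	return task, nil
}

// sample is a shortened copy of the Douban Movie template.
const sample = `{
  "id": "douban-movie",
  "Name": "douban-movie",
  "request": [{"url": "https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0-10,20}", "method": "get", "process_name": "movie"}],
  "process": [{"name": "movie", "type": "json", "json_rule": {"Rule": {"node": "array|data", "title": "title"}}, "add_queue": null}],
  "pipline": "file",
  "depth": 0,
  "end_count": 0
}`

func main() {
	task, err := parseTask([]byte(sample))
	if err != nil {
		fmt.Println("invalid template:", err)
		return
	}
	fmt.Println(task.Id, task.Request[0].ProcessName, task.Process[0].JsonRule.Rule["node"])
}
```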

dilidili template

   {
    "id": "dilidili",
    "Name": "dilidili",
    "request": [
        {
            "url": "http://www.dilidili.wang/{gaoxiao|kehuan|yundong|danmei|zhiyuxi|luoli|zhenren|zhuangbi|youxi|tuili|qingchun|kongbu|jizhan|rexue|qingxiaoshuo|maoxian|hougong|qihuan|tongnian|lianai|meishaonv|lizhi|baihe|paomianfan|yinv}/",
            "method": "get",
            "process_name": "animelist"
        }
    ],
    "process": [
        {
            "name": "animelist",
            "type": "template",
            "template_rule": {
                "Rule": {
                    "content": "text|dd div",
                    "desc": "text|dd p",
                    "href": "attr.href|dt a",
                    "img": "attr.src|dt a img",
                    "node": "array|.anime_list dl",
                    "title": "text|dd h3 a"
                }
            },
            "add_queue": [
                {
                    "url": "http://www.dilidili.wang{$href}",
                    "method": "get",
                    "process_name": "animeinfo"
                }
            ]
        },
        {
            "name": "animeinfo",
            "type": "template",
            "template_rule": {
                "Rule": {
                    "episode": "texts|.time_con .swiper-slide .clear li a em",
                    "episode-link": "attrs.href|.time_con .swiper-slide .clear li a",
                    "title": "text|.detail dl dd h1"
                }
            },
            "add_queue": [
                {
                    "url": "{$episode-link}",
                    "method": "get",
                    "process_name": "episodeinfo"
                }
            ]
        },
        {
            "name": "episodeinfo",
            "reg_url": null,
            "type": "template",
            "template_rule": {
                "Rule": {
                    "player": "attr.src|.player_main iframe",
                    "title": "text|#intro2 h1",
                    "url": "attr.href|link[rel=\"canonical\"]"
                }
            },
            "add_queue": null
        }
    ],
    "pipline": "file",
    "depth": 0,
    "end_count": 0
}
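
The rule strings in the templates above appear to follow a `kind|selector` shape, with an optional attribute name after a dot (for example `attr.href|dt a`). That reading is inferred from the examples only. The Go sketch below splits a rule string along those lines; the `Rule` type and `parseRule` helper are hypothetical and are not part of YiSpider.

```go
package main

import (
	"fmt"
	"strings"
)

// Rule is a hypothetical parsed form of a rule string.
type Rule struct {
	Kind     string // e.g. text, texts, attr, attrs, array; empty if there is no "|"
	Attr     string // attribute name for attr/attrs rules, otherwise empty
	Selector string // everything after the first "|", or the whole string
}

// parseRule splits raw at the first "|" into a kind and a selector, and
// splits the kind at the first "." into a kind and an attribute name.
// A string without "|" (such as the JSON rule "casts") is kept whole
// as the selector.
func parseRule(raw string) Rule {
	parts := strings.SplitN(raw, "|", 2)
	if len(parts) == 1 {
		return Rule{Selector: raw}
	}
	kind, attr := parts[0], ""
	if i := strings.Index(kind, "."); i >= 0 {
		kind, attr = kind[:i], kind[i+1:]
	}
	return Rule{Kind: kind, Attr: attr, Selector: parts[1]}
}

func main() {
	for _, raw := range []string{"array|.anime_list dl", "attr.href|dt a", "text|dd h3 a", "casts"} {
		fmt.Printf("%+v\n", parseRule(raw))
	}
}
```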

2. Writing the template in code

Douban Movie

package main

import (
	"YiSpider/spider"
	"YiSpider/spider/model"
	spider2 "YiSpider/spider/spider"
)

func main() {
	// Describe the crawl as a task: one ranged request handled by the "movie" process.
	task := &model.Task{
		Id:   "douban-movie",
		Name: "douban-movie",
		Request: []*model.Request{
			{
				Method:      "get",
				Url:         "https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0-10000,20}",
				ProcessName: "movie",
			},
		},
		Process: []model.Process{
			{
				Name: "movie",
				Type: "json",
				JsonRule: model.JsonRule{
					Rule: map[string]string{
						"node":  "array|data",
						"rate":  "rate",
						"star":  "star",
						"id":    "id",
						"url":   "url",
						"title": "title",
						"cover": "cover",
						"casts": "casts",
					},
				},
			},
		},
		Pipline: "file",
	}

	// Wrap the task in a spider, register it, and start the application.
	app := spider.New()
	app.AddSpider(spider2.InitWithTask(task))
	app.Run()
}

dilidili anime series

package main

import (
	"YiSpider/spider"
	"YiSpider/spider/model"
	spider2 "YiSpider/spider/spider"
)

func main() {
	// Three chained processes: category list -> series page -> episode page.
	task := &model.Task{
		Id:   "dilidili",
		Name: "dilidili",
		Request: []*model.Request{
			{
				Method:      "get",
				Url:         "http://www.dilidili.wang/{gaoxiao|kehuan|yundong|danmei|zhiyuxi|luoli|zhenren|zhuangbi|youxi|tuili|qingchun|kongbu|jizhan|rexue|qingxiaoshuo|maoxian|hougong|qihuan|tongnian|lianai|meishaonv|lizhi|baihe|paomianfan|yinv}/",
				ProcessName: "animelist",
			},
		},
		Process: []model.Process{
			{
				Name: "animelist",
				Type: "template",
				TemplateRule: model.TemplateRule{
					Rule: map[string]string{
						"node":    "array|.anime_list dl",
						"img":     "attr.src|dt a img",
						"title":   "text|dd h3 a",
						"href":    "attr.href|dt a",
						"content": "text|dd div",
						"desc":    "text|dd p",
					},
				},
				// Queue one follow-up request per parsed href.
				AddQueue: []*model.Request{
					{
						Method:      "get",
						Url:         "http://www.dilidili.wang{$href}",
						ProcessName: "animeinfo",
					},
				},
			},
			{
				Name: "animeinfo",
				Type: "template",
				TemplateRule: model.TemplateRule{
					Rule: map[string]string{
						"episode":      "texts|.time_con .swiper-slide .clear li a em",
						"title":        "text|.detail dl dd h1",
						"episode-link": "attrs.href|.time_con .swiper-slide .clear li a",
					},
				},
				// Queue one follow-up request per parsed episode link.
				AddQueue: []*model.Request{
					{
						Method:      "get",
						Url:         "{$episode-link}",
						ProcessName: "episodeinfo",
					},
				},
			},
			{
				Name: "episodeinfo",
				Type: "template",
				TemplateRule: model.TemplateRule{
					Rule: map[string]string{
						"url":    "attr.href|link[rel=\"canonical\"]",
						"title":  "text|#intro2 h1",
						"player": "attr.src|.player_main iframe",
					},
				},
			},
		},
		Pipline: "file",
	}

	// Wrap the task in a spider, register it, and start the application.
	app := spider.New()
	app.AddSpider(spider2.InitWithTask(task))
	app.Run()
}

3. Writing the spider in pure code

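package main

import (
	"encoding/json"

	"YiSpider/spider"
	"YiSpider/spider/model"
	// Assumed import path for the file pipline package; adjust it to the repository layout.
	"YiSpider/spider/pipline/file"
	spider2 "YiSpider/spider/spider"
)

// Movies mirrors the top-level JSON response, whose "data" field holds the movie list.
type Movies struct {
	Datas []Movie `json:"data"`
}

// Movie holds the fields extracted for a single movie.
type Movie struct {
	Rate  string   `json:"rate"`
	Star  string   `json:"star"`
	Id    string   `json:"id"`
	Url   string   `json:"url"`
	Title string   `json:"title"`
	Cover string   `json:"cover"`
	Casts []string `json:"casts"`
}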

type PageProcess struct{}

func (p *PageProcess) Process(context model.Context) (*model.Page, error) {
	movies := Movies{}
	if err := json.Unmarshal(context.Body, &movies); err != nil {
		return nil, err
	}
	page := &model.Page{}
	for _, movie := range movies.Datas {
		page.AddResult(movie)
	}
	return page, nil
}

func main() {
	sp := &spider2.Spider{}
	sp.Name = "douban-movie-code"
	sp.Id = "douban-movie-code"
	sp.Requests = []*model.Request{
		{
			Method:      "get",
			Url:         "https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0-10000,20}",
			ProcessName: "movie",
		},
	}
	sp.AddProcess("movie", &PageProcess{})
	sp.Pipline = file.NewFilePipline("./")

	app := spider.New()
	app.AddSpider(sp)
	app.Run()
}
