hu17889 / Go_spider

License: MPL-2.0
[Crawler framework (golang)] An awesome Go concurrent crawler (spider) framework. The crawler is flexible and modular. It can easily be extended into an individualized crawler, or you can use the default crawl components as-is.

Programming Languages

go
31211 projects - #10 most used programming language

Projects that are alternatives to or similar to Go_spider

Nodespider
[DEPRECATED] Simple, flexible, delightful web crawler/spider package
Stars: ✭ 33 (-98.11%)
Mutual labels:  crawler, spider, pipeline
Baiduspider
BaiduSpider, a crawler for Baidu search results; it currently supports Baidu web, image, Zhidao (Q&A), video, news, Wenku (document), Jingyan (how-to), and Baike (encyclopedia) search.
Stars: ✭ 105 (-93.98%)
Mutual labels:  crawler, spider
Not Your Average Web Crawler
A web crawler (for bug hunting) that gathers more than you can imagine.
Stars: ✭ 107 (-93.87%)
Mutual labels:  crawler, spider
Bilibili member crawler
A Bilibili user crawler. Yay~ it's a crawler!
Stars: ✭ 115 (-93.41%)
Mutual labels:  crawler, spider
Mm131
An image scraper for the MM131 website 🚨
Stars: ✭ 129 (-92.61%)
Mutual labels:  crawler, spider
Crawler Detect
🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent
Stars: ✭ 1,549 (-11.23%)
Mutual labels:  crawler, spider
Douban Movie
A Golang crawler that scrapes the Douban Movie Top 250.
Stars: ✭ 114 (-93.47%)
Mutual labels:  crawler, spider
Gopa Abandoned
GOPA, a spider written in Go. (NOTE: this project moved to https://github.com/infinitbyte/gopa)
Stars: ✭ 98 (-94.38%)
Mutual labels:  crawler, spider
Decryptlogin
APIs for logging in to various websites using requests.
Stars: ✭ 1,861 (+6.65%)
Mutual labels:  crawler, spider
Free proxy website
A collection of websites that provide free SOCKS/HTTPS/HTTP proxies.
Stars: ✭ 119 (-93.18%)
Mutual labels:  crawler, spider
Pspider
A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
Stars: ✭ 1,611 (-7.68%)
Mutual labels:  crawler, spider
Skycaiji
SkyCaiji (Blue Sky Collector) is a free data collection and publishing crawler built with PHP and MySQL. It can be deployed on cloud servers, can collect almost every type of web page, integrates seamlessly with all kinds of CMS platforms, publishes data in real time without logging in, and runs fully automatically without manual intervention. A completely cross-platform cloud crawler system among web big-data collection tools.
Stars: ✭ 1,514 (-13.24%)
Mutual labels:  crawler, spider
Ruia
Async Python 3.6+ web scraping micro-framework based on asyncio
Stars: ✭ 1,366 (-21.72%)
Mutual labels:  crawler, spider
Digger
Digger is a powerful and flexible web crawler implemented in pure Go
Stars: ✭ 130 (-92.55%)
Mutual labels:  crawler, spider
Douyinsdk
A Douyin (Chinese TikTok) SDK; data collection and crawling made easy.
Stars: ✭ 99 (-94.33%)
Mutual labels:  crawler, spider
Pkulaw spider
A crawler for the PKULaw legal database: http://www.pkulaw.cn/Case/
Stars: ✭ 113 (-93.52%)
Mutual labels:  crawler, spider
Scrapy demo
all kinds of scrapy demo
Stars: ✭ 128 (-92.66%)
Mutual labels:  spider, pipeline
Puppeteer Walker
a puppeteer walker 🕷 🕸
Stars: ✭ 78 (-95.53%)
Mutual labels:  crawler, spider
Geziyor
Geziyor, a fast web crawling & scraping framework for Go. Supports JS rendering.
Stars: ✭ 1,246 (-28.6%)
Mutual labels:  crawler, spider
Examples Of Web Crawlers
Some very interesting Python crawler examples, friendly to beginners; they mainly crawl Taobao, Tmall, WeChat, Douban, QQ, and similar sites.
Stars: ✭ 10,724 (+514.56%)
Mutual labels:  crawler, spider

go_spider

A crawler for vertical communities, implemented in Go.

Latest stable Release: Version 1.2 (Sep 23, 2014).

  • go_spider discussion group (QQ group ID: 337344607)

Features

  • Concurrent
  • Fit for vertical communities
  • Flexible, Modular
  • Native Go implementation
  • Can be expanded to an individualized crawler easily

Requirements

  • Go 1.2 or higher

Documentation

Documentation (Chinese) && FAQ.

Installation

go get github.com/hu17889/go_spider
go get github.com/PuerkitoBio/goquery
go get github.com/bitly/go-simplejson
go get golang.org/x/net/html/charset

This project depends on simplejson and goquery.

If you are in China, you can download these packages from http://gopm.io/.

Usage example

Here is an example that crawls GitHub content. You can try out the crawl process yourself:

  • go install github.com/hu17889/go_spider/example/github_repo_page_processor
  • ./bin/github_repo_page_processor

More examples here: examples.

Make your spider

    // Spider inputs:
    //   a PageProcesser;
    //   a task name, used by the Pipeline for record keeping.
    spider.NewSpider(NewMyPageProcesser(), "TaskName").
        AddUrl("https://github.com/hu17889?tab=repositories", "html"). // start URL; "html" is the response type ("html" or "json")
        AddPipeline(pipeline.NewPipelineConsole()).                    // print results to the screen
        SetThreadnum(3).                                               // crawl requests with three goroutines
        Run()
  • Use default modules

  • Downloader: HttpDownloader

  • Scheduler: QueueScheduler

  • Pipeline: PipelineConsole, PipelineFile

  • Use your own modules

Just copy a default module and modify it!

If you write a Downloader module, plug it in with Spider.SetDownloader(your_downloader).

If you write a Pipeline module, plug it in with Spider.AddPipeline(your_pipeline).

If you write a Scheduler module, plug it in with Spider.SetScheduler(your_scheduler), as in the sketch below.
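
For illustration, here is a minimal wiring sketch, assuming the setters chain like the example above; the NewMy* constructors are hypothetical placeholders for your own implementations:

    spider.NewSpider(NewMyPageProcesser(), "TaskName").
        SetDownloader(NewMyDownloader()). // hypothetical custom Downloader
        SetScheduler(NewMyScheduler()).   // hypothetical custom Scheduler
        AddPipeline(NewMyPipeline()).     // hypothetical custom Pipeline
        AddUrl("https://github.com/hu17889?tab=repositories", "html").
        Run()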

Extensions

The extensions folder contains modules and other tools shared by contributors. You are welcome to push your own code, provided it is free of bugs.

Modules

Spider

Summary: crawler initialization, concurrency management, default modules, module management, and config settings.

Functions:

  • Crawler startup functions: Get, GetAll, Run
  • Add requests: AddUrl, AddUrls, AddRequest, AddRequests
  • Set main modules: AddPipeline (several pipeline modules may be added), SetScheduler, SetDownloader
  • Set config: SetExitWhenComplete, SetThreadnum (concurrency level), SetSleepTime (sleep time after each crawl); a configuration sketch follows this list
  • Monitoring: OpenFileLog, OpenFileLogDefault (enable file logging via the mlog package), CloseFileLog, OpenStrace (print tracing info to stderr), CloseStrace
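
Below is a minimal configuration sketch using the setters listed above. The exact parameter shapes (the bool for SetExitWhenComplete, whether every setter chains) are assumptions, so check the spider package for the real signatures:

    // NewMyPageProcesser is the processer constructor from the example above.
    sp := spider.NewSpider(NewMyPageProcesser(), "config_demo").
        AddUrl("https://example.com/", "html").
        SetThreadnum(5)              // crawl with five concurrent goroutines
    sp.SetExitWhenComplete(true)     // assumed bool flag: exit once the queue drains
    sp.OpenStrace()                  // print tracing info on stderr
    sp.Run()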

Downloader

Summary: the Spider takes from the Scheduler a Request that holds a url to be crawled, and the Downloader downloads the result (html, json, jsonp, or text) of that Request. The result is saved in a Page for parsing in the PageProcesser. HTML parsing is based on the goquery package, JSON parsing on the simplejson package; jsonp is converted to json, and the text type holds plain text that is not parsed.

Functions:

  • Download: downloads the content of the crawl target. The result contains the data body, headers, cookies, and request info. (A custom Downloader sketch follows.)
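
As an illustration, here is a sketch of a custom Downloader that delegates to the default HttpDownloader and logs each fetch. The Download(req) *page.Page signature, the NewHttpDownloader constructor, and the GetUrl accessor are assumptions modeled on the default module; verify them against the downloader and request packages:

    import (
        "log"

        "github.com/hu17889/go_spider/core/common/page"
        "github.com/hu17889/go_spider/core/common/request"
        "github.com/hu17889/go_spider/core/downloader"
    )

    // LoggingDownloader wraps the default HttpDownloader and logs every fetch.
    type LoggingDownloader struct {
        inner *downloader.HttpDownloader
    }

    func NewLoggingDownloader() *LoggingDownloader {
        return &LoggingDownloader{inner: downloader.NewHttpDownloader()}
    }

    // Download logs the target url, then delegates the actual HTTP work.
    func (d *LoggingDownloader) Download(req *request.Request) *page.Page {
        log.Printf("fetching %s", req.GetUrl()) // GetUrl is an assumed accessor
        return d.inner.Download(req)
    }

Register it with Spider.SetDownloader(NewLoggingDownloader()).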

PageProcesser

Summary: the PageProcesser module only parses results. It extracts results (key-value pairs) and the urls to be crawled in the next step. The key-value pairs are saved in PageItems, and the urls are pushed into the Scheduler.

Functions:

  • Process: parses the crawled target (see the sketch below).
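
Here is a sketch of a custom PageProcesser, modeled on the bundled examples. The Process(*page.Page)/Finish() pair, the GetHtmlParser return type, and the AddTargetRequest(url, respType) signature are assumptions drawn from this README, so check the example processers for the exact shapes:

    import (
        "github.com/PuerkitoBio/goquery"
        "github.com/hu17889/go_spider/core/common/page"
    )

    type MyPageProcesser struct{}

    func NewMyPageProcesser() *MyPageProcesser { return &MyPageProcesser{} }

    // Process extracts one result field and queues the page's links.
    func (pp *MyPageProcesser) Process(p *page.Page) {
        if !p.IsSucc() { // skip pages the Downloader failed to fetch
            return
        }
        doc := p.GetHtmlParser() // goquery document built by the Downloader
        p.AddField("title", doc.Find("title").Text()) // key-value result for the Pipeline
        doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
            if href, ok := s.Attr("href"); ok {
                p.AddTargetRequest(href, "html") // crawl this link in the next stage
            }
        })
    }

    // Finish runs once after the crawl ends.
    func (pp *MyPageProcesser) Finish() {}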

Page

Summary: stores the information and results of one request.

Functions:

  • Get results: GetJson, GetHtmlParser, GetBodyStr (plain text)
  • Get information about the target: GetRequest, GetCookies, GetHeader
  • Get the status of the crawl process: IsSucc (whether the download succeeded), Errormsg (the error info recorded in the Downloader)
  • Set config: SetSkip, GetSkip (if skip is true, the result is not output by the Pipeline), AddTargetRequest, AddTargetRequests (save urls to be crawled in the next stage), AddTargetRequestWithParams, AddTargetRequestsWithParams, AddField (save a key-value pair produced by parsing); a JSON-handling sketch follows this list
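
For a JSON crawl the same hooks apply. This sketch assumes the url was added with response type "json" and that GetJson returns a go-simplejson value; verify the return shape in the page package:

    import (
        "fmt"

        "github.com/hu17889/go_spider/core/common/page"
    )

    type MyJsonProcesser struct{}

    func (pp *MyJsonProcesser) Process(p *page.Page) {
        if !p.IsSucc() {
            fmt.Println(p.Errormsg()) // error recorded by the Downloader
            return
        }
        name, _ := p.GetJson().Get("name").String() // simplejson accessor
        if name == "" {
            p.SetSkip(true) // skipped pages are not handed to the Pipeline
            return
        }
        p.AddField("name", name) // key-value pair for the Pipeline
    }

    func (pp *MyJsonProcesser) Finish() {}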

Scheduler

Summary: the Scheduler module is a Request queue. URLs parsed in the PageProcesser are pushed onto this queue.

Functions:

  • Push
  • Poll
  • Count
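
A minimal FIFO Scheduler sketch; the Push/Poll/Count signatures are inferred from the function list above, with the default QueueScheduler as the reference, so treat the exact types as assumptions:

    import (
        "sync"

        "github.com/hu17889/go_spider/core/common/request"
    )

    // FIFOScheduler is a mutex-guarded in-memory request queue.
    type FIFOScheduler struct {
        mu    sync.Mutex
        queue []*request.Request
    }

    func NewFIFOScheduler() *FIFOScheduler { return &FIFOScheduler{} }

    func (s *FIFOScheduler) Push(req *request.Request) {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.queue = append(s.queue, req)
    }

    func (s *FIFOScheduler) Poll() *request.Request {
        s.mu.Lock()
        defer s.mu.Unlock()
        if len(s.queue) == 0 {
            return nil // assumed convention: nil signals an empty queue
        }
        req := s.queue[0]
        s.queue = s.queue[1:]
        return req
    }

    func (s *FIFOScheduler) Count() int {
        s.mu.Lock()
        defer s.mu.Unlock()
        return len(s.queue)
    }

Plug it in with Spider.SetScheduler(NewFIFOScheduler()).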

Pipeline

Summary: the Pipeline module outputs the results and saves them wherever you want. The default modules are PipelineConsole (output to stdout) and PipelineFile (output to a file).

Functions:

  • Process
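
Here is a sketch of a custom Pipeline modeled on PipelineConsole. The Process(items, task) signature, the com_interfaces package path, and the Taskname/GetAll accessors are assumptions, so check the pipeline package for the real interface:

    import (
        "fmt"

        "github.com/hu17889/go_spider/core/common/com_interfaces"
        "github.com/hu17889/go_spider/core/common/page_items"
    )

    // FieldCountPipeline prints how many fields each page produced.
    type FieldCountPipeline struct{}

    func NewFieldCountPipeline() *FieldCountPipeline { return &FieldCountPipeline{} }

    // Process receives the parsed key-value pairs of one page.
    func (pl *FieldCountPipeline) Process(items *page_items.PageItems, t com_interfaces.Task) {
        // Taskname and GetAll are assumed accessors on Task and PageItems.
        fmt.Printf("task %s produced %d fields\n", t.Taskname(), len(items.GetAll()))
    }

Attach it with Spider.AddPipeline(NewFieldCountPipeline()).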

Request

Summary: the Request module holds the configuration of an HTTP request, such as its url, headers, and cookies.

Functions:

  • Process

License

go_spider is licensed under the Mozilla Public License Version 2.0.

Mozilla summarizes the license scope as follows:

MPL: The copyleft applies to any files containing MPLed code.

That means:

  • You can use the unchanged source code in both private and commercial projects
  • You need not publish the source code of your own library as long as the files licensed under the MPL 2.0 remain unchanged
  • You must publish the source code of any changed files licensed under the MPL 2.0 under a) the MPL 2.0 itself or b) a compatible license (e.g. GPL 3.0 or Apache License 2.0)

Please read the MPL 2.0 FAQ if you have further questions regarding the license.

You can read the full terms here: LICENSE.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].