Tieba_Spider

Readme(EN)

A crawler for Baidu Tieba forums.

Dependencies

Python >= 3.6

mysql >= 5.5

beautifulsoup4 >= 4.6.0

scrapy >= 2.4

mysqlclient >= 1.3.10

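Usage

First open config.json and fill in the database host, username, and password. Then run the command directly:

scrapy run <tieba name> <database name> <options>

The tieba name omits the trailing “吧” character, and the database name is the database the data will be stored in. The database is created before crawling starts. For example: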

scrapy run 仙五前修改 Pal5Q_Diy

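If you type Chinese (non-ASCII characters) in the console, make sure the console encoding is UTF-8.

If the tieba name and its database name are already configured in config.json, the database name can be omitted. If the tieba name is omitted as well, the DEFAULT entry in config.json is crawled.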
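
The fallback rules above can be sketched as a small lookup. This is only an illustration: the key names DEFAULT and DATABASES and the helper resolve_target are hypothetical, and the real layout of config.json may differ.

```python
import json

# Hypothetical config layout; the real config.json keys may differ.
EXAMPLE_CONFIG = json.loads("""
{
  "DEFAULT": "仙五前修改",
  "DATABASES": {"仙五前修改": "Pal5Q_Diy"}
}
""")


def resolve_target(config, tieba=None, dbname=None):
    """Return (tieba, dbname), filling in omitted values from the config."""
    if tieba is None:
        # No tieba given on the command line: fall back to the default one.
        tieba = config["DEFAULT"]
    if dbname is None:
        # No database given: it must already be configured for this tieba.
        try:
            dbname = config["DATABASES"][tieba]
        except KeyError:
            raise ValueError(f"no database configured for {tieba!r}") from None
    return tieba, dbname


print(resolve_target(EXAMPLE_CONFIG))
print(resolve_target(EXAMPLE_CONFIG, "仙五前修改"))
```

An explicit database name always wins over the configured one, which matches the first usage form shown earlier.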

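Important: once a task is interrupted, it cannot be resumed. When starting a task over SSH, keep the connection open, or run the task as a background job or inside screen.

Options

Short form	Long form	Arguments	Effect	Example
-p	--pages	2	Set the first and last page of threads to crawl	scrapy run ... -p 2 5
-g	--good_only	0	Crawl featured threads only	scrapy run ... -g
-s	--see_lz	0	Original poster only, i.e. skip floors not written by the thread starter	scrapy run ... -s
-f	--filter	1	Set the name of the thread filter function (see `filter.py`)	scrapy run ... -f thread_filter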
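
To make the argument counts in the table concrete, here is a standalone argparse sketch that accepts the same options. It is not the project's actual command implementation (which runs inside Scrapy); build_parser and the sample argument list are illustrative assumptions.

```python
import argparse


def build_parser():
    """Standalone parser mirroring the option table; not the project's own code."""
    parser = argparse.ArgumentParser(prog="scrapy run")
    parser.add_argument("tieba", nargs="?", help="tieba name without the trailing 吧")
    parser.add_argument("database", nargs="?", help="database to store the data in")
    # -p takes exactly two integers: first page and last page.
    parser.add_argument("-p", "--pages", nargs=2, type=int, metavar=("BEGIN", "END"))
    # -g and -s take no arguments, so they can be combined as -gs.
    parser.add_argument("-g", "--good_only", action="store_true")
    parser.add_argument("-s", "--see_lz", action="store_true")
    # -f takes one argument: the name of a function in filter.py.
    parser.add_argument("-f", "--filter", metavar="NAME")
    return parser


args = build_parser().parse_args(["仙五前修改", "Pal5Q_Diy", "-p", "2", "5", "-g"])
print(args.tieba, args.database, args.pages, args.good_only, args.see_lz, args.filter)
```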

Example:

scrapy run 仙剑五外传 -gs -p 5 12 -f thread_filter

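This crawls, in original-poster-only mode, the threads on pages 5 through 12 of the featured threads in the 仙剑五外传 tieba. Threads that pass the thread_filter function in filter.py are stored in the database together with their content.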
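
As a rough idea of what such a filter could look like, the sketch below assumes the filter receives the columns of the thread table as arguments and returns True to keep a thread. That signature, the threshold, and the sample threads are assumptions for illustration; check filter.py for the real interface.

```python
def thread_filter(id, title, author, reply_num, good):
    """Keep featured threads and any thread with at least 10 replies.

    The parameter list is assumed to mirror the columns of the thread table.
    """
    return bool(good) or reply_num >= 10


# Made-up sample threads, only to show the filter being applied.
sample_threads = [
    {"id": 1, "title": "example A", "author": "user1", "reply_num": 3, "good": False},
    {"id": 2, "title": "example B", "author": "user2", "reply_num": 25, "good": False},
    {"id": 3, "title": "example C", "author": "user3", "reply_num": 0, "good": True},
]

kept = [t["id"] for t in sample_threads if thread_filter(**t)]
print(kept)
```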

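Data processing

Crawled data is not stored verbatim; it is processed first.

Ad floors are removed (floors with the word “广告” in the bottom-right corner).
Bold and red text is flattened to plain text (via beautifulsoup's get_text).
Common emoticons are converted to text (see emotion.json; additions are welcome).
Images and videos become their corresponding links (getting a video link requires following a 302 response).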
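
The text-flattening, emoticon, and image steps can be sketched with the standard library. Note that this uses html.parser as a stand-in rather than beautifulsoup, skips ad removal and the video redirect, and relies on a hypothetical EMOTICONS table keyed by image file name; the real emotion.json may be organized differently.

```python
from html.parser import HTMLParser

# Hypothetical emoticon table; the project keeps its own mapping in emotion.json.
EMOTICONS = {"image_emoticon1.png": "[smile]"}


class FloorFlattener(HTMLParser):
    """Flatten a floor's HTML: keep text, map emoticons to text, turn images into links."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")
        elif tag == "img":
            src = dict(attrs).get("src") or ""
            name = src.rsplit("/", 1)[-1]
            # Known emoticons become text; any other image becomes its URL.
            self.parts.append(EMOTICONS.get(name, src))
        # Styling tags such as <strong> or <span> are dropped; only their text survives.

    def handle_data(self, data):
        self.parts.append(data)


def flatten(html):
    parser = FloorFlattener()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


print(flatten('<strong>Hi</strong> <span style="color:red">there</span>'
              '<img src="https://example.com/e/image_emoticon1.png">'))
```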

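Storage schema

thread

Basic information about each thread.

Column	Type	Notes
id	BIGINT(12)	The ID of "http://tieba.baidu.com/p/4778655068" is 4778655068
title	VARCHAR(100)
author	VARCHAR(30)
reply_num	INT(4)	Number of replies (including nested replies, excluding floor 1)
good	BOOL	Whether the thread is featured

post

Basic information about each floor, including floor 1.

Column	Type	Notes
id	BIGINT(12)	Floors have their own IDs too
floor	INT(4)	Floor number
author	VARCHAR(30)
content	TEXT	Floor content
time	DATETIME	Time posted
comment_num	INT(4)	Number of nested replies
thread_id	BIGINT(12)	ID of the thread the floor belongs to (foreign key)

comment

Information about nested replies (replies inside a floor).

Column	Type	Notes
id	BIGINT(12)	Nested replies also have IDs, drawn from the same ID space as floors
author	VARCHAR(30)
content	TEXT	Nested reply content
time	DATETIME	Time posted
post_id	BIGINT(12)	ID of the floor the nested reply belongs to (foreign key)

Because of how crawling works, a comment may be crawled before its post, which would cause a foreign key error. For this reason, the database's foreign key checks are turned off when the task starts.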
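
The ordering problem can be reproduced in a few lines. The sketch below uses the standard library's sqlite3 as a stand-in for MySQL, with trimmed-down post and comment tables; in MySQL the usual switch is SET FOREIGN_KEY_CHECKS = 0, though whether the project issues exactly that statement is an assumption.

```python
import sqlite3

# Trimmed-down versions of the post and comment tables (SQLite types, not MySQL).
SCHEMA = """
CREATE TABLE post (id INTEGER PRIMARY KEY, content TEXT);
CREATE TABLE comment (
    id INTEGER PRIMARY KEY,
    content TEXT,
    post_id INTEGER REFERENCES post(id)
);
"""


def insert_comment_before_post(enforce_foreign_keys):
    """Insert a comment before its post; return True if the database accepted it."""
    conn = sqlite3.connect(":memory:")
    # SQLite's equivalent of toggling MySQL's foreign key checks.
    conn.execute(f"PRAGMA foreign_keys = {'ON' if enforce_foreign_keys else 'OFF'}")
    conn.executescript(SCHEMA)
    try:
        conn.execute("INSERT INTO comment VALUES (2, 'nested reply', 1)")
        conn.execute("INSERT INTO post VALUES (1, 'floor content')")
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # With checks on, the comment is rejected because post 1 does not exist yet.
        return False
    finally:
        conn.close()


print(insert_comment_before_post(True))
print(insert_comment_before_post(False))
```

With checks enabled the out-of-order insert is rejected; with checks disabled both rows are stored and the reference becomes valid once the post arrives.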

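Timing reference

Crawl time depends on server bandwidth and on the time of day. Below are the times my Alibaba Cloud server took to crawl several tiebas, for reference only.

Tieba	Threads	Replies	Nested replies	Time (s)
pandakill	3638	41221	50206	222.2
lyingman	11290	122662	126670	718.9
仙剑五外传	67356	1262705	807435	7188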
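
For a rough sense of throughput, the figures in the table can be turned into records per second. The helper below simply adds up threads, replies, and nested replies and divides by the elapsed time; treating all three kinds of record as equal is a simplifying assumption.

```python
# Rows copied from the table above: (tieba, threads, replies, nested replies, seconds).
RUNS = [
    ("pandakill", 3638, 41221, 50206, 222.2),
    ("lyingman", 11290, 122662, 126670, 718.9),
    ("仙剑五外传", 67356, 1262705, 807435, 7188),
]


def records_per_second(threads, replies, comments, seconds):
    """Total number of stored records divided by the crawl time in seconds."""
    return (threads + replies + comments) / seconds


for name, *counts, seconds in RUNS:
    print(f"{name}: {records_per_second(*counts, seconds):.1f} records/s")
```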

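The following tiebas were crawled at the same time:

Tieba	Threads	Replies	Nested replies	Time (s)
仙五前修改	530	3518	7045	79.02
仙剑3高难度	2080	21293	16185	274.6
古剑高难度	1703	26086	32941	254.0

Important: keep an eye on how much disk space the crawled data takes up, so that the disk does not fill up.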

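Changelog

After updating, delete the existing log file spider.log first.

2020-08-09: Fixed the issue where only the first 10 nested replies of a floor were crawled. Note: since Python 2 is no longer officially supported, later versions are not guaranteed to run on Python 2.

2020-02-23: Fixed the issue where Baidu identified the crawler and returned 403.

2018-06-13: Added Python 3 support. Uninstall the old Python library mysql-python and use mysqlclient instead.

2017-03-23: Changed the argument form of the page option; added original-poster-only mode, featured-only mode, and custom thread filters.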

Questions and suggestions are welcome; leave a message on my homepage.
