wycm / Zhihu Crawler

Licence: other

zhihu-crawler is a high-performance, horizontally scalable, distributed crawler project written in Java, with support for a free HTTP proxy pool.

Labels

crawler spider zhihu

Projects that are alternatives of or similar to Zhihu Crawler

APIs for logging in to some websites using requests.

Stars: ✭ 1,861 (+109.1%)

Mutual labels: zhihu, crawler, spider

A multithreaded Zhihu user crawler, based on Python 3

Stars: ✭ 201 (-77.42%)

Mutual labels: zhihu, crawler, spider

Simulated Zhihu login, with support for extracting CAPTCHAs and saving cookies

Stars: ✭ 340 (-61.8%)

Mutual labels: zhihu, crawler, spider

Gospider - Fast web spider written in Go

Stars: ✭ 785 (-11.8%)

Mutual labels: crawler, spider

The Prime Cross Site Request Forgery (CSRF) Audit and Exploitation Toolkit.

Stars: ✭ 532 (-40.22%)

Mutual labels: crawler, spider

A Facebook crawler

Stars: ✭ 536 (-39.78%)

Mutual labels: crawler, spider

Basic Python practice code and assorted crawler code

Stars: ✭ 451 (-49.33%)

Mutual labels: crawler, spider

Dark Web OSINT Tool

Stars: ✭ 821 (-7.75%)

Mutual labels: crawler, spider

NetDiscovery is a general-purpose crawler framework/middleware built on Vert.x, RxJava 2, and other frameworks.

Stars: ✭ 573 (-35.62%)

Mutual labels: crawler, spider

A high performance web crawler in Elixir.

Stars: ✭ 781 (-12.25%)

Mutual labels: crawler, spider

A multi-thread crawler framework with many builtin image crawlers provided.

Stars: ✭ 629 (-29.33%)

Mutual labels: crawler, spider

A look at the Golang job market

Stars: ✭ 526 (-40.9%)

Mutual labels: crawler, spider

💖 Highly available distributed IP proxy pool, powered by Scrapy and Redis

Stars: ✭ 4,993 (+461.01%)

Mutual labels: crawler, spider

A distributed web crawler framework (XXL-CRAWLER).

Stars: ✭ 561 (-36.97%)

Mutual labels: crawler, spider

Awesome Crawler

A collection of awesome web crawlers and spiders in different languages

Stars: ✭ 4,793 (+438.54%)

Mutual labels: crawler, spider

API of DouYin for Humans, used to crawl popular videos and music

Stars: ✭ 580 (-34.83%)

Mutual labels: crawler, spider

Baiduimagespider

A super lightweight Baidu image crawler

Stars: ✭ 591 (-33.6%)

Mutual labels: crawler, spider

🐾 Creeper - The Next Generation Crawler Framework (Go)

Stars: ✭ 762 (-14.38%)

Mutual labels: crawler, spider

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Stars: ✭ 680 (-23.6%)

Mutual labels: crawler, spider

Main-content extraction for HTML web pages

Stars: ✭ 441 (-50.45%)

Mutual labels: crawler, spider

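Zhihu Crawler

zhihu-crawler is a high-performance, horizontally scalable, distributed crawler project written in Java, with support for a free HTTP proxy pool. Its main purpose is to crawl Zhihu users, topics, questions, answers, articles, and other data. If you find it useful, please give it a star.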

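Crawl results

The original chart shows simple statistics over the data of 1.17 million crawled Zhihu users.
For detailed statistics, see https://www.vwycm.cn/zhihu/charts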

Requirements

JDK 1.8
Redis
MongoDB

Quick start

Edit the Redis and MongoDB settings in zhihu/src/main/resources/application.yaml.
Initialize MongoDB with the script zhihu/src/main/resources/mongo-init.sql.
Set the log path in logback-spring.xml; the default is /var/www/logs.
Run ZhihuCrawlerApplication.java.

API used

URL: https://www.zhihu.com/api/v4/members/${userid}/followees
Request method: GET
Request parameters

Parameter	Type	Required	Value	Description
include	String	Yes	`data[*]answer_count,articles_count`	Fields to return (more fields can be added to this value as needed; see the example URL below)
offset	int	Yes	0	Offset (adjusting this value pages through the profiles of `all users followed` by a given user)
limit	int	Yes	20	Number of users returned (maximum 20; larger values have no effect)

Example URL: https://www.zhihu.com/api/v4/members/wo-yan-chen-mo/followees?include=data[*].educations,employments,answer_count,business,locations,articles_count,follower_count,gender,following_count,question_count,voteup_count,thanked_count,is_followed,is_following,badge[?(type=best_answerer)].topics&offset=0&limit=20
Response: JSON data containing the profiles of the followed users
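
The paging rule in the parameter table (offset moves in steps, limit is capped at 20) can be sketched as a small URL builder. This is only an illustration: the class FolloweesUrls and its methods are hypothetical names, not code from the project, and the include string is passed through unencoded in the same shape as the example URL. The followingCount argument is assumed to come from the following_count field of a previously fetched profile.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper (not project code): builds followees URLs page by page. */
final class FolloweesUrls {
    static final String BASE = "https://www.zhihu.com/api/v4/members/";
    // Per the parameter table, at most 20 users are returned per request.
    static final int MAX_LIMIT = 20;

    /** Builds the URL for one page; limits above MAX_LIMIT are clamped to it. */
    static String pageUrl(String userId, String include, int offset, int limit) {
        if (offset < 0 || limit <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and limit must be > 0");
        }
        int effectiveLimit = Math.min(limit, MAX_LIMIT);
        // The include value is appended as-is, mirroring the example URL.
        return BASE + userId + "/followees?include=" + include
                + "&offset=" + offset + "&limit=" + effectiveLimit;
    }

    /** Builds one URL per page so that all followees are covered. */
    static List<String> allPageUrls(String userId, String include, int followingCount) {
        List<String> urls = new ArrayList<>();
        for (int offset = 0; offset < followingCount; offset += MAX_LIMIT) {
            urls.add(pageUrl(userId, include, offset, MAX_LIMIT));
        }
        return urls;
    }

    public static void main(String[] args) {
        // A user following 45 accounts needs three pages: offsets 0, 20, and 40.
        for (String url : allPageUrls("wo-yan-chen-mo", "data[*].answer_count,articles_count", 45)) {
            System.out.println(url);
        }
    }
}
```

An HTTP client that validates URLs strictly may require the brackets in the include value to be percent-encoded before the request is sent.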

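Features

Makes heavy use of HTTP proxies to get around the request limit applied to a single client. (Note: all proxies used are free public ones found online. In recent tests, some free proxy sites had added anti-crawling measures, so far fewer free proxies are usable than before and crawling is much slower.)
Supports persistence (MongoDB).
Multithreaded, high-performance, and horizontally scalable for distributed crawling.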
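
To make the proxy idea concrete, here is a minimal sketch of a rotating proxy pool. It is an assumption about how such a pool could look, not the project's implementation: the names ProxyPoolSketch, ProxyEntry, borrow, and giveBack are hypothetical, and the eviction rule (drop a proxy after a fixed number of consecutive failures) is chosen only because free proxies fail often.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Illustrative rotating proxy pool (not project code). */
final class ProxyPoolSketch {
    /** One proxy address plus a failure counter. */
    static final class ProxyEntry {
        final String host;
        final int port;
        int consecutiveFailures;

        ProxyEntry(String host, int port) {
            this.host = host;
            this.port = port;
        }
    }

    private final BlockingQueue<ProxyEntry> idle = new LinkedBlockingQueue<>();
    private final int maxFailures;

    ProxyPoolSketch(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    void add(String host, int port) {
        idle.offer(new ProxyEntry(host, port));
    }

    /** Returns the next idle proxy, or null when none is available. */
    ProxyEntry borrow() {
        return idle.poll();
    }

    /** Returns a proxy to the back of the queue unless it has failed too often. */
    void giveBack(ProxyEntry proxy, boolean succeeded) {
        if (succeeded) {
            proxy.consecutiveFailures = 0;
            idle.offer(proxy);
        } else if (++proxy.consecutiveFailures < maxFailures) {
            idle.offer(proxy);
        }
        // Otherwise the proxy is dropped from the pool.
    }

    int size() {
        return idle.size();
    }
}
```

Because each borrowed proxy goes to the back of the queue when it is returned, consecutive requests are spread across different proxies, which is what keeps any single client address under the per-client limit.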

TODO

Add crawling of questions, answers, and articles
Support real-time crawling, updating all trending content across Zhihu every hour

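Updates

2019.02.21

Refactored the project on Spring Boot, with support for horizontal scaling and distributed crawling
Data persistence now uses MongoDB
Replaced HttpClient 4.5 with the Netty-based AsyncHttpClient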

2018.07.09

Zhihu updated its site; authorization is no longer required
Improved unit tests
Fixed known bugs

2017.11.05

Zhihu updated its authorization file; changed how the authorization value is obtained.

2017.05.26

Fixed a java.lang.reflect.UndeclaredThrowableException caused by proxies returning bad data.

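2017.03.30

The Zhihu API changed: the followee list page no longer returns the followee count, so thread pool tasks could not keep going. The crawl mode was switched back to the original ListPageThreadPool and DetailPageThreadPool approach.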

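2017.01.17

Added proxy serialization.
Restructured the project, greatly increasing crawl speed. ListPageThreadPool and DetailPageThreadPool are no longer used; the followee list page is downloaded directly, and it already contains the detailed user profiles.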

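2017.01.10

Dropped login-based crawling and removed the related modules; the main simulated-login logic can be found in ModelLogin.java.
Optimized the project structure to speed up crawling, using two thread pools: ListPageThreadPool and DetailPageThreadPool. ListPageThreadPool downloads the "followed users" list page, parses out the followed users, deduplicates their URLs, and hands them to DetailPageThreadPool. DetailPageThreadPool downloads the user detail page, parses out the user's basic information and stores it, then takes the URL of that user's "followed users" list page and hands it to ListPageThreadPool.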

2016.12.26

Removed unused packages and fixed ConcurrentModificationException and NoSuchElementException issues.
Added guest (no login) crawl mode.
Added a proxy crawling module.
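
The two-pool scheme described in the 2017.01.10 entry can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's code: an in-memory map stands in for the HTTP downloads, the class TwoPoolCrawlSketch and its methods are hypothetical names, and the pending-task counter is added only so the sketch can tell when the crawl has finished.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative two-pool crawl loop (not project code). */
final class TwoPoolCrawlSketch {
    // Stand-in for the network: user -> users that this user follows.
    private final Map<String, List<String>> followees;
    private final ExecutorService listPool = Executors.newFixedThreadPool(2);
    private final ExecutorService detailPool = Executors.newFixedThreadPool(2);
    private final Set<String> seen = ConcurrentHashMap.newKeySet();   // deduplication
    private final Set<String> stored = ConcurrentHashMap.newKeySet(); // stand-in for the database
    private final AtomicInteger pending = new AtomicInteger();
    private final CountDownLatch done = new CountDownLatch(1);

    TwoPoolCrawlSketch(Map<String, List<String>> followees) {
        this.followees = followees;
    }

    /** Crawls outward from the seed user and returns every stored user. */
    Set<String> crawl(String seedUser) {
        seen.add(seedUser);
        submit(detailPool, () -> detailPage(seedUser));
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("crawl interrupted", e);
        } finally {
            listPool.shutdown();
            detailPool.shutdown();
        }
        return stored;
    }

    // Counts a task before queuing it; the last task to finish releases the latch.
    private void submit(ExecutorService pool, Runnable task) {
        pending.incrementAndGet();
        pool.execute(() -> {
            try {
                task.run();
            } finally {
                if (pending.decrementAndGet() == 0) {
                    done.countDown();
                }
            }
        });
    }

    // List-page task: parse followees, deduplicate, hand new users to the detail pool.
    private void listPage(String user) {
        for (String followee : followees.getOrDefault(user, Collections.<String>emptyList())) {
            if (seen.add(followee)) {
                submit(detailPool, () -> detailPage(followee));
            }
        }
    }

    // Detail-page task: store the user, then queue that user's followee list page.
    private void detailPage(String user) {
        stored.add(user);
        submit(listPool, () -> listPage(user));
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("a", Arrays.asList("b", "c"));
        graph.put("b", Arrays.asList("a", "c"));
        graph.put("c", Arrays.asList("d"));
        System.out.println(new TreeSet<>(new TwoPoolCrawlSketch(graph).crawl("a")));
    }
}
```

Keeping list pages and detail pages in separate pools lets each be sized independently, and the shared seen set is what stops users who follow each other from being crawled in a loop.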

Disclaimer

This project is for personal study and exchange only. Commercial use and any improper use are strictly prohibited.

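Finally

If you run into problems, please open an issue.
Code contributions are welcome.
Crawler discussion group: 633925314. Everyone is welcome to join.
If you need the data, follow the official account (公众号) lwndso. The data set is the basic profile information of 1.17 million Zhihu users; it is for personal study and exchange only, and commercial use and any improper use are strictly prohibited.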

Stars: ✭ 890
