
kong36088 / Zhihuspider

A multi-threaded Zhihu user spider, based on Python 3

Programming Languages

python
python3

Projects that are alternatives to or similar to ZhihuSpider

Decryptlogin
APIs for logging in to some websites using requests.
Stars: ✭ 1,861 (+825.87%)
Mutual labels:  zhihu, crawler, spider
Zhihu Crawler
zhihu-crawler is a Java-based high-performance crawler project supporting a free HTTP proxy pool, horizontal scaling, and distributed crawling
Stars: ✭ 890 (+342.79%)
Mutual labels:  zhihu, crawler, spider
Zhihu Login
Zhihu simulated login; supports captcha extraction and saving cookies
Stars: ✭ 340 (+69.15%)
Mutual labels:  zhihu, crawler, spider
Pspider
A simple, easy-to-use Python crawler framework. QQ group: 597510560
Stars: ✭ 1,611 (+701.49%)
Mutual labels:  multi-threading, crawler, spider
Js Reverse
JavaScript reverse-engineering research
Stars: ✭ 159 (-20.9%)
Mutual labels:  crawler, spider
Fooproxy
A robust, efficient, scored and targeted IP proxy pool with an API service. You can plug in your own collectors to crawl proxy IPs, building a database of proxies validated against one or more of your crawler's target sites. Supports MongoDB 4.0; uses Python 3.7. (Scored IP proxy pool; custom proxy-data crawlers can be added at any time)
Stars: ✭ 195 (-2.99%)
Mutual labels:  crawler, spider
Marmot
💐Marmot | Web Crawler/HTTP protocol Download Package 🐭
Stars: ✭ 186 (-7.46%)
Mutual labels:  crawler, spider
Proxy pool
A proxy IP pool for Python crawlers (proxy pool)
Stars: ✭ 13,964 (+6847.26%)
Mutual labels:  crawler, spider
Jlitespider
A lite distributed Java spider framework :-)
Stars: ✭ 151 (-24.88%)
Mutual labels:  crawler, spider
Fun crawler
Crawl some picture for fun
Stars: ✭ 169 (-15.92%)
Mutual labels:  crawler, spider
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (-14.93%)
Mutual labels:  crawler, spider
Yispider
A distributed crawler platform to help you manage and develop crawlers. Ships with a set of crawler definition rules (templates): use templates to define crawlers quickly, or use it as a framework to develop crawlers by hand. (A hobby project, updated whenever it feels lacking)
Stars: ✭ 158 (-21.39%)
Mutual labels:  crawler, spider
Abot
Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Stars: ✭ 1,961 (+875.62%)
Mutual labels:  crawler, spider
Scrapingoutsourcing
ScrapingOutsourcing focuses on sharing crawler code, aiming for one new example per week
Stars: ✭ 164 (-18.41%)
Mutual labels:  crawler, spider
Python3 Spider
Hands-on Python crawling: simulated login for major websites, including but not limited to slider captchas, Pinduoduo, Meituan, Baidu, bilibili, Dianping, and Taobao. Please star if you like it ❤️
Stars: ✭ 2,129 (+959.2%)
Mutual labels:  crawler, spider
Gain
Web crawling framework based on asyncio.
Stars: ✭ 2,002 (+896.02%)
Mutual labels:  crawler, spider
Zhihu Crawler People
A simple distributed crawler for zhihu && data analysis
Stars: ✭ 182 (-9.45%)
Mutual labels:  crawler, spider
Ncov2019 data crawler
Epidemic data crawler; a 2019 novel coronavirus data repository: trajectory data, co-travel data, and reports
Stars: ✭ 175 (-12.94%)
Mutual labels:  crawler, spider
Lianjia Beike Spider
House-price crawler for Lianjia and Beike, collecting housing data (communities, second-hand homes, rentals, new homes) for 21 major Chinese cities including Beijing, Shanghai, Guangzhou, and Shenzhen. Stable, reliable, and fast. Supports CSV, MySQL, MongoDB, Excel, and JSON storage; supports Python 2 and 3; charts the data; richly commented. Star to support. For learning and reference only; commercial use is at your own risk.
Stars: ✭ 2,257 (+1022.89%)
Mutual labels:  crawler, spider
Ok ip proxy pool
🍿 A crawler proxy IP pool (proxy pool) in Python 🍟, a reasonably good IP proxy pool
Stars: ✭ 196 (-2.49%)
Mutual labels:  crawler, spider

ZhihuSpider

User spider for www.zhihu.com

1. Install Python 3 and packages

Make sure you have Python 3 installed, then use pip to install the dependencies.

pip install Image requests beautifulsoup4 html5lib redis PyMySQL 

2. Database config

Install MySQL and create your database, then import init.sql to create the tables.
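The actual table layout is defined by init.sql in the repository. As a rough, hypothetical illustration of what the spider does with it (the table and column names below are assumptions, not the project's real schema), a PyMySQL-style parameterized INSERT can be built like this:

```python
# Hypothetical user record; the real columns come from init.sql.
user = {
    "user_id": "example-user",
    "name": "Example",
    "followers": 0,
}

# Build a parameterized INSERT using the %s placeholder style PyMySQL
# expects, so values are escaped by the driver, not by string formatting.
columns = ", ".join(user)
placeholders = ", ".join(["%s"] * len(user))
sql = "INSERT INTO user ({}) VALUES ({})".format(columns, placeholders)
params = tuple(user.values())

# With a live connection this would be passed to cursor.execute(sql, params).
print(sql)
```
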

3. Install Redis

# (ubuntu)
apt-get install redis

# or (centos)

yum install redis

# or (macos)
brew install redis

4. Configure your application

Complete config.ini.
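config.ini is a plain INI file, so it can be read with the standard-library configparser. A minimal sketch, assuming the section and option names the README mentions ([sys] with sleep_time and thread_num; the MySQL/Redis/account sections are omitted here):

```python
import configparser

# Example config in the shape the README describes; the real file
# also holds MySQL, Redis, and Zhihu account settings.
example = """
[sys]
sleep_time = 3
thread_num = 10
"""

config = configparser.ConfigParser()
config.read_string(example)  # use config.read("config.ini") for the real file

sleep_time = config.getint("sys", "sleep_time")
thread_num = config.getint("sys", "thread_num")
print(sleep_time, thread_num)
```
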

5. Get started

python get_user.py

# or use the python3 command

python3 get_user.py

Details

A detailed walkthrough of the code is available on my blog: I used Python to crawl the data of one million Zhihu users

Statistical analysis of the data: Data analysis of a million Zhihu users

This is a multi-threaded program that crawls Zhihu users.
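The general shape of such a multi-threaded crawler can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: fetch() is a stub standing in for the real HTTP request and HTML parsing, and the in-process set stands in for the Redis de-duplication the project actually uses:

```python
import queue
import threading

THREAD_NUM = 4  # in the real project this comes from config.ini

tasks = queue.Queue()
seen = set()                  # stands in for the Redis de-dup set
seen_lock = threading.Lock()
results = []
results_lock = threading.Lock()

def fetch(user_token):
    """Stub: the real spider requests the user's Zhihu page and parses
    profile fields plus the tokens of followed users."""
    return {"token": user_token, "followees": []}

def worker():
    while True:
        token = tasks.get()
        try:
            profile = fetch(token)
            with results_lock:
                results.append(profile)
            # Enqueue newly discovered users exactly once.
            for nxt in profile["followees"]:
                with seen_lock:
                    if nxt in seen:
                        continue
                    seen.add(nxt)
                tasks.put(nxt)
        finally:
            tasks.task_done()

for seed in ["user-a", "user-b", "user-c"]:
    seen.add(seed)
    tasks.put(seed)

for _ in range(THREAD_NUM):
    threading.Thread(target=worker, daemon=True).start()

tasks.join()  # block until every queued user has been processed
print(len(results))
```
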

Requirements

Required packages: beautifulsoup4 html5lib image requests redis PyMySQL

Install all dependencies with pip:

pip install Image requests beautifulsoup4 html5lib redis PyMySQL 

The runtime environment must support Chinese.

Tested on Python 3.5; other environments are not guaranteed to run perfectly.

1. Install MySQL and Redis.

2. Edit config.ini: configure MySQL and Redis, and fill in your Zhihu account. (The newer crawler on the master branch does not require login, but may hit freshness issues; you can switch to the new-ui branch instead.)

You can throttle the crawler via [sys] sleep_time in config.ini (stick to the recommended value; crawling too fast will get you banned by Zhihu), and set the number of threads with thread_num.
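The effect of sleep_time can be sketched as a simple per-request delay, a minimal illustration rather than the project's actual throttling code (SLEEP_TIME is kept tiny here only for the demo):

```python
import time

SLEEP_TIME = 0.01  # seconds; the README recommends keeping the configured value

def polite_fetch(url):
    """Sleep before each request so the crawl rate stays bounded and
    Zhihu's anti-crawling limits are less likely to trigger."""
    time.sleep(SLEEP_TIME)
    return "<html>stub response for {}</html>".format(url)

start = time.monotonic()
pages = [polite_fetch("https://www.zhihu.com/people/user-{}".format(i))
         for i in range(5)]
elapsed = time.monotonic() - start
print(len(pages), elapsed)
```

With sleep_time = 3 from the recommended config, each thread makes at most about one request every three seconds.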

3. Import init.sql into the database.

Run

Start crawling: python get_user.py
Check the crawl count: python check_redis.py

Screenshots

Screenshot 1 Screenshot 2

Docker

If you want to skip manual setup, you can use Docker to build a basic environment the way I did; MySQL and Redis use the official images:

docker run --name mysql -itd mysql:latest
docker run --name redis -itd redis:latest

Then run the Python image with docker-compose. My docker-compose.yml for the Python service:

python:
    container_name: python
    build: .
    ports:
      - "84:80"
    external_links:
      - memcache:memcache
      - mysql:mysql
      - redis:redis
    volumes:
      - /docker_containers/python/www:/var/www/html
    tty: true
    stdin_open: true
    extra_hosts:
      - "python:192.168.102.140"
    environment:
        PYTHONIOENCODING: utf-8
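The PYTHONIOENCODING: utf-8 entry matters because the spider prints and stores Chinese text; in a container whose locale defaults to ASCII, printing would otherwise raise UnicodeEncodeError. A quick sanity check:

```python
import sys

name = "知乎用户"  # Chinese text the spider routinely handles
encoded = name.encode("utf-8")

# With PYTHONIOENCODING=utf-8, stdout can emit Chinese without
# raising UnicodeEncodeError even under a C/POSIX locale.
print(name, sys.stdout.encoding)
```
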

My Dockerfile:

FROM kong36088/zhihu-spider:latest

Donate

Your support is my greatest encouragement! Thank you for treating me to candy. wechatpay alipay
