AnyISalIn / Zhihu_fun

A Selenium-based keyword crawler for Zhihu

zhihu_fun

A Selenium-based keyword crawler for Zhihu. Only Python 3 is supported.

Demo

[Screenshots: web_demo, keyword_demo, result_demo, data_demo]

Installation and configuration

Install phantomjs

zhihu_fun depends on phantomjs, and the version must be greater than 2.1

$ wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2

$ tar xf phantomjs-2.1.1-linux-x86_64.tar.bz2 -C /opt/

$ ln -sv /opt/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/bin/ # make sure phantomjs is on the system PATH
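
A quick way to confirm the install before running the crawler is sketched below. It is not part of zhihu_fun; the function names are hypothetical, and it assumes `phantomjs --version` prints a plain dotted version such as 2.1.1.

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Turn the leading dotted number in '2.1.1' into a tuple like (2, 1, 1)."""
    match = re.match(r'\s*(\d+(?:\.\d+)*)', text)
    if match is None:
        raise ValueError('no version number found in %r' % text)
    return tuple(int(part) for part in match.group(1).split('.'))


def phantomjs_is_new_enough(version_text, minimum=(2, 1)):
    """True when the version is strictly greater than 2.1, as the README requires."""
    return parse_version(version_text) > minimum


def installed_phantomjs_version():
    """Return the version string of the phantomjs found on PATH, or None if absent."""
    path = shutil.which('phantomjs')
    if path is None:
        return None
    result = subprocess.run([path, '--version'],
                            stdout=subprocess.PIPE,
                            universal_newlines=True)
    return result.stdout.strip()
```

Calling `installed_phantomjs_version()` and passing a non-None result to `phantomjs_is_new_enough()` tells you whether the symlink above worked.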

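Configure Nginx

Why use Nginx? It is actually optional; see this issue for the reason

"The README is not very friendly to beginners like me" (#5)

# Make sure autoindex and charset are configured correctly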
server {
        listen 80 default_server;
        listen [::]:80 default_server ipv6only=on;

        root /usr/share/nginx/html/zhihu_fun;
        autoindex on;
        index index.html index.htm;
        charset UTF-8;
        # Make site accessible from http://localhost/
        server_name localhost;

        location / {
                # First attempt to serve request as file, then
                # as directory, then fall back to displaying a 404.
                try_files $uri $uri/ =404;
                # Uncomment to enable naxsi on this location
                # include /etc/nginx/naxsi.rules
        }
}
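
Since Nginx is described as optional, one possible stand-in is Python's built-in static file server. The sketch below is an assumption, not something the project documents: `make_server` is a hypothetical helper, it needs Python 3.7 or later for `ThreadingHTTPServer` and the `directory` argument, and its directory listing only approximates what `autoindex on` provides.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory, host='127.0.0.1', port=8000):
    """Build a static file server rooted at `directory` (hypothetical helper)."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer((host, port), handler)


# Usage (blocks until interrupted):
#   server = make_server('/usr/share/nginx/html/zhihu_fun')
#   server.serve_forever()
```

Port 8000 is chosen here only because it is the default of Python's http.server; any free port works as long as the address used in the browser matches.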

Get the Cookie

Log in to Zhihu as usual, then copy the Cookie from the Network tab of the browser developer tools

get_cookie
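
The value copied from the Network tab is a single `name=value; name=value` string. To check that it was copied completely before pasting it into the config, a small parser like the one below can help. This is a sketch, not zhihu_fun's own code, and `parse_cookie_header` is a hypothetical name.

```python
def parse_cookie_header(raw):
    """Split a copied Cookie header value into a {name: value} dict."""
    cookies = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        if not pair or '=' not in pair:
            continue  # skip empty fragments, e.g. after a trailing ';'
        # Split on the first '=' only, because values may contain '=' themselves.
        name, value = pair.split('=', 1)
        cookies[name.strip()] = value.strip()
    return cookies
```

If the resulting dict is empty or is missing names you can see in the browser, the string was probably truncated while copying.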

Configure and run zhihu_fun

$ python
Python 3.5.3 # only Python 3 is supported

$ git clone https://github.com/anyisalin/zhihu_fun.git /usr/share/nginx/html/zhihu_fun

$ cd /usr/share/nginx/html/zhihu_fun

$ vim go.html # change the address in <base href="http://localhost:8000"> to your current address

$ pip install -r requirements.txt # install dependencies

$ vim zhihu_fun/config.py # set Cookie to your own Cookie, or change other options

$ python run.py # run the crawler

Configuration options

The configuration file is zhihu_fun/config.py

config = {
    'start_url': 'https://www.zhihu.com/search?type=content&q=%E7%BE%8E%E8%85%BF',  # where the crawler starts; if unset, the Zhihu home page is used
    # 'start_url': '',
    'cookie': 'Your Cookie', # log in to Zhihu and copy the Cookie from your browser
    'root_url': 'https://www.zhihu.com',
    'log_level': 'info',  # supports debug, info, warn
    'custom_urls': ['https://www.zhihu.com/search?type=content&q=%E7%BE%8E+%E7%BE%8E%E5%A5%B3', # custom URLs to crawl in addition
                    'https://www.zhihu.com/topic/19552207/hot',
                    'https://www.zhihu.com/question/51603251',
                    'https://www.zhihu.com/question/51644416',
                    'https://www.zhihu.com/topic/20011035/hot'],
    'keyword': ['美女', '萌', '女生', '腿长', '女性',    # matched against question titles; key_number sets how many keywords must match before a question joins the crawl queue
                '日系', '可爱', '女神', '美腿', '成长',
                '炼成', '吸引', '美', '健身', '丝袜',
                '容貌', '拍照', '女生', '漂亮', '颜值',
                '搭配', '长得', '好看', '衣服', '姑娘',
                '穿', '俗气', '风格', '眼睛', '锻炼',
                '感觉', '感受', '长的', '大学生'],
    'blacklist': ['男生', '男性', '伪娘', '男友', '男人', '男朋友'], # blacklist; a question whose title contains any of these words is skipped
    'key_number': 2,
    'vote_up': 10,  # an answer's upvote count decides whether its images are crawled
    'url_generate_time': 30  # how long the URL generator runs; None means run forever; must not be '' or ""
}
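
The sketch below shows one way to read these options together. It is an illustration, not the project's actual code: the function names are hypothetical, the config is a trimmed copy, and using "at least" (`>=`) for both `key_number` and `vote_up` is an assumption, since the comments do not say whether the thresholds are inclusive.

```python
from urllib.parse import urlencode

# Trimmed copy of the options above, holding only the keys used in this sketch.
config = {
    'root_url': 'https://www.zhihu.com',
    'keyword': ['美女', '女生', '搭配', '衣服', '好看'],
    'blacklist': ['男生', '男性'],
    'key_number': 2,
    'vote_up': 10,
}


def build_search_url(query, root_url=config['root_url']):
    """Build a percent-encoded search URL like the start_url shown above."""
    return root_url + '/search?' + urlencode({'type': 'content', 'q': query})


def title_matches(title, cfg=config):
    """Assumed rule: no blacklisted word, and at least key_number distinct keywords."""
    if any(word in title for word in cfg['blacklist']):
        return False
    hits = {word for word in cfg['keyword'] if word in title}
    return len(hits) >= cfg['key_number']


def answer_qualifies(vote_count, cfg=config):
    """Assumed rule: images are crawled when upvotes reach vote_up."""
    return vote_count >= cfg['vote_up']
```

For example, `build_search_url('美腿')` reproduces the `start_url` value in the config, which is a convenient way to generate entries for `start_url` or `custom_urls` from plain keywords.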