brianway / Webporter

A Java crawler application based on webmagic

webporter

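webporter is a Java crawler application built on webmagic, a vertical crawler framework. It aims to provide a complete working example of data crawling, persistent storage, and visualization.

The name webporter alludes to the slogan "We don't produce data; we are just porters of the Internet."

If you like it, please star this repository first. It is a recognition and encouragement for me. Thanks.

Currently only a crawler example for Zhihu user data is provided. The project is adjusted and extended from time to time; to follow updates, please watch, star, or fork.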


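Main features of webporter:

  • Built on webmagic, a Java crawler framework developed in China, which makes it a refreshing alternative among the many Python crawlers
  • Fully modular design with strong extensibility
  • A simple core that still covers the complete workflow of a crawler application, making it a practical example
  • Configured through JSON, with no need to modify the source code
  • Multithreading support
  • Bulk import into Elasticsearch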
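
To illustrate the bulk import mentioned in the feature list, here is a minimal sketch of the newline-delimited body that the Elasticsearch _bulk endpoint accepts. It is not taken from webporter's source: the index name zhihu, the type member, the ids, and the class BulkBodySketch are all hypothetical. Reusing a stable per-user id as _id is one common way to make repeated imports overwrite documents instead of duplicating them; whether webporter deduplicates this way is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: builds a request body for the Elasticsearch _bulk endpoint. */
final class BulkBodySketch {

    /**
     * @param index hypothetical index name
     * @param type  hypothetical mapping type (Elasticsearch 5.x expects one)
     * @param docs  document id -> already-serialized, single-line JSON source
     *
     * Assumption: index, type, and ids contain no characters that need JSON escaping.
     */
    static String buildBulkBody(String index, String type, Map<String, String> docs) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // Action line: tells Elasticsearch where to index the line that follows.
            body.append("{\"index\":{\"_index\":\"").append(index)
                .append("\",\"_type\":\"").append(type)
                .append("\",\"_id\":\"").append(doc.getKey()).append("\"}}\n");
            // Source line: the document itself. Every line, including the last, ends with \n.
            body.append(doc.getValue()).append('\n');
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("u1", "{\"name\":\"alice\"}");
        docs.put("u2", "{\"name\":\"bob\"}");
        // The printed text could be sent as the body of a POST to /_bulk.
        System.out.print(buildBulkBody("zhihu", "member", docs));
    }
}
```

Sending many documents in one request like this avoids one HTTP round trip per document, which matters when importing hundreds of thousands of records.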

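Note: webporter is not a crawler framework. It is an example of how to use a crawler framework in practice, it is a hobby project, and it is not recommended for production use. For production, use webmagic or scrapy.

The architecture and design of webporter's core modules are largely based on webmagic: https://github.com/code4craft/webmagic

webporter on GitHub: https://github.com/brianway/webporter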

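Results

For a detailed data analysis, see my blog post 《爬取知乎60万用户信息之后的简单分析》 (A Simple Analysis After Crawling the Information of 600,000 Zhihu Users).

  • Downloaded data: after deduplication and import into Elasticsearch, there are roughly 600,000+ user records (no anti-crawling restrictions encountered so far)

[Figure: index status]

  • Sample analysis: the top 10 industries of Zhihu users, obtained through aggregation (gender codes: 1 = male, 0 = female, -1 = unknown)

[Figure: top 10 industry distribution]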
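
As a rough illustration of what a "top 10 by count" aggregation computes, the sketch below does the same counting locally with Java streams. It is not the Elasticsearch query used for the chart above; the class name TopIndustriesSketch and the sample industry values are made up for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Sketch: count occurrences of each value and keep the n most frequent. */
final class TopIndustriesSketch {

    /** Returns the n most frequent values, most frequent first (ties broken alphabetically). */
    static Map<String, Long> topN(List<String> industries, int n) {
        // Step 1: bucket by value and count each bucket.
        Map<String, Long> counts = industries.stream()
            .collect(Collectors.groupingBy(s -> s, Collectors.counting()));
        // Step 2: sort buckets by count (descending), then keep the first n in that order.
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(n)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for the industry field of crawled users.
        List<String> sample = Arrays.asList(
            "Internet", "Finance", "Internet", "Education", "Finance", "Internet");
        System.out.println(topN(sample, 2)); // prints {Internet=3, Finance=2}
    }
}
```

In Elasticsearch the bucketing and sorting happen on the server, so only the top buckets travel over the network rather than all user records.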

Repository layout

Requirements

  • JDK 1.8+
  • Maven 3.3+
  • Elasticsearch 5.0.1
  • Kibana 5.0.1

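Newcomers can refer to my blog post 《Elasticsearch 5.0-安装使用》 (Elasticsearch 5.0: Installation and Usage) to get started quickly with Elasticsearch and Kibana.

Quick start

The steps below use the Zhihu user data crawler as an example.

1. Customize the configuration file

The configuration file is located at webporter-collector-zhihu/src/main/resources/config.json. Example: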

{
  "site": {
    "domain": "www.zhihu.com",
    "headers": {
      "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36",
      "authorization": "Your own authorization here."
    },
    "retryTimes": 3,
    "sleepTime": 500
  },
  "base_dir": "/Users/brian/todo/data/zhihu/"
}

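Only two fields need to be changed: authorization and base_dir.

  • authorization: while logged in to your Zhihu account, capture the traffic in your browser and copy the value of this HTTP request header. If in doubt, see issue 3
  • base_dir: the root directory where data files are saved; it must be writable

Once these are set, the crawler is ready to use. For more site property settings, see WebMagic in Action - Site Config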
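
Before launching the crawler, it can help to sanity-check the two values you edited. The following is a minimal sketch using only the JDK; it is not part of webporter. The class ConfigCheckSketch and its method are hypothetical, the values are passed in directly rather than parsed from config.json, and the check assumes base_dir must already exist (whether webporter creates it on demand is not stated here).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Sketch: checks the two config values that the README asks you to edit. */
final class ConfigCheckSketch {

    // The placeholder text from the sample config.json.
    static final String PLACEHOLDER = "Your own authorization here.";

    /** Returns a list of human-readable problems; an empty list means both values look usable. */
    static List<String> problems(String authorization, String baseDir) {
        List<String> problems = new ArrayList<>();
        if (authorization == null || authorization.trim().isEmpty()
                || PLACEHOLDER.equals(authorization)) {
            problems.add("authorization is missing or still the sample placeholder");
        }
        if (baseDir == null || baseDir.trim().isEmpty()) {
            problems.add("base_dir is empty");
            return problems;
        }
        Path dir = Paths.get(baseDir);
        if (!Files.isDirectory(dir)) {
            // Assumption: the directory is expected to exist before the crawler starts.
            problems.add("base_dir is not an existing directory: " + dir);
        } else if (!Files.isWritable(dir)) {
            problems.add("base_dir is not writable: " + dir);
        }
        return problems;
    }

    public static void main(String[] args) {
        // Uses the JVM temp directory as a stand-in for base_dir.
        String tmp = System.getProperty("java.io.tmpdir");
        System.out.println(problems(PLACEHOLDER, tmp));
    }
}
```

Returning a list of problems instead of throwing on the first one lets you fix both fields in a single pass.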

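2. Start the crawler

Run the main methods of the following two classes in the webporter-collector-zhihu module, one after the other. (Note: the two stages are sequential, so do not start both classes at the same time.)

3. Visualization

After installing Elasticsearch and Kibana, use Visualize in Kibana to visualize the data.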

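Sponsorship

If you find this project helpful, please scan the QR codes below to support me so that I can better maintain and update it. Thank you for your support!

Alipay  WeChat

TODO

  • Data crawling: fetch Zhihu user data
  • Data persistence: import the data into Elasticsearch
  • Visualization: simple analysis and display of the data through a front-end framework
  • Improve the code using new Java 8 features
  • Dockerize this repository so that users can run it directly

Contact the author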

Email: [email protected]

License

Licensed under the Apache 2.0 license