
jerry-sc / webhunger

License: Apache-2.0
WebHunger is an extensible, full-scale crawler framework that supports distributed crawling. It aims to let users focus on parsing web pages without worrying about the crawling process.

Programming languages: Java, JavaScript, CSS

Under active development.

Readme in Chinese

Motivation

Note: We call a crawler that crawls an entire site a full-scale crawler.

First, let's summarize the main types of crawlers, their characteristics, and the corresponding open source frameworks.

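| Crawler type          | Crawl num | Forbidden risk | Open source framework |
| --------------------- | --------- | -------------- | --------------------- |
| Search engine crawler | Unknown   | Low            | Nutch                 |
| Vertical crawler      | Known     | Moderate       | WebMagic, SpiderMan   |
| Full-scale crawler    | Unknown   | High           | None                  |
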
  • Page number (crawl num): If we know in advance what content we want and how much data we need, it is easy to size the crawler for the required performance. When we do not know, this becomes difficult. That is a major problem for a full-scale crawler, which tries to crawl as many pages of a site as possible.
  • Forbidden risk: A search engine crawler covers the entire Internet, so it can avoid hitting any one site too often by picking a URL from a different site each time, which reduces the risk of being blocked. For a full-scale crawler the risk is higher, because it requests many pages from the same site in a short time.
  • Open source framework: For search engines there is the well-known Nutch, and there are several good open source frameworks for vertical crawlers, such as WebMagic and SpiderMan. But I have not found any framework for full-scale crawling. Other frameworks can be adapted to cover some needs, but the performance is not ideal.

In addition, the daily work of our laboratory needs a full-scale crawler, and colleagues who work on data mining often ask me to crawl websites for them, because the existing crawler system is not friendly to novices. So I came up with the idea of WebHunger.

What's WebHunger

The name WebHunger combines the words web and hunger, meaning hungry for web resources.

WebHunger is an extensible, full-scale crawler framework that supports distributed crawling. It aims to let users focus on parsing web pages without worrying about the crawling process. To get results, the user only needs to submit a seed URL and a page-parsing Java class to the framework. When crawling is complete, the framework promptly returns the crawl results to the user. The user needs no knowledge of distributed programming or of how the crawler works internally, which makes the framework easy to use.
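
As a rough illustration of what a page-parsing class might look like, here is a minimal sketch. The `PageParser` interface, its `parse(url, html)` signature, and the `TitleCollector` class are hypothetical names invented for this example; they are not taken from WebHunger's actual API, which this README does not show.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical contract between the framework and user code.
// The name and signature are assumptions, not WebHunger's real interface.
interface PageParser {
    // Called once per crawled page with its URL and raw HTML.
    void parse(String url, String html);
}

// Example user class: records the <title> of every page it receives.
class TitleCollector implements PageParser {
    private static final Pattern TITLE = Pattern.compile(
            "<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    final List<String> titles = new ArrayList<>();

    @Override
    public void parse(String url, String html) {
        Matcher m = TITLE.matcher(html);
        if (m.find()) {
            titles.add(url + " -> " + m.group(1).trim());
        }
        // Pages without a <title> element are silently skipped.
    }
}

class ParserSketch {
    public static void main(String[] args) {
        TitleCollector parser = new TitleCollector();
        // Stand-in for the framework handing over one crawled page.
        parser.parse("http://example.com/", "<html><title> Example </title></html>");
        System.out.println(parser.titles);
    }
}
```

In this sketch the user writes only `TitleCollector`; fetching, scheduling, and distribution would all stay on the framework side.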

To make monitoring easy, WebHunger provides a web console. Some screenshots are shown below.

image

On the page shown above, you can start crawling a site, pause it, re-crawl it, and perform other operations.

image

Here you can see the site's crawl progress, the page parsing progress, and the links that failed to crawl.

How WebHunger works

image

All of the components with a yellow background in the above figure can be independently deployed to run on any server; solid lines indicate local calls; dotted lines indicate remote calls.

WebHunger consists of the following major components:

  1. Controller: Responsible for managing sites.
  2. Crawler: Following the URL scheduling strategy specified by the Controller, fetches URLs from Redis, crawls them, and sends the crawled results to the message queue.
  3. Page Consumer: Pulls page messages from the message queue and calls the user-defined page-parsing Java class.
  4. ZooKeeper Cluster: Handles site state storage, service discovery, distributed locking, and related work.
  5. RocketMQ: Stores crawled pages and distributes them.
  6. Redis Cluster: Stores the URLs to be crawled and filters duplicate URLs.
  7. Web Console: A web-based monitoring platform.
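
The data flow between the Crawler, Redis, the message queue, and the Page Consumer can be sketched in a single process. This is only an illustration under strong assumptions: `UrlFrontier` is an in-memory stand-in for the Redis role (pending URLs plus duplicate filter), an `ArrayDeque` stands in for RocketMQ, the "site" is a hard-coded link map instead of real HTTP fetches, and the Controller, ZooKeeper, and all remote calls are left out. None of the class or method names come from WebHunger itself.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// In-memory stand-in for the Redis role: URLs to crawl plus a duplicate filter.
class UrlFrontier {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    // Accepts a URL only the first time it is offered.
    boolean offer(String url) {
        if (!seen.add(url)) {
            return false; // duplicate, filtered out
        }
        pending.add(url);
        return true;
    }

    // Returns the next URL to crawl, or null when nothing is left.
    String poll() {
        return pending.poll();
    }
}

class PipelineSketch {
    // Crawls a fake site given as url -> outgoing links and returns the
    // URLs in the order the "Page Consumer" step handled them.
    static List<String> run(String seed, Map<String, List<String>> site) {
        UrlFrontier frontier = new UrlFrontier();
        Queue<String> pageQueue = new ArrayDeque<>(); // stand-in for the message queue
        List<String> parsed = new ArrayList<>();

        frontier.offer(seed);
        String url;
        while ((url = frontier.poll()) != null) {
            // Crawler step: "fetch" the page, publish it, schedule its links.
            pageQueue.add(url);
            for (String link : site.getOrDefault(url, Collections.<String>emptyList())) {
                frontier.offer(link);
            }
            // Page Consumer step: pull messages and hand them to user parsing code.
            String page;
            while ((page = pageQueue.poll()) != null) {
                parsed.add(page); // a real consumer would call the user's parser here
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = new HashMap<>();
        site.put("/", Arrays.asList("/a", "/b"));
        site.put("/a", Arrays.asList("/", "/b")); // links back to already-seen pages
        System.out.println(run("/", site));
    }
}
```

In the sketch, the duplicate filter is what makes the crawl terminate on a site with cyclic links. In a deployment like the one described above, the crawler and consumer steps would run as separate processes connected through the queue rather than in one loop.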