Alternatives and detailed information of webhunger

jerry-sc / webhunger

License: Apache-2.0

WebHunger is an extensible, full-scale crawler framework that supports distributed crawling. It aims to let users focus on web page parsing without worrying about the crawling process.

Programming Languages

java

68154 projects - #9 most used programming language

javascript

184084 projects - #8 most used programming language

CSS

56736 projects

Projects that are alternatives of or similar to webhunger

Pottery

Redis for humans. 🌎🌍🌏

Stars: ✭ 204 (+1100%)

Mutual labels: distributed

Dweb.page

Your Gateway to the Distributed Web

Stars: ✭ 239 (+1305.88%)

Mutual labels: distributed

Ray

An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.

Stars: ✭ 18,547 (+109000%)

Mutual labels: distributed

Vernemq

A distributed MQTT message broker based on Erlang/OTP. Built for high quality & Industrial use cases.

Stars: ✭ 2,628 (+15358.82%)

Mutual labels: distributed

Brainiak

Brain Imaging Analysis Kit

Stars: ✭ 232 (+1264.71%)

Mutual labels: distributed

Spring Boot Start Current

Spring Boot scaffold: MyBatis, Spring Security, JWT authorization, Spring Cache + Redis

Stars: ✭ 246 (+1347.06%)

Mutual labels: distributed

Scannerl

The modular distributed fingerprinting engine

Stars: ✭ 208 (+1123.53%)

Mutual labels: distributed

celery-monitor

A Celery monitoring app written with Django.

Stars: ✭ 92 (+441.18%)

Mutual labels: distributed

Flambe

An ML framework to accelerate research and its path to production.

Stars: ✭ 236 (+1288.24%)

Mutual labels: distributed

Cat

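CAT is a foundational component for server-side projects that provides clients in Java, C/C++, Node.js, Python, Go, and other languages. It is deeply integrated into Meituan-Dianping's infrastructure middleware (MVC framework, RPC framework, database framework, cache framework, message queue, configuration system, and so on) and provides Meituan-Dianping's business lines with rich performance metrics, health status, real-time alerting, and more.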

Stars: ✭ 16,236 (+95405.88%)

Mutual labels: distributed

Ruby Spark

Ruby wrapper for Apache Spark

Stars: ✭ 221 (+1200%)

Mutual labels: distributed

Coerce Rs

Coerce - an asynchronous (async/await) Actor runtime and cluster framework for Rust

Stars: ✭ 231 (+1258.82%)

Mutual labels: distributed

Shardingsphere Elasticjob Cloud

Stars: ✭ 248 (+1358.82%)

Mutual labels: distributed

Pysr

Simple, fast, and parallelized symbolic regression in Python/Julia via regularized evolution and simulated annealing

Stars: ✭ 213 (+1152.94%)

Mutual labels: distributed

Tensorflow

An Open Source Machine Learning Framework for Everyone

Stars: ✭ 161,335 (+948929.41%)

Mutual labels: distributed

Gerapy

Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js

Stars: ✭ 2,601 (+15200%)

Mutual labels: distributed

Powerjob

Enterprise job scheduling middleware with distributed computing ability.

Stars: ✭ 3,231 (+18905.88%)

Mutual labels: distributed

Multi-Node-TimescaleDB

The multi-node setup of TimescaleDB 🐯🐯🐯 🐘 🐯🐯🐯

Stars: ✭ 42 (+147.06%)

Mutual labels: distributed

itc.lua

A Lua implementation of Interval Tree Clocks

Stars: ✭ 21 (+23.53%)

Mutual labels: distributed

Sia

Blockchain-based marketplace for file storage. Project has moved to GitLab: https://gitlab.com/NebulousLabs/Sia

Stars: ✭ 2,731 (+15964.71%)

Mutual labels: distributed


Under active development.

Readme in Chinese

Motivation

Note: We call a crawler that crawls an entire site a full-scale crawler.

First, let's summarize the main types of crawlers, their characteristics, and the corresponding open source frameworks.

Crawler type	Page number	Forbidden risk	Open source framework
Search engine crawler	Unknown	Low	Nutch
Vertical crawler	Known	Moderate	WebMagic, SpiderMan
Full-scale crawler	Unknown	High	None

Page number: If we know in advance what content we want and how many pages we need, it is easy to work out the performance the crawler requires; if we do not, this becomes difficult. That is a major problem for a full-scale crawler, which tries to crawl as many pages of a site as possible.
Forbidden risk: Search engine crawlers crawl the entire Internet, so they can easily avoid hitting one site too often by choosing a URL from a different site each time, which reduces the risk of being blocked. For a full-scale crawler the risk is higher, because it requests many pages from the same site within a short time.
Open source framework: For search engines there is the well-known framework Nutch, and for vertical crawlers there are several good open source frameworks, such as WebMagic and SpiderMan. But I have not found any framework for full-scale crawlers. Adapting one of the other frameworks can meet some needs, but the performance is not ideal.

In addition, the daily work of our laboratory requires a full-scale crawler, and colleagues who work on data mining often ask me to crawl websites for them, but the existing crawler system is not friendly to novices. So I came up with the idea of WebHunger.

What's WebHunger

The name WebHunger combines the words web and hunger, meaning to be hungry for web resources.

WebHunger is an extensible, full-scale crawler framework that supports distributed crawling. It aims to let users focus on parsing web pages without worrying about the crawling process. To get results, the user only needs to submit a seed URL and a page-parsing Java class to the framework. After crawling completes, the framework promptly returns the results to the user. Users need no knowledge of distributed programming or of how the crawler works internally, which makes the framework easy to use.
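As a rough illustration of what a "page-parsing Java class" could look like, here is a minimal sketch. The PageParser interface, its parse(url, html) signature, and the TitleCollector class are hypothetical names invented for this example; they are not taken from WebHunger's actual API, which may look quite different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical callback interface (an assumption, not WebHunger's real API):
// the framework would call parse() once for every page it has crawled.
interface PageParser {
    void parse(String url, String html);
}

// Example user-supplied parser: records the <title> of each crawled page.
class TitleCollector implements PageParser {
    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    final List<String> titles = new ArrayList<>();

    @Override
    public void parse(String url, String html) {
        Matcher m = TITLE.matcher(html);
        if (m.find()) {
            // Pages without a <title> element are simply skipped.
            titles.add(url + " -> " + m.group(1).trim());
        }
    }
}

class ParserDemo {
    public static void main(String[] args) {
        TitleCollector parser = new TitleCollector();
        // Here we call parse() by hand; in a real deployment the framework would do it.
        parser.parse("http://example.com/", "<html><title> Example </title></html>");
        System.out.println(parser.titles);
    }
}
```

The point of the sketch is the division of labour: the user writes only the parsing logic, and everything about fetching, scheduling, and distribution stays on the framework side.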

To make monitoring easy, WebHunger provides a web console. Some screenshots are shown below.

On the page shown above, you can start crawling a site, pause it, re-crawl it, and perform other operations.

And here you can see the site's crawl progress, the page parsing progress, and the links that failed to crawl.

How WebHunger works

All of the components with a yellow background in the above figure can be independently deployed to run on any server; solid lines indicate local calls; dotted lines indicate remote calls.

WebHunger consists of the following major components:

Controller: Responsible for managing sites.
Crawler: Fetches URLs from Redis according to the URL scheduling strategy specified by the Controller, crawls them, and sends the crawled results to the message queue.
Page Consumer: Pulls page messages from the message queue and calls the user-defined page-parsing Java class.
ZooKeeper Cluster: Handles site state storage, service discovery, distributed locking, and related work.
RocketMQ: Stores crawled pages and distributes them.
Redis Cluster: Responsible for storing the URLs to be crawled and filtering duplicate URLs.
Web Console: A web-based monitoring platform.
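The data flow between these components can be sketched in a single process. The following is only an in-memory illustration, not WebHunger's implementation: UrlFrontier stands in for the Redis role (pending URLs plus duplicate filtering), a plain Queue stands in for RocketMQ, and the "website" is a hard-coded map of links. All class and method names here are assumptions made for the example, and the first-in, first-out order is just one possible scheduling strategy.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// In-memory stand-in for the Redis role: URLs to be crawled plus duplicate filtering.
class UrlFrontier {
    private final Deque<String> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    // Returns true only the first time a URL is offered; duplicates are dropped.
    boolean offer(String url) {
        if (!seen.add(url)) {
            return false;
        }
        pending.addLast(url);
        return true;
    }

    // Returns the next URL to crawl, or null when none are left.
    String poll() {
        return pending.pollFirst();
    }
}

class PipelineSketch {
    // Crawler role: take URLs from the frontier, "fetch" them from the fake site,
    // publish each page to the queue, and feed newly found links back to the frontier.
    static List<String> crawl(String seed, Map<String, List<String>> site, Queue<String> pageQueue) {
        UrlFrontier frontier = new UrlFrontier();
        frontier.offer(seed);
        List<String> crawlOrder = new ArrayList<>();
        String url;
        while ((url = frontier.poll()) != null) {
            crawlOrder.add(url);
            pageQueue.add(url); // stand-in for sending a page message to the message queue
            for (String link : site.getOrDefault(url, List.of())) {
                frontier.offer(link);
            }
        }
        return crawlOrder;
    }

    public static void main(String[] args) {
        // Fake site: each URL maps to the links found on that page.
        Map<String, List<String>> site = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/", "/b"),
                "/b", List.of("/a"));
        Queue<String> pageQueue = new ArrayDeque<>();
        crawl("/", site, pageQueue);

        // Page Consumer role: pull page messages and hand them to the user's parser.
        String page;
        while ((page = pageQueue.poll()) != null) {
            System.out.println("parsing " + page);
        }
    }
}
```

In a distributed deployment the frontier and the queue would live in separate services, so that several crawlers and page consumers can share them; the sketch only shows why each page is crawled and parsed once even though the pages link to each other.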

