ScaleUnlimited / flink-crawler

License: Apache-2.0

Continuous scalable web crawler built on top of Flink and crawler-commons

Programming Languages

java

68154 projects - #9 most used programming language

Projects that are alternatives of or similar to flink-crawler

Gopa

[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn

Stars: ✭ 277 (+477.08%)

Mutual labels: crawler, spider, web-crawler, crawling

Spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

Stars: ✭ 656 (+1266.67%)

Mutual labels: crawler, spider, web-crawler

Awesome Crawler

A collection of awesome web crawlers and spiders in different languages

Stars: ✭ 4,793 (+9885.42%)

Mutual labels: crawler, spider, web-crawler

Arachnid

Powerful web scraping framework for Crystal

Stars: ✭ 68 (+41.67%)

Mutual labels: crawler, spider, crawling

Webster

A reliable high-level web crawling & scraping framework for Node.js.

Stars: ✭ 364 (+658.33%)

Mutual labels: crawler, spider, crawling

Spider Flow

A next-generation crawler platform that defines crawl workflows graphically, so crawlers can be built without writing code.

Stars: ✭ 365 (+660.42%)

Mutual labels: crawler, spider, web-crawler

Crawlab

Distributed web crawler admin platform for managing spiders, regardless of language or framework.

Stars: ✭ 8,392 (+17383.33%)

Mutual labels: crawler, spider, web-crawler

Maman

Rust Web Crawler saving pages on Redis

Stars: ✭ 39 (-18.75%)

Mutual labels: crawler, spider, web-crawler

Crawlab Lite

Lite version of Crawlab, a lightweight crawler management platform.

Stars: ✭ 122 (+154.17%)

Mutual labels: crawler, spider, web-crawler

Abot

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.

Stars: ✭ 1,961 (+3985.42%)

Mutual labels: crawler, spider, web-crawler

Linkedin Profile Scraper

🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.

Stars: ✭ 171 (+256.25%)

Mutual labels: crawler, spider, crawling

Antch

Antch, a fast, powerful and extensible web crawling & scraping framework for Go

Stars: ✭ 198 (+312.5%)

Mutual labels: crawler, web-crawler, crawling

Spidy

The simple, easy-to-use command-line web crawler.

Stars: ✭ 257 (+435.42%)

Mutual labels: crawler, web-crawler, crawling

Crawly

Crawly, a high-level web crawling & scraping framework for Elixir.

Stars: ✭ 440 (+816.67%)

Mutual labels: crawler, spider, crawling

Skycaiji

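Skycaiji (Blue Sky Collector) is a free crawler for collecting and publishing data, developed with PHP and MySQL. It can be deployed on cloud servers, can collect almost any type of web page, integrates seamlessly with various CMS site builders, publishes data in real time without requiring login, and runs fully automatically without manual intervention. It is a fully cross-platform, cloud-based crawler system for large-scale web data collection.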

Stars: ✭ 1,514 (+3054.17%)

Mutual labels: crawler, spider, crawling

Pspider

A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

Stars: ✭ 1,611 (+3256.25%)

Mutual labels: crawler, spider, web-crawler

Zhihu Crawler People

A simple distributed crawler for Zhihu, with data analysis

Stars: ✭ 182 (+279.17%)

Mutual labels: crawler, spider, web-crawler

Colly

Elegant Scraper and Crawler Framework for Golang

Stars: ✭ 15,535 (+32264.58%)

Mutual labels: crawler, spider, crawling

talospider

talospider - A simple, lightweight scraping micro-framework

Stars: ✭ 57 (+18.75%)

Mutual labels: spider, crawling

Chromium for spider

Dynamic crawler for web vulnerability scanners

Stars: ✭ 220 (+358.33%)

Mutual labels: crawler, spider


flink-crawler

A continuous scalable web crawler built on top of Flink and crawler-commons, with bits of code borrowed from bixo.

The primary goals of flink-crawler are:

Continuous, meaning pages are always being fetched. This avoids the inefficiencies of a batch-oriented crawler such as Bixo or Nutch, where the time spent processing the "crawl frontier" (aka CrawlDB) in each loop grows until it dominates the total crawl time.
Scalable, meaning the crawler should work for small crawls of 100K pages up to big crawls that fetch billions of pages and track 100B+ links.
Focused, meaning the crawler can be tuned to focus on pages and domains with the highest value, thus improving the efficiency of the crawl.
Simple, meaning operationally it should be easy to set up and run a crawl, without requiring additional infrastructure beyond what's needed for Flink.
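
To make the "continuous" and "focused" goals above concrete, here is a minimal sketch of a crawl frontier that always hands out the highest-value unseen URL and accepts newly discovered links at any time, so there is no separate batch pass over the whole frontier. This is an illustration only, not flink-crawler's implementation: the class name FocusedFrontier, its methods, and the score values are all hypothetical, and it uses plain Java collections rather than Flink state.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

/** Hypothetical illustration of a focused, continuously updated frontier. */
class FocusedFrontier {

    /** A URL paired with an estimated value score (higher is fetched sooner). */
    private static final class ScoredUrl {
        final String url;
        final double score;

        ScoredUrl(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    // Highest score first, so the crawl focuses on the most valuable pages.
    private final PriorityQueue<ScoredUrl> queue = new PriorityQueue<>(
            Comparator.comparingDouble((ScoredUrl s) -> s.score).reversed());

    // URLs already queued or fetched; prevents re-adding the same link.
    private final Set<String> seen = new HashSet<>();

    /** Queues a URL unless it was seen before; returns true if it was queued. */
    boolean add(String url, double score) {
        if (!seen.add(url)) {
            return false;
        }
        queue.add(new ScoredUrl(url, score));
        return true;
    }

    /** Removes and returns the highest-scoring URL, or null if none is queued. */
    String next() {
        ScoredUrl top = queue.poll();
        return top == null ? null : top.url;
    }

    boolean isEmpty() {
        return queue.isEmpty();
    }
}
```

A fetch loop would call next(), fetch the page, and call add() for each outlink as it is found, so fetching and frontier updates interleave instead of alternating in batch phases. An in-memory queue like this would not reach the billions-of-pages scale named in the goals; it only shows the ordering and deduplication idea.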

See the Key Design Decisions page for more details.
