
ScaleUnlimited / flink-crawler

License: Apache-2.0
Continuous scalable web crawler built on top of Flink and crawler-commons

Programming Languages

Java

Projects that are alternatives to, or similar to, flink-crawler

Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (+477.08%)
Mutual labels:  crawler, spider, web-crawler, crawling
Spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Stars: ✭ 656 (+1266.67%)
Mutual labels:  crawler, spider, web-crawler
Awesome Crawler
A collection of awesome web crawlers and spiders in different languages
Stars: ✭ 4,793 (+9885.42%)
Mutual labels:  crawler, spider, web-crawler
Arachnid
Powerful web scraping framework for Crystal
Stars: ✭ 68 (+41.67%)
Mutual labels:  crawler, spider, crawling
Webster
A reliable high-level web crawling & scraping framework for Node.js.
Stars: ✭ 364 (+658.33%)
Mutual labels:  crawler, spider, crawling
Spider Flow
A next-generation crawler platform that defines crawl flows graphically, so crawlers can be built without writing code.
Stars: ✭ 365 (+660.42%)
Mutual labels:  crawler, spider, web-crawler
Crawlab
Distributed web crawler admin platform for managing spiders, regardless of language or framework.
Stars: ✭ 8,392 (+17383.33%)
Mutual labels:  crawler, spider, web-crawler
Maman
Rust Web Crawler saving pages on Redis
Stars: ✭ 39 (-18.75%)
Mutual labels:  crawler, spider, web-crawler
Crawlab Lite
Lite version of Crawlab, a lightweight crawler management platform.
Stars: ✭ 122 (+154.17%)
Mutual labels:  crawler, spider, web-crawler
Abot
Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Stars: ✭ 1,961 (+3985.42%)
Mutual labels:  crawler, spider, web-crawler
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (+256.25%)
Mutual labels:  crawler, spider, crawling
Antch
Antch, a fast, powerful and extensible web crawling & scraping framework for Go
Stars: ✭ 198 (+312.5%)
Mutual labels:  crawler, web-crawler, crawling
Spidy
The simple, easy to use command line web crawler.
Stars: ✭ 257 (+435.42%)
Mutual labels:  crawler, web-crawler, crawling
Crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
Stars: ✭ 440 (+816.67%)
Mutual labels:  crawler, spider, crawling
Skycaiji
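Skycaiji (Blue Sky Collector) is free crawler software for collecting and publishing data. Built with PHP and MySQL, it can be deployed on cloud servers, can collect almost any type of web page, integrates seamlessly with various CMS site builders, publishes data in real time without logging in, and runs fully automatically with no manual intervention. It is a fully cross-platform, cloud-based crawler system for large-scale web data collection.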
Stars: ✭ 1,514 (+3054.17%)
Mutual labels:  crawler, spider, crawling
Pspider
A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
Stars: ✭ 1,611 (+3256.25%)
Mutual labels:  crawler, spider, web-crawler
Zhihu Crawler People
A simple distributed crawler for zhihu && data analysis
Stars: ✭ 182 (+279.17%)
Mutual labels:  crawler, spider, web-crawler
Colly
Elegant Scraper and Crawler Framework for Golang
Stars: ✭ 15,535 (+32264.58%)
Mutual labels:  crawler, spider, crawling
talospider
talospider - A simple, lightweight scraping micro-framework
Stars: ✭ 57 (+18.75%)
Mutual labels:  spider, crawling
Chromium for spider
dynamic crawler for web vulnerability scanner
Stars: ✭ 220 (+358.33%)
Mutual labels:  crawler, spider

flink-crawler

A continuous scalable web crawler built on top of Flink and crawler-commons, with bits of code borrowed from bixo.

The primary goals of flink-crawler are:

  • Continuous, meaning pages are always being fetched. This avoids the inefficiencies of a batch-oriented crawler such as Bixo or Nutch, where the time spent processing the "crawl frontier" (aka CrawlDB) in each loop grows until it dominates the total crawl time.
  • Scalable, meaning the crawler should work for small crawls of 100K pages up to big crawls that fetch billions of pages and track 100B+ links.
  • Focused, meaning the crawler can be tuned to focus on pages and domains with the highest value, thus improving the efficiency of the crawl.
  • Simple, meaning operationally it should be easy to set up and run a crawl, without requiring additional infrastructure beyond what's needed for Flink.
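
To make the "continuous" and "focused" goals more concrete, here is a minimal sketch of a crawl frontier that always hands out the highest-value pending URL and accepts newly discovered links at any time, rather than in a separate batch pass. It is an illustration only and is not taken from flink-crawler: the class `FocusedFrontierSketch`, its methods, and the numeric scores are all hypothetical, and it uses plain Java collections in a single process rather than Flink.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical illustration only; none of these names come from flink-crawler.
public class FocusedFrontierSketch {

    // A URL paired with an estimated value score (higher means fetch sooner).
    static final class ScoredUrl {
        final String url;
        final double score;

        ScoredUrl(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    // Pending URLs, ordered so the highest score is polled first.
    private final PriorityQueue<ScoredUrl> queue =
            new PriorityQueue<>(Comparator.comparingDouble((ScoredUrl s) -> s.score).reversed());

    // Every URL ever offered, so each one is queued at most once.
    private final Set<String> seen = new HashSet<>();

    // Adds a URL unless it is already known; returns true if it was added.
    boolean offer(String url, double score) {
        if (!seen.add(url)) {
            return false;
        }
        queue.add(new ScoredUrl(url, score));
        return true;
    }

    // Returns the most valuable pending URL, or null if nothing is pending.
    String next() {
        ScoredUrl top = queue.poll();
        return top == null ? null : top.url;
    }

    // Drains the frontier and returns the URLs in the order they would be fetched.
    static List<String> drain(FocusedFrontierSketch frontier) {
        List<String> order = new ArrayList<>();
        for (String url = frontier.next(); url != null; url = frontier.next()) {
            order.add(url);
        }
        return order;
    }

    public static void main(String[] args) {
        FocusedFrontierSketch frontier = new FocusedFrontierSketch();
        frontier.offer("http://example.com/", 0.9);
        frontier.offer("http://example.com/about", 0.4);
        frontier.offer("http://example.org/", 0.7);
        frontier.offer("http://example.com/", 0.1); // duplicate URL, ignored
        System.out.println(drain(frontier));
    }
}
```

In this sketch, calls to `offer` and `next` can be interleaved freely, which is the property a continuous crawler relies on. A crawl that tracks billions of links would need partitioned, persistent state instead of an in-memory queue and set.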

See the Key Design Decisions page for more details.
