
spk / Maman

License: MIT
Rust Web Crawler saving pages on Redis

Programming Languages

rust

Projects that are alternatives to or similar to Maman

Crawlab Lite
Lite version of Crawlab, a lightweight crawler management platform
Stars: ✭ 122 (+212.82%)
Mutual labels:  crawler, spider, web-crawler
Pspider
A simple and easy-to-use Python crawler framework. QQ group: 597510560
Stars: ✭ 1,611 (+4030.77%)
Mutual labels:  crawler, spider, web-crawler
Crawlab
Distributed web crawler admin platform for managing spiders, regardless of language or framework
Stars: ✭ 8,392 (+21417.95%)
Mutual labels:  crawler, spider, web-crawler
Spider Flow
A new-generation crawler platform that defines crawler workflows graphically, so crawlers can be built without writing code.
Stars: ✭ 365 (+835.9%)
Mutual labels:  crawler, spider, web-crawler
flink-crawler
Continuous scalable web crawler built on top of Flink and crawler-commons
Stars: ✭ 48 (+23.08%)
Mutual labels:  crawler, spider, web-crawler
Abot
Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Stars: ✭ 1,961 (+4928.21%)
Mutual labels:  crawler, spider, web-crawler
Spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Stars: ✭ 656 (+1582.05%)
Mutual labels:  crawler, spider, web-crawler
Zhihu Crawler People
A simple distributed crawler for Zhihu and data analysis
Stars: ✭ 182 (+366.67%)
Mutual labels:  crawler, spider, web-crawler
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (+610.26%)
Mutual labels:  crawler, spider, web-crawler
Awesome Crawler
A collection of awesome web crawlers and spiders in different languages
Stars: ✭ 4,793 (+12189.74%)
Mutual labels:  crawler, spider, web-crawler
Xxl Crawler
A distributed web crawler framework (XXL-CRAWLER)
Stars: ✭ 561 (+1338.46%)
Mutual labels:  crawler, spider
Fbcrawl
A Facebook crawler
Stars: ✭ 536 (+1274.36%)
Mutual labels:  crawler, spider
Xsrfprobe
The Prime Cross Site Request Forgery (CSRF) Audit and Exploitation Toolkit.
Stars: ✭ 532 (+1264.1%)
Mutual labels:  crawler, spider
Go jobs
An overview of the Golang job market
Stars: ✭ 526 (+1248.72%)
Mutual labels:  crawler, spider
Douyin
API of DouYin for humans, used to crawl popular videos and music
Stars: ✭ 580 (+1387.18%)
Mutual labels:  crawler, spider
Netdiscovery
NetDiscovery is a general-purpose crawler framework/middleware built on frameworks such as Vert.x and RxJava 2.
Stars: ✭ 573 (+1369.23%)
Mutual labels:  crawler, spider
Newcrawler
Free web scraping tool written in Java
Stars: ✭ 589 (+1410.26%)
Mutual labels:  crawler, spider
Icrawler
A multi-threaded crawler framework with many built-in image crawlers.
Stars: ✭ 629 (+1512.82%)
Mutual labels:  crawler, spider
Haipproxy
💖 Highly available distributed IP proxy pool, powered by Scrapy and Redis
Stars: ✭ 4,993 (+12702.56%)
Mutual labels:  crawler, spider
Baiduimagespider
A super-lightweight Baidu image crawler
Stars: ✭ 591 (+1415.38%)
Mutual labels:  crawler, spider

Maman

Maman is a Rust Web Crawler saving pages on Redis.

Pages are sent to the list <MAMAN_ENV>:queue:maman using the Sidekiq job format:

{
  "class": "Maman",
  "jid": "b4a577edbccf1d805744efa9",
  "retry": true,
  "created_at": 1461789979,
  "enqueued_at": 1461789979,
  "args": {
    "document": "<html><body><a href='#' /><a href='/new' /></html>",
    "urls": ["https://example.net/new"],
    "headers": {"content-type": "text/html"},
    "url": "https://example.net/"
  }
}
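
To show how these jobs can be consumed, here is a minimal, hypothetical consumer sketch (not part of Maman). It assumes the redis and serde_json crates, MAMAN_ENV=development, and the default REDIS_URL, and pops one job from the queue with BRPOP:

// Hypothetical consumer (not part of Maman): pops one page job from the
// Redis list Maman pushes to and prints fields from the Sidekiq payload.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;

    // BRPOP blocks until a job is available (timeout 0 = wait forever).
    let (_list, payload): (String, String) = redis::cmd("BRPOP")
        .arg("development:queue:maman")
        .arg(0)
        .query(&mut con)?;

    // The page data lives under "args", as in the example above.
    let job: serde_json::Value = serde_json::from_str(&payload)?;
    println!("crawled url: {}", job["args"]["url"]);
    println!("extracted links: {}", job["args"]["urls"]);
    Ok(())
}

Since the payload follows the Sidekiq job format, a regular Sidekiq worker with class Maman could also consume the same queue directly.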

Dependencies

  • Redis

Installation

With cargo

cargo install maman

With just

PREFIX=~/.local just install

Usage

maman URL [LIMIT] [MIME_TYPES]

LIMIT must be an integer; the default is 0, meaning no limit.
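
For example, to crawl https://example.net/ and stop after 100 pages (the MIME_TYPES value here is an assumption about the expected format):

maman https://example.net/ 100 text/html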

Environment variables

Defaults

  • MAMAN_ENV=development
  • REDIS_URL="redis://127.0.0.1/"

Others

  • RUST_LOG=maman=info
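
These can be set inline when launching a crawl; for example (the Redis hostname here is hypothetical):

MAMAN_ENV=production REDIS_URL="redis://redis.example:6379/" RUST_LOG=maman=info maman https://example.net/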

LICENSE

The MIT License

Copyright (c) 2016-2020 Laurent Arnoud [email protected]


