All Projects → hellokaton → elves

hellokaton / elves

Licence: MIT license
🎊 Design and implement of lightweight crawler framework.

Programming Languages

java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to elves

163Music
163music spider by scrapy.
Stars: ✭ 60 (-81.37%)
Mutual labels:  spider, scrapy
Py Elasticsearch Django
基于python语言开发的千万级别搜索引擎
Stars: ✭ 207 (-35.71%)
Mutual labels:  spider, scrapy
Scrapydweb
Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI. DEMO 👉
Stars: ✭ 2,385 (+640.68%)
Mutual labels:  spider, scrapy
Fp Server
Free proxy server, continuously crawling and providing proxies, based on Tornado and Scrapy. 免费代理服务器,基于Tornado和Scrapy,在本地搭建属于自己的代理池
Stars: ✭ 154 (-52.17%)
Mutual labels:  spider, scrapy
scrapy helper
Dynamic configurable crawl (动态可配置化爬虫)
Stars: ✭ 84 (-73.91%)
Mutual labels:  spider, scrapy
Scrapingoutsourcing
ScrapingOutsourcing专注分享爬虫代码 尽量每周更新一个
Stars: ✭ 164 (-49.07%)
Mutual labels:  spider, scrapy
devsearch
A web search engine built with Python which uses TF-IDF and PageRank to sort search results.
Stars: ✭ 52 (-83.85%)
Mutual labels:  spider, scrapy
Scrapy demo
all kinds of scrapy demo
Stars: ✭ 128 (-60.25%)
Mutual labels:  spider, scrapy
Spider job
招聘网数据爬虫
Stars: ✭ 234 (-27.33%)
Mutual labels:  spider, scrapy
Spiderkeeper
admin ui for scrapy/open source scrapinghub
Stars: ✭ 2,562 (+695.65%)
Mutual labels:  spider, scrapy
Python3 Spider
Python爬虫实战 - 模拟登陆各大网站 包含但不限于:滑块验证、拼多多、美团、百度、bilibili、大众点评、淘宝,如果喜欢请start ❤️
Stars: ✭ 2,129 (+561.18%)
Mutual labels:  spider, scrapy
small-spider-project
日常爬虫
Stars: ✭ 14 (-95.65%)
Mutual labels:  spider, scrapy
Awesome Web Scraper
A collection of awesome web scaper, crawler.
Stars: ✭ 147 (-54.35%)
Mutual labels:  spider, scrapy
Marmot
💐Marmot | Web Crawler/HTTP protocol Download Package 🐭
Stars: ✭ 186 (-42.24%)
Mutual labels:  spider, scrapy
Taobaoscrapy
😩Tool For Taobao/Tmall| 儿时玩具已经过时
Stars: ✭ 146 (-54.66%)
Mutual labels:  spider, scrapy
Goribot
[Crawler/Scraper for Golang]🕷A lightweight distributed friendly Golang crawler framework.一个轻量的分布式友好的 Golang 爬虫框架。
Stars: ✭ 190 (-40.99%)
Mutual labels:  spider, scrapy
Crawlab Lite
Lite version of Crawlab. 轻量版 Crawlab 爬虫管理平台
Stars: ✭ 122 (-62.11%)
Mutual labels:  spider, scrapy
Feapder
feapder是一款支持分布式、批次采集、任务防丢、报警丰富的python爬虫框架
Stars: ✭ 110 (-65.84%)
Mutual labels:  spider, scrapy
Gerapy
Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js
Stars: ✭ 2,601 (+707.76%)
Mutual labels:  spider, scrapy
Web-Iota
Iota is a web scraper which can find all of the images and links/suburls on a webpage
Stars: ✭ 60 (-81.37%)
Mutual labels:  spider, scrapy

Elves

一个轻量级的爬虫框架设计与实现,博文分析

@biezhi on zhihu

特性

  • 事件驱动
  • 易于定制
  • 多线程执行
  • CSS 选择器和 XPath 支持

Maven 坐标

<dependency>
    <groupId>io.github.biezhi</groupId>
    <artifactId>elves</artifactId>
    <version>0.0.2</version>
</dependency>

如果你想在本地运行这个项目源码,请确保你是 Java8 环境并且安装了 lombok 插件。

架构图

调用流程图

快速上手

搭建一个爬虫程序需要进行这么几步操作

  1. 编写一个爬虫类继承自 Spider
  2. 设置要抓取的 URL 列表
  3. 实现 Spiderparse 方法
  4. 添加 Pipeline 处理 parse 过滤后的数据

举个栗子:

public class DoubanSpider extends Spider {

    public DoubanSpider(String name) {
        super(name);
        this.startUrls(
            "https://movie.douban.com/tag/爱情",
            "https://movie.douban.com/tag/喜剧",
            "https://movie.douban.com/tag/动画",
            "https://movie.douban.com/tag/动作",
            "https://movie.douban.com/tag/史诗",
            "https://movie.douban.com/tag/犯罪");
    }

    @Override
    public void onStart(Config config) {
        this.addPipeline((Pipeline<List<String>>) (item, request) -> log.info("保存到文件: {}", item));
    }

    public Result parse(Response response) {
        Result<List<String>> result   = new Result<>();
        Elements             elements = response.body().css("#content table .pl2 a");

        List<String> titles = elements.stream().map(Element::text).collect(Collectors.toList());
        result.setItem(titles);

        // 获取下一页 URL
        Elements nextEl = response.body().css("#content > div > div.article > div.paginator > span.next > a");
        if (null != nextEl && nextEl.size() > 0) {
            String  nextPageUrl = nextEl.get(0).attr("href");
            Request nextReq     = this.makeRequest(nextPageUrl, this::parse);
            result.addRequest(nextReq);
        }
        return result;
    }

}

public static void main(String[] args) {
    DoubanSpider doubanSpider = new DoubanSpider("豆瓣电影");
    Elves.me(doubanSpider, Config.me()).start();
}

爬虫例子

开源协议

MIT

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].