
xtuhcy / Gecco

License: MIT
Easy-to-use, lightweight web crawler

Programming Languages

java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to Gecco

crawler
A simple and flexible web crawler framework for java.
Stars: ✭ 20 (-99.13%)
Mutual labels:  crawler, jsoup
Awesome Java Crawler
This repository collects and organizes crawler-related resources, mainly for Java development.
Stars: ✭ 228 (-90.13%)
Mutual labels:  crawler, jsoup
Crawlerpack
A Java web data crawler package
Stars: ✭ 99 (-95.71%)
Mutual labels:  crawler, jsoup
Skrape.it
A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
Stars: ✭ 231 (-90%)
Mutual labels:  crawler, jsoup
Crawlerforreader
An Android local web-novel crawler based on jsoup and XPath
Stars: ✭ 312 (-86.49%)
Mutual labels:  crawler, jsoup
Spider Flow
A next-generation crawler platform that defines crawl flows graphically, so a crawler can be built without writing code.
Stars: ✭ 365 (-84.2%)
Mutual labels:  crawler, jsoup
D4n155
OWASP D4N155 - Intelligent and dynamic wordlist using OSINT
Stars: ✭ 105 (-95.45%)
Mutual labels:  crawler, dynamic
Sakuraanime
A third-party Android client built by using jsoup to crawl part of the Sakura Anime site's content.
Stars: ✭ 177 (-92.34%)
Mutual labels:  jsoup
Lianjia Beike Spider
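A housing-price crawler for Lianjia and Beike that collects data (residential communities, second-hand homes, rentals, new homes) for 21 major Chinese cities, including Beijing, Shanghai, Guangzhou, and Shenzhen. Stable, reliable, and fast. Supports CSV, MySQL, MongoDB, Excel, and JSON storage, supports Python 2 and 3, charts the data, and is richly commented. Stars are appreciated. For learning and reference only; do not use it commercially, at your own risk.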
Stars: ✭ 2,257 (-2.29%)
Mutual labels:  crawler
Leetcode Spider
Crawl your own LeetCode solution source code with Node.js
Stars: ✭ 176 (-92.38%)
Mutual labels:  crawler
Ncov2019 data crawler
Epidemic data crawler: a 2019 novel coronavirus data repository with trajectory data, shared-travel data, and news reports
Stars: ✭ 175 (-92.42%)
Mutual labels:  crawler
N2h4
A tool for collecting Naver news
Stars: ✭ 177 (-92.34%)
Mutual labels:  crawler
Web Bee
🐝 Web vertical crawler framework for fun
Stars: ✭ 184 (-92.03%)
Mutual labels:  crawler
Aesthetic
[DEPRECATED]
Stars: ✭ 2,044 (-11.52%)
Mutual labels:  dynamic
Marmot
💐Marmot | Web Crawler/HTTP protocol Download Package 🐭
Stars: ✭ 186 (-91.95%)
Mutual labels:  crawler
Transitioner
A library for dynamic view-to-view transitions
Stars: ✭ 2,049 (-11.3%)
Mutual labels:  dynamic
Now
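Jsoup + MaterialViewPager + RxJava2 + Retrofit + Lifecycle + Realm + Fresco + Retrolambda example. An Android app of curated illustrated articles that builds its lists by scraping web pages. It currently covers MONO brunch, Zcool picks, National Geographic photo of the day, Zhihu Daily, and Douban Moment.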
Stars: ✭ 189 (-91.82%)
Mutual labels:  jsoup
Comiccrawler
An image crawler written in Python.
Stars: ✭ 185 (-91.99%)
Mutual labels:  crawler
Jsontokotlinclass
🚀 Plugin for Android Studio And IntelliJ Idea to generate Kotlin data class code from JSON text ( Json to Kotlin )
Stars: ✭ 2,438 (+5.54%)
Mutual labels:  fastjson
Crawler illegal cases in china
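A collection of legal cases in China involving web crawlers. This project gathers news, materials, laws, and regulations about lawsuits and violations involving crawler developers in mainland China. It aims to help crawler practitioners working in mainland China understand the relevant laws and avoid crossing data-compliance red lines. [AD] Chinese knowledge graph portal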
Stars: ✭ 2,448 (+5.97%)
Mutual labels:  crawler


What is Gecco

Gecco is an easy-to-use, lightweight web crawler written in Java. It integrates excellent frameworks such as jsoup, httpclient, fastjson, spring, htmlunit, and redisson, so you can write a crawler very quickly by configuring just a few jQuery-style selectors. The framework is highly extensible: it is designed around the open/closed principle, closed for modification and open for extension. Gecco is released under the permissive MIT license. Whether you are a user or a developer who wants to help improve Gecco, pull requests are welcome. If you like this crawler framework, please star or fork it!

Main features

  • Easy to use: extract elements with jQuery-style selectors
  • Supports asynchronous Ajax requests within a page
  • Supports extracting JavaScript variables from a page
  • Distributed crawling with Redis; see gecco-redis
  • Business logic can be developed with Spring; see gecco-spring
  • HtmlUnit extension support; see gecco-htmlunit
  • Supports an extension mechanism
  • Random selection of the User-Agent used for downloads
  • Random selection of the proxy server used for downloads
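
The last two features amount to picking one entry at random from a configured pool for each download. A minimal sketch of that idea in plain Java follows. The class name RandomPool and the sample User-Agent strings are hypothetical, chosen for illustration, and are not part of Gecco's API; how Gecco itself loads and selects these values is not shown here.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper (not Gecco's API): picks a random entry from a fixed pool. */
final class RandomPool<T> {
    private final List<T> items;

    RandomPool(List<T> items) {
        if (items.isEmpty()) {
            throw new IllegalArgumentException("pool must not be empty");
        }
        this.items = List.copyOf(items); // immutable defensive copy
    }

    /** Returns one entry chosen uniformly at random. */
    T pick() {
        return items.get(ThreadLocalRandom.current().nextInt(items.size()));
    }

    public static void main(String[] args) {
        // Made-up User-Agent values, for illustration only.
        RandomPool<String> userAgents = new RandomPool<>(List.of(
            "ExampleBrowser/1.0 (desktop)",
            "ExampleBrowser/1.0 (mobile)"));
        // Each simulated download gets a randomly chosen User-Agent value.
        for (int i = 0; i < 3; i++) {
            System.out.println("User-Agent: " + userAgents.pick());
        }
    }
}
```

The same helper would work for a pool of proxy addresses, since it is generic over the entry type.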

Framework overview

[Architecture diagram]

Download

Download via Maven

<dependency>
    <groupId>com.geccocrawler</groupId>
    <artifactId>gecco</artifactId>
    <version>x.x.x</version>
</dependency>


Dependencies

httpclient, jsoup, fastjson, reflections, cglib, rhino, log4j, jmxutils, commons-lang3

Quick start

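// The package must match the classpath(...) value passed to GeccoEngine below,
// so that this bean is found when the engine scans for @Gecco classes.
package com.geccocrawler.gecco.demo;

import com.geccocrawler.gecco.GeccoEngine;
import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.Html;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.RequestParameter;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HtmlBean;

@Gecco(matchUrl="https://github.com/{user}/{project}", pipelines="consolePipeline")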
public class MyGithub implements HtmlBean {

    private static final long serialVersionUID = -7127412585200687225L;

    @RequestParameter("user")
    private String user;

    @RequestParameter("project")
    private String project;

    @Text
    @HtmlField(cssPath=".pagehead-actions li:nth-child(2) .social-count")
    private String star;

    @Text
    @HtmlField(cssPath=".pagehead-actions li:nth-child(3) .social-count")
    private String fork;

    @Html
    @HtmlField(cssPath=".entry-content")
    private String readme;

    public String getReadme() {
        return readme;
    }

    public void setReadme(String readme) {
        this.readme = readme;
    }

    public String getUser() {
        return user;
    }

    public void setUser(String user) {
        this.user = user;
    }

    public String getProject() {
        return project;
    }

    public void setProject(String project) {
        this.project = project;
    }

    public String getStar() {
        return star;
    }

    public void setStar(String star) {
        this.star = star;
    }

    public String getFork() {
        return fork;
    }

    public void setFork(String fork) {
        this.fork = fork;
    }

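    public static void main(String[] args) {
        GeccoEngine.create()
            // Package to scan for @Gecco-annotated beans
            .classpath("com.geccocrawler.gecco.demo")
            // URL of the first page to crawl
            .start("https://github.com/xtuhcy/gecco")
            // Number of crawler threads
            .thread(1)
            // Interval in milliseconds between requests made by a single crawler thread
            .interval(2000)
            // Keep crawling in a loop
            .loop(true)
            // Use a desktop (not mobile) User-Agent
            .mobile(false)
            // Start the engine
            .start();
    }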
}
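
In the bean above, the {user} and {project} placeholders in matchUrl are bound to the fields annotated with @RequestParameter. The following plain-Java sketch shows one way such a URL template could be matched against a concrete URL to recover the placeholder values. It is an illustration under assumptions, not Gecco's actual matching code: the class UrlTemplate and its match method are hypothetical names, and the rule that a placeholder matches one path segment (no slash) is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch (not Gecco's implementation) of {name}-style URL matching. */
final class UrlTemplate {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    /** Returns the placeholder values if url matches template, or null otherwise. */
    static Map<String, String> match(String template, String url) {
        StringBuilder regex = new StringBuilder();
        List<String> names = new ArrayList<>();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            // Literal text between placeholders is quoted so it matches verbatim.
            regex.append(Pattern.quote(template.substring(last, m.start())));
            // Assumption: a placeholder matches exactly one path segment.
            regex.append("([^/]+)");
            names.add(m.group(1));
            last = m.end();
        }
        regex.append(Pattern.quote(template.substring(last)));

        Matcher urlMatcher = Pattern.compile(regex.toString()).matcher(url);
        if (!urlMatcher.matches()) {
            return null;
        }
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            params.put(names.get(i), urlMatcher.group(i + 1));
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = match(
            "https://github.com/{user}/{project}",
            "https://github.com/xtuhcy/gecco");
        System.out.println(params); // prints {user=xtuhcy, project=gecco}
    }
}
```

With this sketch, a URL that lacks the second path segment, such as https://github.com/xtuhcy, does not match the template and match returns null.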

DynamicGecco

DynamicGecco lets you configure crawl rules at runtime without defining a SpiderBean class. Under the hood it uses bytecode generation to create the SpiderBean dynamically, and a custom GeccoClassLoader makes hot deployment of rules possible. A simple demo follows; more complex demos can be found in the com.geccocrawler.gecco.demo.dynamic package.

The following code implements the runtime configuration of the crawl rule:

// Define and register the crawl rule at runtime; no SpiderBean class is written by hand
DynamicGecco.html()
.gecco("https://github.com/{user}/{project}", "consolePipeline")
.requestField("request").request().build()
.stringField("user").requestParameter("user").build()
.stringField("project").requestParameter().build()
.stringField("star").csspath(".pagehead-actions li:nth-child(2) .social-count").text(false).build()
.stringField("fork").csspath(".pagehead-actions li:nth-child(3) .social-count").text().build()
.stringField("contributors").csspath("ul.numbers-summary > li:nth-child(4) > a").href().build()
.register();

// Start crawling
GeccoEngine.create()
.classpath("com.geccocrawler.gecco.demo")
.start("https://github.com/xtuhcy/gecco")
.run();

As you can see, DynamicGecco needs far less code than the traditional annotation-based approach. Better still, DynamicGecco supports defining and modifying rules while the crawler is running.
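
The idea of defining and then replacing a rule while a program is running can be illustrated without any bytecode generation. The sketch below is a deliberately simplified stand-in and not Gecco's mechanism, which generates SpiderBean classes and loads them through GeccoClassLoader. RuleRegistry, its methods, and the second selector are hypothetical names and values used only for this example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch (not Gecco's API): crawl rules that can be swapped at runtime. */
final class RuleRegistry {
    // URL pattern -> (field name -> CSS selector)
    private final Map<String, Map<String, String>> rules = new ConcurrentHashMap<>();

    /** Registers a rule set, replacing any existing one for the same URL pattern. */
    void register(String urlPattern, Map<String, String> fieldSelectors) {
        rules.put(urlPattern, Map.copyOf(fieldSelectors));
    }

    /** Returns the current selectors for a URL pattern, or an empty map if none. */
    Map<String, String> lookup(String urlPattern) {
        return rules.getOrDefault(urlPattern, Map.of());
    }

    public static void main(String[] args) {
        RuleRegistry registry = new RuleRegistry();
        String pattern = "https://github.com/{user}/{project}";

        registry.register(pattern,
            Map.of("star", ".pagehead-actions li:nth-child(2) .social-count"));
        System.out.println(registry.lookup(pattern).get("star"));

        // Later, while the program is still running, the rule is replaced.
        // The new selector is made up for illustration.
        registry.register(pattern, Map.of("star", ".example-star-count"));
        System.out.println(registry.lookup(pattern).get("star"));
    }
}
```

A registry like this only swaps data. Gecco's approach, as described above, goes further by generating the bean class itself, which is why it needs a dedicated class loader to replace a rule.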

Demo

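Using the Java crawler Gecco to scrape all JD product information (Part 1)

Using the Java crawler Gecco to scrape all JD product information (Part 2)

Using the Java crawler Gecco to scrape all JD product information (Part 3)

Downloading pages with the HtmlUnit integration

Monitoring the crawler

A complete example: pagination handling, Spring integration, and storing results in MySQL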

Similar Tool Comparison

A list of similar tools and how they compare is available here:

Web Archiving Software Comparison

Contact and communication

Buy the author a coffee

Gecco's development depends on everyone's support. Scan the code to buy the author a coffee.

Alipay

License

Please comply with the terms of the MIT open source license.
