skrapeit / Skrape.it

Licence: MIT
A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.


Projects that are alternatives to or similar to Skrape.it

Hquery.php
An extremely fast web scraper that parses megabytes of invalid HTML in the blink of an eye. PHP 5.3+, no dependencies.
Stars: ✭ 295 (+27.71%)
Mutual labels:  crawler, scraper, html-parser
page-content-tester
Paco is a Java-based framework for non-blocking and highly parallelized DOM testing.
Stars: ✭ 13 (-94.37%)
Mutual labels:  dom, test-automation, jsoup
Ferret
Declarative web scraping
Stars: ✭ 4,837 (+1993.94%)
Mutual labels:  hacktoberfest, crawler, scraper
Htmlparser2
The fast & forgiving HTML and XML parser
Stars: ✭ 3,299 (+1328.14%)
Mutual labels:  hacktoberfest, html-parser, dom
Jsoup
jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
Stars: ✭ 9,184 (+3875.76%)
Mutual labels:  parse, jsoup, dom
Instagram Crawler
Crawl instagram photos, posts and videos for download.
Stars: ✭ 178 (-22.94%)
Mutual labels:  crawler, scraper
Nosmoke
A cross platform UI crawler which scans view trees then generate and execute UI test cases.
Stars: ✭ 178 (-22.94%)
Mutual labels:  crawler, test-automation
Goribot
[Crawler/Scraper for Golang] 🕷 A lightweight, distributed-friendly Golang crawler framework.
Stars: ✭ 190 (-17.75%)
Mutual labels:  crawler, scraper
Jvppeteer
Headless Chrome for Java (a Java crawler)
Stars: ✭ 193 (-16.45%)
Mutual labels:  crawler, scraper
Grammarinator
ANTLR v4 grammar-based test generator
Stars: ✭ 162 (-29.87%)
Mutual labels:  hacktoberfest, test-automation
Gecco
Easy-to-use, lightweight web crawler
Stars: ✭ 2,310 (+900%)
Mutual labels:  crawler, jsoup
Querylist
🕷️ The elegant, progressive PHP crawler framework!
Stars: ✭ 2,392 (+935.5%)
Mutual labels:  crawler, scraper
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (-25.97%)
Mutual labels:  crawler, scraper
Preact Markup
⚡️ Render HTML5 as VDOM, with Components as Custom Elements!
Stars: ✭ 167 (-27.71%)
Mutual labels:  parse, dom
Unhtml.rs
A magic html parser
Stars: ✭ 180 (-22.08%)
Mutual labels:  scraper, html-parser
Fuzzinator
Fuzzinator Random Testing Framework
Stars: ✭ 164 (-29%)
Mutual labels:  hacktoberfest, test-automation
Openqa
openQA web-frontend, scheduler and tools.
Stars: ✭ 194 (-16.02%)
Mutual labels:  hacktoberfest, test-automation
Media Scraper
Scrapes all photos and videos in a web page / Instagram / Twitter / Tumblr / Reddit / pixiv / TikTok
Stars: ✭ 206 (-10.82%)
Mutual labels:  crawler, scraper
Tianyancha
A pip-installable scraper API for Tianyancha, the best Chinese business data and investigation platform; saves business registration data for one or more specified companies to Excel/JSON in one click.
Stars: ✭ 206 (-10.82%)
Mutual labels:  crawler, scraper
Goose Parser
Universal scraping tool which allows you to extract data using multiple environments
Stars: ✭ 211 (-8.66%)
Mutual labels:  crawler, scraper


skrape{it}

skrape{it} is a Kotlin-based HTML/XML testing and web scraping library that can be used seamlessly in Spring Boot, Ktor, Android or other Kotlin-JVM projects. Its ability to analyze and extract HTML, including client-side rendered DOM trees, as well as all other XML-related markup specifications such as SVG, UML and RSS, makes it unique. It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. First and foremost, skrape{it} aims to be a testing tool (not tied to a particular test runner), but it can also be used to scrape websites in a convenient fashion.

Features

Parsing

  • [x] Deserialization of HTML/XML from websites, local HTML files, and HTML strings into data classes / POJOs.
  • [x] Designed to deserialize HTML, but can handle any XML-related markup specification such as SVG, UML, RSS or XML itself.
  • [x] DSL to select HTML elements, as well as support for CSS query-selector syntax via string invocation.
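The selection-by-string-invocation style rests on a core Kotlin feature: an `invoke` operator defined as an extension on `String`. Below is a simplified, self-contained sketch of that mechanism; the `Selection` class and `invoke` extension are illustrative stand-ins, not skrape{it}'s actual implementation.

```kotlin
// Illustrative stand-in type — NOT a skrape{it} class.
class Selection(val selector: String) {
    val matches = mutableListOf<String>()
}

// Defining `invoke` as an operator on String lets a string literal
// open a DSL block scoped to that CSS selector.
operator fun String.invoke(init: Selection.() -> Unit): Selection =
    Selection(this).apply(init)

fun main() {
    // Reads just like skrape{it}'s `"div.foo.bar" { ... }` syntax.
    val selection = "div.foo.bar" {
        matches += "<div class=\"foo bar\">…</div>"
    }
    println(selection.selector) // prints: div.foo.bar
}
```

Because the lambda's receiver is the freshly created `Selection`, everything written inside the braces is scoped to that selector — which is what makes the DSL read so naturally.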

Http-Client

  • [x] HTTP client without verbosity or ceremony; make requests and set request options like headers, cookies, etc. via a fluent-style interface.
  • [x] Pre-configure the client regarding auth and other request settings.
  • [x] Can handle client-side rendered web pages; JavaScript execution results can optionally be reflected in the response body.

Idiomatic

  • [x] Easy-to-use, idiomatic and type-safe DSL to ensure a high level of readability.
  • [x] Built-in matchers/assertions based on infix functions to achieve a very high level of readability.
  • [x] The DSL behaves like a fluent API to make data extraction/scraping as comfortable as possible.
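The readability gain from infix matchers can be seen in a minimal, self-contained sketch. These matchers are simplified stand-ins defined here for illustration only — skrape{it}'s built-in `toBe`/`toContain` matchers are richer than this.

```kotlin
// Simplified stand-ins for skrape{it}'s built-in infix matchers.
infix fun String.toBe(expected: String): String {
    check(this == expected) { "Expected \"$expected\" but was \"$this\"" }
    return this
}

infix fun String.toContain(substring: String): String {
    check(substring in this) { "Expected \"$this\" to contain \"$substring\"" }
    return this
}

fun main() {
    // Infix call syntax reads almost like plain English:
    "welcome" toBe "welcome"
    "last p-element" toContain "p-element"
    println("all assertions passed") // prints: all assertions passed
}
```

Returning the receiver from each matcher is what allows assertions to be chained fluently, which is the same design idea the DSL builds on.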

Compatibility

  • [x] Not bound to a specific test runner or framework.
  • [x] Works with any other assertion library of your choice.

Extensions

In addition, extensions for well-known testing libraries are provided to extend them with the mentioned skrape{it} functionality. Currently available:


Quick Start

Read the Docs

You'll always find the latest documentation, release notes and examples for official releases at https://docs.skrape.it. The README you are reading right now provides examples related to the latest master branch. Use it if you can't wait for the latest changes to be released. If you don't want to read that much, or just want a rough overview of how to use skrape{it}, have a look at the Documentation by Example section.

Installation

All our official/stable releases are published to Maven Central.

Add dependency

Gradle
dependencies {
    implementation("it.skrape:skrapeit-core:1.0.0-alpha8")
}
Maven
<dependency>
    <groupId>it.skrape</groupId>
    <artifactId>skrapeit-core</artifactId>
    <version>1.0.0-alpha8</version>
</dependency>

Using bleeding-edge features before official release

We offer snapshot releases via JitPack, so you can install any commit or version you want. Be careful though: these are unofficial releases, may be unstable, and breaking changes can occur at any time.

If you want to be a bit safer, you can depend on a specific commit instead of referencing skrape{it}'s master branch to avoid sudden breaking changes. ⚠️ Please make sure to go to https://jitpack.io/#skrapeit/skrape.it/, click the Commits tab, and press the "Get it" button to force a build of that commit.

Add experimental stuff
Gradle
repositories {
    maven { url = uri("https://jitpack.io") }
}
dependencies {
    implementation("com.github.skrapeit:skrape.it:master-SNAPSHOT")
    // or use a certain commit to avoid sudden breaking changes
    implementation("com.github.skrapeit:skrape.it:<commit-hash-short>")
}
Maven
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

...

<dependency>
    <groupId>com.github.skrapeit</groupId>
    <artifactId>skrape.it</artifactId>
    <version>master-SNAPSHOT</version>
</dependency>

Documentation by Example

Parse and verify HTML from String

@Test
fun `can read and return html from String`() {
    htmlDocument("""
        <html>
            <body>
                <h1>welcome</h1>
                <div>
                    <p>first p-element</p>
                    <p class="foo">some p-element</p>
                    <p class="foo">last p-element</p>
                </div>
            </body>
        </html>""") {

        h1 {
            findFirst {
                text toBe "welcome"
            }
            p {
                withClass = "foo"
                findFirst {
                    text toBe "some p-element"
                    className  toBe "foo"
                }
            }
            p {
                findAll {
                    text toContain "p-element"
                }
                findLast {
                    text toBe "last p-element"
                }
            }
        }
    }
}

Parse HTML and extract it

data class MyDataClass(
        var httpStatusCode: Int = 0,
        var httpStatusMessage: String = "",
        var paragraph: String = "",
        var allParagraphs: List<String> = emptyList(),
        var allLinks: List<String> = emptyList()
)

class HtmlExtractionService {

    fun extract() {
        val extracted = skrape(HttpFetcher) {
            request {
                url = "http://localhost:8080/"
            }           

            extractIt<MyDataClass> {
                it.httpStatusCode = statusCode
                it.httpStatusMessage = statusMessage.toString()
                htmlDocument {
                    it.allParagraphs = p { findAll { eachText }}
                    it.paragraph = p { findFirst { text }}
                    it.allLinks = a { findAll { eachHref }}
                }
            }
        }
        print(extracted)
        // will print:
        // MyDataClass(httpStatusCode=200, httpStatusMessage=OK, paragraph=i'm a paragraph, allParagraphs=[i'm a paragraph, i'm a second paragraph], allLinks=[http://some.url, http://some-other.url])
    }
}

Testing HTML responses:

@Test
fun `dsl can skrape by url`() {
    skrape(HttpFetcher) {
        request {
            url = "http://localhost:8080/example"
        }       
        expect {
            htmlDocument {
                // all official html and html5 elements are supported by the DSL
                div {
                    withClass = "foo" and "bar" and "fizz" and "buzz"

                    findFirst {
                        text toBe "div with class foo"

                        // it's possible to search for elements from former search results
                        // ⚠️ this is only available in jitpack version for now!
                        // e.g. search all matching span elements within the above div with class foo etc...
                        span {
                            findAll {
                                // do something
                            }                       
                        }                   
                    }

                    findAll {
                        toBePresentExactlyTwice
                    }
                }
                // can handle custom tags as well
                "a-custom-tag" {
                    findFirst {
                        toBePresentExactlyOnce
                        text toBe "i'm a custom html5 tag"
                        text
                    }
                }
                // can handle custom tags written in CSS selector query syntax
                "div.foo.bar.fizz.buzz" {
                    findFirst {
                        text toBe "div with class foo"
                    }
                }

                // can handle custom tags and add selector specifics via the DSL
                "div.foo" {

                    withClass = "bar" and "fizz" and "buzz"

                    findFirst {
                        text toBe "div with class foo"
                    }
                }
            }
        }
    }
}

Scrape a client side rendered page:

fun getDocumentByUrl(urlToScrape: String) = skrape(BrowserFetcher) { // <--- pass Browser fetcher to include rendered JS
    request { url = urlToScrape }
    extract { htmlDocument { this } }
}


fun main() {
    // do stuff with the document
    println(getDocumentByUrl("https://docs.skrape.it").eachLink)
}

Configure HTTP-Client:

class ExampleTest {
    val myPreConfiguredClient = skrape(HttpFetcher) {
        // url can be a plain url as string or built with urlBuilder
        request {
            url = urlBuilder {
                protocol = UrlBuilder.Protocol.HTTPS
                host = "skrape.it"
                port = 12345
                path = "/foo"
                queryParam = mapOf("foo" to "bar")
            }
            timeout = 5000 // optional -> defaults to 5000ms
            followRedirects = true // optional -> defaults to true
            userAgent = "some custom user agent" // optional -> defaults to "Mozilla/5.0 skrape.it"
            cookies = mapOf("some-cookie-name" to "some-value") // optional
            headers = mapOf("some-custom-header" to "some-value") // optional
        }
        preConfigured
    }
    
    @Test
    fun `can use preconfigured client`() {
    
        myPreConfiguredClient.expect {
            status { code toBe 200 }
            // do more stuff
        }
    
        // slightly modify preconfigured client
        myPreConfiguredClient.apply {
            request {
                followRedirects = false
            }
        }.expect {
            status { code toBe 301 }
            // do more stuff
        }
    }
}

Get in touch

If you need help, have questions about how to use skrape{it}, or want to discuss features or bugs, please raise an issue on GitHub. For more general questions, possible implementations, or feature ideas, join the #skrape-it channel on the Kotlin Slack.

  • Issues: You can discuss and raise issues on GitHub.
  • Slack: Join the #skrape-it channel on the Kotlin Slack.
  • Twitter: Follow @skrape_it on Twitter for updates and release notifications.
  • Stack Overflow: post or search questions on Stack Overflow.

💖 Support the project

Skrape{it} is and always will be free and open source. I try to reply to everyone needing help using it. Obviously, development and maintenance take time.

However, if you are using this project and are happy with it, just want to encourage me to continue creating stuff, or want to fund the caffeine and pizzas that fuel its development, there are a few ways you can do it:

  • Starring and sharing the project 🚀 to help make it more popular
  • Giving proper credit when you use skrape{it}, tell your friends and others about it 😃
  • Sponsor Skrape{it} with a one-time donation via PayPal by clicking this button → Donate, or use the GitHub Sponsors program to support on a monthly basis 💖