h12w / html-query

Licence: BSD-2-Clause License
A fluent and functional approach to querying HTML

Programming Languages

go
31211 projects - #10 most used programming language

Projects that are alternatives of or similar to html-query

Skrape.it
A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
Stars: ✭ 231 (+381.25%)
Mutual labels:  crawler, dom
Arachni
Web Application Security Scanner Framework
Stars: ✭ 2,942 (+6029.17%)
Mutual labels:  crawler, dom
Openrunner
Computest Openrunner: Benchmark and functional testing for frontend-heavy web applications
Stars: ✭ 16 (-66.67%)
Mutual labels:  dom
TumblTwo
TumblTwo, an Improved Fork of TumblOne, a Tumblr Downloader.
Stars: ✭ 57 (+18.75%)
Mutual labels:  crawler
CrawlBox
Easy way to brute-force web directory.
Stars: ✭ 118 (+145.83%)
Mutual labels:  crawler
lostark-wait-notifier
🐤️ Lost Ark wait notifier
Stars: ✭ 38 (-20.83%)
Mutual labels:  crawler
slime
🍰 A visual web crawler platform
Stars: ✭ 27 (-43.75%)
Mutual labels:  crawler
ptt-web-crawler
Crawler for the web version of PTT
Stars: ✭ 20 (-58.33%)
Mutual labels:  crawler
snapcrawl
Crawl a website and take screenshots
Stars: ✭ 37 (-22.92%)
Mutual labels:  crawler
2017 PyConTW Talk
tw.pycon.org/2017/events/talk/314386410792550475/
Stars: ✭ 18 (-62.5%)
Mutual labels:  crawler
dijnet-bot
All your bills in one more place :)
Stars: ✭ 17 (-64.58%)
Mutual labels:  crawler
Crawling-CV-Conference-Papers
Crawling CV conference papers with Python.
Stars: ✭ 32 (-33.33%)
Mutual labels:  crawler
BilibiliCrawler
🌀 Crawl bilibili user info and video info for data analysis | BiliBili crawler
Stars: ✭ 25 (-47.92%)
Mutual labels:  crawler
WebCrawler
A lightweight, fast, multi-threaded, multi-pipeline, flexibly configurable web crawler.
Stars: ✭ 39 (-18.75%)
Mutual labels:  crawler
TripAdvisor-Crawling-Suite
Fetching hotel data from TripAdvisor.
Stars: ✭ 17 (-64.58%)
Mutual labels:  crawler
bots-zoo
No description or website provided.
Stars: ✭ 59 (+22.92%)
Mutual labels:  crawler
spiderable-middleware
🤖 Prerendering for JavaScript powered websites. Great solution for PWAs (Progressive Web Apps), SPAs (Single Page Applications), and other websites based on top of front-end JavaScript frameworks
Stars: ✭ 29 (-39.58%)
Mutual labels:  crawler
WeiboCrawler
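A cookie-free Weibo crawler that can continuously crawl the profile information, posts, and post comments and reposts of one or more Sina Weibo users.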
Stars: ✭ 45 (-6.25%)
Mutual labels:  crawler
videodl
Videodl: A lightweight video downloader written by pure python.
Stars: ✭ 320 (+566.67%)
Mutual labels:  crawler
ZhengFang System Spider
🐛 A small crawler that logs in to the ZhengFang academic administration system and scrapes its data
Stars: ✭ 21 (-56.25%)
Mutual labels:  crawler

html-query: A fluent and functional approach to querying HTML DOM

html-query is a Go package that provides a fluent and functional interface for querying the HTML DOM. It is built on golang.org/x/net/html.

Examples

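  1. A simple example (under the "examples" directory)
    // get and checkError are helper functions of the example program.
    // Id and Class come from the expr package and are used unqualified here.
    r := get(`http://blog.golang.org/index`)
    defer r.Close()
    root, err := query.Parse(r)
    checkError(err)
    // Visit each child of the "content" div that has the class "blogtitle".
    root.Div(Id("content")).Children(Class("blogtitle")).For(func(item *query.Node) {
        href := item.Ahref().Href()
        date := item.Span(Class("date")).Text()
        tags := item.Span(Class("tags")).Text()
        // ......
    })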
  2. Generator of html-query (under the "gen" directory)

    A large part of html-query is generated automatically from the HTML spec. The spec is itself in HTML format, so the generator parses it using html-query.
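The first example calls two helpers, get and checkError, that are not shown above. The following is only a sketch of what such helpers might look like, using the standard library; their real definitions live in the "examples" directory and may differ. fetchLocal is a hypothetical function added here so the sketch can run without network access.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
)

// checkError aborts the program on any error (assumed behaviour).
func checkError(err error) {
	if err != nil {
		log.Fatal(err)
	}
}

// get fetches url and returns the response body; the caller must close it.
func get(url string) io.ReadCloser {
	resp, err := http.Get(url)
	checkError(err)
	return resp.Body
}

// fetchLocal (hypothetical) serves body from a throwaway local server and
// reads it back through get.
func fetchLocal(body string) string {
	srv := httptest.NewServer(http.HandlerFunc(
		func(w http.ResponseWriter, _ *http.Request) {
			fmt.Fprint(w, body)
		}))
	defer srv.Close()

	r := get(srv.URL)
	defer r.Close()
	data, err := io.ReadAll(r)
	checkError(err)
	return string(data)
}

func main() {
	fmt.Println(fetchLocal(`<div id="content"></div>`))
}
```

In a real program the reader returned by get would be passed to query.Parse, as in the example above.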

Design

Here is a simple explanation of the design of html-query.

Functional query expressions

All functional query expressions are defined in the html-query/expr package.

  1. Checker and checker composition

    A checker is a function that accepts a *html.Node and returns it only if the node satisfies the checker's condition.

    type Checker func(*html.Node) *html.Node

Here are some checker examples:

    Id("id1")
    Class("c1")
    Div
    Abbr
    H1
    H2

Checkers can be combined as boolean expressions:

    And(Id("id1"), Class("c1"))
    Or(Class("c1"), Class("c2"))
    And(Class("c1"), Not(Class("c2")))
  2. Checker builder

    A checker builder is a function that returns a checker. "Id", "Class", "And", "Or" and "Not" shown above are all checker builders. Where needed, html-query also defines some checker builder builders (functions that return a checker builder).
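To make the checker idea concrete, here is a minimal, self-contained sketch. It is not the code of the expr package: Node is a simplified stand-in for html.Node, the helper tag is hypothetical, and it assumes a checker returns nil for a non-matching node. It only illustrates how checkers and checker builders can compose.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a simplified stand-in for html.Node (assumption for this sketch).
type Node struct {
	Name string
	Attr map[string]string
}

// Checker returns its argument when the node matches, and nil otherwise.
type Checker func(*Node) *Node

// tag is a hypothetical checker builder used to define element checkers.
func tag(name string) Checker {
	return func(n *Node) *Node {
		if n != nil && n.Name == name {
			return n
		}
		return nil
	}
}

// Div is a checker (not a builder): it matches <div> elements.
var Div = tag("div")

// Id builds a checker that matches nodes whose id attribute equals id.
func Id(id string) Checker {
	return func(n *Node) *Node {
		if n != nil && n.Attr["id"] == id {
			return n
		}
		return nil
	}
}

// Class builds a checker that matches nodes carrying the given class.
func Class(class string) Checker {
	return func(n *Node) *Node {
		if n == nil {
			return nil
		}
		for _, c := range strings.Fields(n.Attr["class"]) {
			if c == class {
				return n
			}
		}
		return nil
	}
}

// And matches only when every checker matches.
func And(cs ...Checker) Checker {
	return func(n *Node) *Node {
		for _, c := range cs {
			if c(n) == nil {
				return nil
			}
		}
		return n
	}
}

// Or matches when at least one checker matches.
func Or(cs ...Checker) Checker {
	return func(n *Node) *Node {
		for _, c := range cs {
			if c(n) != nil {
				return n
			}
		}
		return nil
	}
}

// Not inverts a checker.
func Not(c Checker) Checker {
	return func(n *Node) *Node {
		if c(n) == nil {
			return n
		}
		return nil
	}
}

func main() {
	n := &Node{Name: "div", Attr: map[string]string{"id": "id1", "class": "c1 c3"}}
	fmt.Println(And(Div, Id("id1"), Class("c1"), Not(Class("c2")))(n) != nil) // true
	fmt.Println(Or(Class("c2"), Class("c4"))(n) != nil)                       // false
}
```

Because every builder returns a plain function, composite conditions are ordinary values that can be stored, passed around and combined further.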

Fluent interface

The fluent interface (http://en.wikipedia.org/wiki/Fluent_interface) is defined in the html-query package.

  1. Root node

    Function Parse returns the root node of an html document.

  2. Node finder

    Method Node.Find performs a breadth-first search (BFS) for a node that satisfies all of the given checkers, e.g.

    node.Find(Div, Class("id1"))

But usually you can write the short form:

    node.Div(Class("id1"))
  3. Attribute getter

    Method Node.Attr can be used to get the value of an attribute of a node (or a regular expression submatch of that value), e.g.

    node.Attr("Id")
    node.Attr("href", "(.*)")

But usually you can write the short form:

    node.Id()
    node.Href("(.*)")
  4. Node iterator

    Methods Node.Children and Node.Descendants each return a node iterator (NodeIter). Method NodeIter.For can be used to loop through these nodes.
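The finder, attribute getter and iterator described above can be sketched as follows. This is a simplified illustration, not the library's implementation: Node, NodeIter, Tag and matchAll are stand-ins defined here, and the return types (a plain string from Attr, a slice-backed iterator) are assumptions made to keep the sketch short.

```go
package main

import (
	"fmt"
	"regexp"
)

// Node is a simplified stand-in for the package's node type (assumption).
type Node struct {
	Name  string
	Attrs map[string]string
	Kids  []*Node
}

// Checker returns the node on a match, nil otherwise.
type Checker func(*Node) *Node

// Tag is a hypothetical checker builder matching on element name.
func Tag(name string) Checker {
	return func(n *Node) *Node {
		if n != nil && n.Name == name {
			return n
		}
		return nil
	}
}

func matchAll(n *Node, cs []Checker) bool {
	for _, c := range cs {
		if c(n) == nil {
			return false
		}
	}
	return true
}

// Find walks the subtree breadth-first and returns the first descendant
// accepted by all checkers, or nil when nothing matches.
func (n *Node) Find(cs ...Checker) *Node {
	if n == nil {
		return nil
	}
	queue := append([]*Node(nil), n.Kids...)
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		if matchAll(cur, cs) {
			return cur
		}
		queue = append(queue, cur.Kids...)
	}
	return nil
}

// Attr returns the attribute value, or the first capturing group when a
// pattern is given. A nil receiver yields "", so chained calls do not panic.
func (n *Node) Attr(key string, pattern ...string) string {
	if n == nil {
		return ""
	}
	val, ok := n.Attrs[key]
	if !ok || len(pattern) == 0 {
		return val
	}
	m := regexp.MustCompile(pattern[0]).FindStringSubmatch(val)
	if len(m) < 2 {
		return ""
	}
	return m[1]
}

// NodeIter is a hypothetical slice-backed iterator.
type NodeIter struct{ nodes []*Node }

// Children returns an iterator over the direct children that match.
func (n *Node) Children(cs ...Checker) NodeIter {
	var it NodeIter
	if n == nil {
		return it
	}
	for _, k := range n.Kids {
		if matchAll(k, cs) {
			it.nodes = append(it.nodes, k)
		}
	}
	return it
}

// For calls visit once per node in the iterator.
func (it NodeIter) For(visit func(*Node)) {
	for _, n := range it.nodes {
		visit(n)
	}
}

func main() {
	deep := &Node{Name: "a", Attrs: map[string]string{"href": "/post/1"}}
	shallow := &Node{Name: "a", Attrs: map[string]string{"href": "/post/42"}}
	root := &Node{Name: "body", Kids: []*Node{
		&Node{Name: "section", Kids: []*Node{deep}},
		shallow,
	}}

	first := root.Find(Tag("a"))                   // BFS reaches the shallower link first
	fmt.Println(first.Attr("href"))                // /post/42
	fmt.Println(first.Attr("href", `/post/(\d+)`)) // 42
	root.Children(Tag("a")).For(func(n *Node) {
		fmt.Println(n.Attr("href")) // /post/42
	})
}
```

Making the methods tolerate a nil receiver is one way to let a chain such as node.Find(...).Attr(...) fall through without an explicit nil check at every step.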

Alternative

If you prefer a jQuery-like DSL to the functional style, you might want to try goquery: https://github.com/PuerkitoBio/goquery.
