msoap / html2data

License: MIT
Library and cli for extracting data from HTML via CSS selectors

Programming Languages: Go, Roff, Makefile, HTML

html2data

Library and cli-utility for extracting data from HTML via CSS selectors

Install

Install package and command line utility:

go install github.com/msoap/html2data/cmd/html2data@latest

Install package only:

go get -u github.com/msoap/html2data

Methods

  • FromReader(io.Reader) - create document for parsing
  • FromURL(URL, [config URLCfg]) - create document from http(s) URL
  • FromFile(file) - create document from local file
  • doc.GetData(css map[string]string) - get texts by CSS selectors
  • doc.GetDataFirst(css map[string]string) - get texts by CSS selectors, taking the first entry for each selector or ""
  • doc.GetDataNested(outerCss string, css map[string]string) - extract nested data by CSS selectors inside the elements matched by an outer CSS selector
  • doc.GetDataNestedFirst(outerCss string, css map[string]string) - same as GetDataNested, taking the first entry for each selector or ""
  • doc.GetDataSingle(css string) - get one result by one CSS selector

or with config:

  • doc.GetData(css map[string]string, html2data.Cfg{DontTrimSpaces: true})
  • doc.GetDataNested(outerCss string, css map[string]string, html2data.Cfg{DontTrimSpaces: true})
  • doc.GetDataSingle(css string, html2data.Cfg{DontTrimSpaces: true})

Pseudo-selectors

  • :attr(attr_name) - getting attribute instead of text, for example getting urls from links: a:attr(href)
  • :html - getting HTML instead of text
  • :get(N) - getting n-th element from list
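
To make the pseudo-selector syntax concrete, here is a minimal stdlib-only sketch that splits a selector such as a:attr(href) into its plain CSS part and a trailing pseudo-selector. It is not html2data's own parser: the function name splitPseudo and its return values are hypothetical, and the sketch assumes at most one pseudo-selector at the end of the string.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// argRe matches a trailing ":attr(name)" or ":get(N)" suffix.
var argRe = regexp.MustCompile(`^(.+):(attr|get)\(([^)]+)\)$`)

// splitPseudo is a hypothetical helper (not part of html2data).
// It returns the plain CSS selector, the pseudo-selector kind
// ("attr", "get", "html" or "") and its argument, if any.
func splitPseudo(selector string) (css, kind, arg string) {
	selector = strings.TrimSpace(selector)
	if strings.HasSuffix(selector, ":html") {
		return strings.TrimSuffix(selector, ":html"), "html", ""
	}
	if m := argRe.FindStringSubmatch(selector); m != nil {
		return m[1], m[2], m[3]
	}
	// No recognised pseudo-selector: treat the whole string as CSS.
	return selector, "", ""
}

func main() {
	for _, s := range []string{"a:attr(href)", "div.body:html", "li:get(2)", "title"} {
		css, kind, arg := splitPseudo(s)
		fmt.Printf("%q -> css=%q kind=%q arg=%q\n", s, css, kind, arg)
	}
}
```

Standard CSS pseudo-classes such as a:not(.x) are left untouched by this sketch, because only the three suffixes listed above are recognised.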

Example

package main

import (
    "fmt"
    "log"

    "github.com/msoap/html2data"
)

func main() {
    doc := html2data.FromURL("http://example.com")
    // or with config
    // doc := html2data.FromURL("http://example.com", html2data.URLCfg{UA: "userAgent", TimeOut: 10, DontDetectCharset: false})
    if doc.Err != nil {
        log.Fatal(doc.Err)
    }

    // get title
    title, _ := doc.GetDataSingle("title")
    fmt.Println("Title is:", title)

    title, _ = doc.GetDataSingle("title", html2data.Cfg{DontTrimSpaces: true})
    fmt.Println("Title as is, with spaces:", title)

    texts, _ := doc.GetData(map[string]string{"h1": "h1", "links": "a:attr(href)"})
    // get all H1 headers:
    if textOne, ok := texts["h1"]; ok {
        for _, text := range textOne {
            fmt.Println(text)
        }
    }
    // get all urls from links
    if links, ok := texts["links"]; ok {
        for _, text := range links {
            fmt.Println(text)
        }
    }
}

Command line utility

A Homebrew formula is available.

Usage

html2data [options] URL "css selector"
html2data [options] URL :name1 "css1" :name2 "css2"...
html2data [options] file.html "css selector"
cat file.html | html2data "css selector"
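
The named form (:name1 "css1" :name2 "css2") pairs each result name with a selector, which is the same shape as the css map[string]string argument accepted by doc.GetData. The sketch below shows one way such arguments could be turned into that map. It is an illustration only, not the tool's actual argument handling: pairSelectors is a hypothetical function, and the key "default" used for the single-selector form is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// pairSelectors is a hypothetical helper (not taken from html2data).
// It converts arguments of the form `:name1 css1 :name2 css2 ...`
// into a name -> selector map. A single bare selector is stored
// under the assumed key "default".
func pairSelectors(args []string) (map[string]string, error) {
	if len(args) == 1 && !strings.HasPrefix(args[0], ":") {
		return map[string]string{"default": args[0]}, nil
	}
	if len(args) == 0 || len(args)%2 != 0 {
		return nil, errors.New("expected pairs of :name and selector")
	}
	out := make(map[string]string, len(args)/2)
	for i := 0; i < len(args); i += 2 {
		name := args[i]
		if !strings.HasPrefix(name, ":") || len(name) < 2 {
			return nil, fmt.Errorf("expected :name, got %q", name)
		}
		out[name[1:]] = args[i+1] // strip the leading colon
	}
	return out, nil
}

func main() {
	m, err := pairSelectors([]string{":title", "h1", ":links", "a:attr(href)"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(m)
}
```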

Options

  • -user-agent="Custom UA" -- set custom user-agent
  • -find-in="outer.css.selector" -- search inside the specified elements instead of the whole document
  • -json -- get result as JSON
  • -dont-trim-spaces -- get text as is
  • -dont-detect-charset -- don't detect charset and convert text
  • -timeout=10 -- set timeout for loading the URL

Install

Download binaries from the releases page (OS X/Linux/Windows/Raspberry Pi)

Or install from Homebrew (macOS):

brew tap msoap/tools
brew install html2data
# update:
brew upgrade html2data

Using snap (Ubuntu or any Linux distribution with snap):

# install stable version:
sudo snap install html2data

# install the latest version:
sudo snap install --edge html2data

# update
sudo snap refresh html2data

From source:

go install github.com/msoap/html2data/cmd/html2data@latest

Examples

Get title of page:

html2data https://go.dev/ title

Last blog posts:

html2data https://go.dev/blog/ 'div#blogindex p.blogtitle a'

Getting RSS URL:

html2data https://go.dev/blog/ 'link[type="application/atom+xml"]:attr(href)'

More examples are available in the project wiki.