antchfx / Htmlquery

Licence: MIT
htmlquery is a Go XPath package for querying HTML documents.

Projects that are alternatives of or similar to Htmlquery

Didom
Simple and fast HTML and XML parser
Stars: ✭ 1,939 (+473.67%)
Mutual labels:  html-parser, xpath
Html Agility Pack
Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
Stars: ✭ 2,014 (+495.86%)
Mutual labels:  xpath, html-parser
Fuzi
A fast & lightweight XML & HTML parser in Swift with XPath & CSS support
Stars: ✭ 894 (+164.5%)
Mutual labels:  xpath, html-parser
Nokogiri
HTML parser for PHP
Stars: ✭ 214 (-36.69%)
Mutual labels:  xpath, html-parser
Harser
Easy way for HTML parsing and building XPath
Stars: ✭ 135 (-60.06%)
Mutual labels:  xpath, html-parser
Jsoupxpath
An HTML parser implemented in pure Java that supports the W3C XPath 1.0 standard syntax. An HTML parser with XPath, based on Jsoup and Antlr4. Maybe it is the best in Java, ha ha. Just try it.
Stars: ✭ 331 (-2.07%)
Mutual labels:  xpath, html-parser
sherpa 41
Simple browser engine.
Stars: ✭ 31 (-90.83%)
Mutual labels:  html-parser
html2any
🌀 parse and convert html string to anything
Stars: ✭ 43 (-87.28%)
Mutual labels:  html-parser
Z-Spider
Tips and examples for web crawler development
Stars: ✭ 33 (-90.24%)
Mutual labels:  xpath
DouBanReptile
Multithreaded crawler for Douban rental groups. After crawling, it automatically generates a Markdown file sorted by time.
Stars: ✭ 31 (-90.83%)
Mutual labels:  xpath
Fluentdom
A fluent api for working with XML in PHP
Stars: ✭ 327 (-3.25%)
Mutual labels:  xpath
Exist
eXist Native XML Database and Application Platform
Stars: ✭ 294 (-13.02%)
Mutual labels:  xpath
ElementFinder
Fetch data from HTML and XML via xpath/css and prepare it with regexp
Stars: ✭ 29 (-91.42%)
Mutual labels:  xpath
web-data-extractor
Extracting and parsing structured data with jQuery Selector, XPath or JsonPath from common web format like HTML, XML and JSON.
Stars: ✭ 52 (-84.62%)
Mutual labels:  xpath
spparser
an async ETL tool written in Python.
Stars: ✭ 34 (-89.94%)
Mutual labels:  xpath
codechef-rank-comparator
Web application hosted on Heroku cloud platform based on web scraping in python using lxml library (XML Path Language).
Stars: ✭ 23 (-93.2%)
Mutual labels:  xpath
Hquery.php
An extremely fast web scraper that parses megabytes of invalid HTML in a blink of an eye. PHP5.3+, no dependencies.
Stars: ✭ 295 (-12.72%)
Mutual labels:  html-parser
DAM
Syllabus and exercises for Desarrollo de Aplicaciones Multiplataforma (DAM), a multi-platform application development course
Stars: ✭ 96 (-71.6%)
Mutual labels:  xpath
XPathTools
A Visual Studio Extension which can run any XPath and XPath function; navigates through results at the click of a button. Can show and copy any XPath incl. XML namespaces, avoiding XML namespace induced headaches. Keeps track of the current XPath via the statusbar.
Stars: ✭ 40 (-88.17%)
Mutual labels:  xpath
Htmlparser2
The fast & forgiving HTML and XML parser
Stars: ✭ 3,299 (+876.04%)
Mutual labels:  html-parser

htmlquery

Overview

htmlquery is an XPath query package for HTML. It lets you extract data from HTML documents, or evaluate values against them, using XPath expressions.

htmlquery has a built-in LRU cache for query objects that stores the most recently used XPath query strings. With caching enabled, an XPath expression is not recompiled on every query.
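To illustrate the idea behind such a cache, here is a minimal sketch of an LRU cache keyed by query string. It is not htmlquery's actual implementation: the `compiled` type, `lruCache`, and the `compiles` counter are hypothetical names made up for this example, and the compile step is only simulated.

```go
package main

import (
	"container/list"
	"fmt"
)

// compiled stands in for a compiled query object (hypothetical type).
type compiled struct{ expr string }

// entry is the payload stored in each list element.
type entry struct {
	key   string
	value *compiled
}

// lruCache keeps at most capacity compiled queries and evicts the
// least recently used one when it is full. capacity must be >= 1.
type lruCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // query string -> list element
	compiles int                      // number of simulated compilations
}

func newLRUCache(capacity int) *lruCache {
	return &lruCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// get returns the cached object for expr, compiling and storing it on a miss.
func (c *lruCache) get(expr string) *compiled {
	if el, ok := c.items[expr]; ok {
		c.order.MoveToFront(el) // mark as most recently used
		return el.Value.(*entry).value
	}
	c.compiles++ // stand-in for the expensive compile step
	v := &compiled{expr: expr}
	c.items[expr] = c.order.PushFront(&entry{key: expr, value: v})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
	return v
}

func main() {
	cache := newLRUCache(2)
	cache.get("//a")
	cache.get("//a")   // hit: no recompilation
	cache.get("//img") // miss
	cache.get("//p")   // miss: evicts "//a", the least recently used
	cache.get("//a")   // miss again, because it was evicted
	fmt.Println("compiles:", cache.compiles)
}
```

Repeating the same query string costs only a map lookup, which is why a cache of this kind pays off when a small set of expressions is used over and over.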

Installation

go get github.com/antchfx/htmlquery

Getting Started

QueryAll returns all matched elements, or an error if the XPath expression is invalid.

nodes, err := htmlquery.QueryAll(doc, "//a")
if err != nil {
	panic(`not a valid XPath expression.`)
}

Load HTML document from URL.

doc, err := htmlquery.LoadURL("http://example.com/")

Load HTML document from a file.

filePath := "/home/user/sample.html"
doc, err := htmlquery.LoadDoc(filePath)

Load HTML document from string.

s := `<html>....</html>`
doc, err := htmlquery.Parse(strings.NewReader(s))

Find all A elements.

list := htmlquery.Find(doc, "//a")

Find all A elements that have an href attribute.

list := htmlquery.Find(doc, "//a[@href]")

Find all A elements with an href attribute and return only the href value.

list := htmlquery.Find(doc, "//a/@href")
for _, n := range list {
	fmt.Println(htmlquery.SelectAttr(n, "href")) // output @href value
}

Find the third A element.

a := htmlquery.FindOne(doc, "//a[3]")

Find the child IMG element under an A element and print its src attribute.

a := htmlquery.FindOne(doc, "//a")
img := htmlquery.FindOne(a, "//img")
fmt.Println(htmlquery.SelectAttr(img, "src")) // output @src value

Evaluate the number of IMG elements.

expr, _ := xpath.Compile("count(//img)")
v := expr.Evaluate(htmlquery.CreateXPathNavigator(doc)).(float64)
fmt.Printf("total count is %f", v)

FAQ

Find() vs QueryAll(), which is better?

Find and QueryAll do the same thing: both search for all matching HTML nodes. The difference is in error handling. Find panics if you pass an invalid XPath query, whereas QueryAll returns an error.

Can I save my query expression object for the next query?

Yes, you can. The QuerySelector and QuerySelectorAll methods accept a compiled query expression object.

Caching or reusing a query expression object avoids recompiling the XPath expression and improves query performance.

XPath query object cache performance

goos: windows
goarch: amd64
pkg: github.com/antchfx/htmlquery
BenchmarkSelectorCache-4                20000000                55.2 ns/op
BenchmarkDisableSelectorCache-4           500000              3162 ns/op

How to disable caching?

htmlquery.DisableSelectorCache = true

Changelogs

2019-11-19

  • Add built-in query object cache feature to avoid recompilation of the same query string. #16
  • Add LoadDoc. #18

2019-10-05

  • Add new methods, QueryAll and Query, that return an error for an invalid XPath expression instead of panicking.
  • Add QuerySelector and QuerySelectorAll methods, which support reusing your query object.

2019-02-04

  • #7 Removed deprecated FindEach() and FindEachWithBreak() methods.

2018-12-28

  • Avoid adding duplicate elements to list for Find() method. #6

Tutorial

package main

import (
	"fmt"

	"github.com/antchfx/htmlquery"
)

func main() {
	doc, err := htmlquery.LoadURL("https://www.bing.com/search?q=golang")
	if err != nil {
		panic(err)
	}
	// Find all news items.
	list, err := htmlquery.QueryAll(doc, "//ol/li")
	if err != nil {
		panic(err)
	}
	for i, n := range list {
		a := htmlquery.FindOne(n, "//a")
		if a == nil {
			continue // skip items that contain no link
		}
		fmt.Printf("%d %s(%s)\n", i, htmlquery.InnerText(a), htmlquery.SelectAttr(a, "href"))
	}
}

List of supported XPath query packages

Name       Description
htmlquery  XPath query package for HTML documents
xmlquery   XPath query package for XML documents
jsonquery  XPath query package for JSON documents

Questions

Please let me know if you have any questions.
