sihaelov / Harser
License: MIT
An easy way to parse HTML and build XPath expressions
Stars: ✭ 135
Programming Languages
python
Projects that are alternatives to or similar to Harser
Fuzi
A fast & lightweight XML & HTML parser in Swift with XPath & CSS support
Stars: ✭ 894 (+562.22%)
Mutual labels: parser, xpath, html-parser
Meeseeks
An Elixir library for parsing and extracting data from HTML and XML with CSS or XPath selectors.
Stars: ✭ 252 (+86.67%)
Mutual labels: parser, xpath
Posthtml
PostHTML is a tool to transform HTML/XML with JS plugins
Stars: ✭ 2,737 (+1927.41%)
Mutual labels: parser, html-parser
Htmlquery
htmlquery is golang XPath package for HTML query.
Stars: ✭ 338 (+150.37%)
Mutual labels: xpath, html-parser
Jsoupxpath
A pure-Java HTML parser that supports the W3C XPath 1.0 standard syntax, based on Jsoup and Antlr4. Maybe it is the best in Java, ha ha. Just try it.
Stars: ✭ 331 (+145.19%)
Mutual labels: xpath, html-parser
Html Agility Pack
Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
Stars: ✭ 2,014 (+1391.85%)
Mutual labels: xpath, html-parser
Oga
Read-only mirror of https://gitlab.com/yorickpeterse/oga
Stars: ✭ 1,147 (+749.63%)
Mutual labels: parser, html-parser
Internettools
XPath/XQuery 3.1 interpreter for Pascal with compatibility modes for XPath 2.0/XQuery 1.0/3.0, custom and JSONiq extensions, XML/HTML parsers and classes for HTTP/S requests
Stars: ✭ 82 (-39.26%)
Mutual labels: parser, xpath
Hquery.php
An extremely fast web scraper that parses megabytes of invalid HTML in a blink of an eye. PHP5.3+, no dependencies.
Stars: ✭ 295 (+118.52%)
Mutual labels: parser, html-parser
Save For Offline
Android app for saving webpages for offline reading.
Stars: ✭ 114 (-15.56%)
Mutual labels: parser, html-parser
Html Parser
PHP HTML parser, similar to PHP Simple HTML DOM Parser but several times faster
Stars: ✭ 510 (+277.78%)
Mutual labels: parser, html-parser
Sax Wasm
The first streamable, fixed memory XML, HTML, and JSX parser for WebAssembly.
Stars: ✭ 89 (-34.07%)
Mutual labels: parser, html-parser
Lua Gumbo
Moved to https://gitlab.com/craigbarnes/lua-gumbo
Stars: ✭ 116 (-14.07%)
Mutual labels: parser, html-parser
Typin
Declarative framework for interactive CLI applications
Stars: ✭ 126 (-6.67%)
Mutual labels: parser
Csly
a C# embeddable lexer and parser generator (.Net core)
Stars: ✭ 129 (-4.44%)
Mutual labels: parser
Prowide Core
Model and parsers for all SWIFT MT (FIN) messages
Stars: ✭ 125 (-7.41%)
Mutual labels: parser
Babylon
PSA: moved into babel/babel as @babel/parser
Stars: ✭ 1,692 (+1153.33%)
Mutual labels: parser
Harser
Harser is a library for easily extracting data from HTML and building XPath expressions.
Installation
pip install harser
Examples
>>> from harser import Harser
>>> HTML = '''
<html><body>
<div class="header" id="id-header">
<li class="nav-item" data-nav="first-item" href="/nav1">First item</li>
<li class="nav-item" data-nav="second-item" href="/nav2">Second item</li>
<li class="nav-item" data-nav="third-item" href="/nav3">Third item</li>
</div>
<div>First layer
<h3>Lorem Ipsum</h3>
<span>Dolor sit amet</span>
</div>
<div>Second layer</div>
<div>Third layer
<span class="text">first block</span>
<span class="text">second block</span>
<span>third block</span>
</div>
<span>fourth layer</span>
<img />
<div class="footer" id="id-foobar" foobar="ab bc cde">
<h3 some-attr="hey">
<span id="foobar-span">foo ter</span>
</h3>
</div>
</body></html>
'''
>>> harser = Harser(HTML)
>>> harser.find('div', class_='header').children(class_='nav-item').find('text').extract()
# Or just
# harser.find(class_='nav-item').find('text').extract()
['First item', 'Second item', 'Third item']
>>> harser.find(class_='nav-item').get_attr('href').extract()
['/nav1', '/nav2', '/nav3']
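For comparison, here is a rough sketch of the same two extractions using only the Python standard library. It does not use Harser at all. It assumes the markup is well-formed XML (the sample above happens to be), and the name SNIPPET is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample markup. ElementTree is an XML parser, so this
# only works because the snippet is well-formed; real-world HTML often is not.
SNIPPET = '''
<html><body>
<div class="header" id="id-header">
<li class="nav-item" data-nav="first-item" href="/nav1">First item</li>
<li class="nav-item" data-nav="second-item" href="/nav2">Second item</li>
<li class="nav-item" data-nav="third-item" href="/nav3">Third item</li>
</div>
</body></html>
'''

root = ET.fromstring(SNIPPET)

# ElementTree supports a small XPath subset; [@attr='value'] is part of it.
nav_items = root.findall(".//*[@class='nav-item']")

texts = [item.text for item in nav_items]        # text content of each match
hrefs = [item.get('href') for item in nav_items]  # value of the href attribute

print(texts)  # ['First item', 'Second item', 'Third item']
print(hrefs)  # ['/nav1', '/nav2', '/nav3']
```

The chained find(...).get_attr(...).extract() calls express the same select-then-extract steps in a single expression.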
# The following two calls are equivalent:
>>> harser.find('div', class_='header', id='id-header')
>>> harser.find('div', attrs={'class': 'header', 'id': 'id-header'})
>>> harser.find(id__contains='bar').get_attr('class').extract()
['footer']
>>> harser.find(href__not_contains='2').find('text').extract()
['First item', 'Third item']
>>> harser.find(attrs={'data-nav__contains': 'second'}).next_siblings().find('text').extract()
['Third item']
>>> harser.find('li').parent().next_siblings(filters={'text__contains': 'Second'}).clean_extract()
['<div>Second layer</div>']
>>> harser.find('h3', filters={'[email protected]__starts_with': 'foo'}).get_attr('some-attr').extract()
['hey']
>>> harser.find('div').children('h3').xpath
'//descendant::div/h3'
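To give a feel for how keyword filters such as id__contains='bar' might map onto XPath, here is a minimal sketch. It is not Harser's implementation: build_predicate, build_xpath and SUFFIX_TEMPLATES are hypothetical names, and the generated expressions are a guess based on the suffixes and the '//descendant::' prefix shown in the examples above.

```python
# Hypothetical mapping from filter suffixes to XPath 1.0 predicate templates.
# The suffix names mirror the examples above; the XPath they expand to is an
# assumption, not output captured from Harser.
SUFFIX_TEMPLATES = {
    'contains': 'contains(@{name}, "{value}")',
    # Require the attribute to exist, in line with the example above where
    # href__not_contains='2' returned only elements that have an href.
    'not_contains': '@{name} and not(contains(@{name}, "{value}"))',
    'starts_with': 'starts-with(@{name}, "{value}")',
}


def build_predicate(key, value):
    """Turn one filter, e.g. id__contains='bar', into an XPath predicate body.

    Values containing double quotes are not escaped in this sketch.
    """
    name, sep, suffix = key.partition('__')
    name = name.rstrip('_')  # class_ -> class, since `class` is a Python keyword
    if not sep:
        # No suffix: plain equality test on the attribute.
        return '@{}="{}"'.format(name, value)
    return SUFFIX_TEMPLATES[suffix].format(name=name, value=value)


def build_xpath(tag='*', **filters):
    """Build an expression that matches `tag` anywhere in the document."""
    predicates = ''.join(
        '[{}]'.format(build_predicate(key, value))
        for key, value in filters.items()
    )
    return '//descendant::{}{}'.format(tag, predicates)


print(build_xpath('div', class_='header'))
print(build_xpath(id__contains='bar'))
print(build_xpath(href__not_contains='2'))
```

Attribute names that are not valid Python identifiers, such as data-nav, can be passed by unpacking a dict, for example build_xpath(**{'data-nav__contains': 'second'}), much like the attrs={...} form in the examples.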
Support the project
Please contact Michael Sinov if you want to support the Harser project.