sihaelov / Harser

License: MIT
An easy way to parse HTML and build XPath expressions

Programming Languages

python

Projects that are alternatives to or similar to Harser

Fuzi
A fast & lightweight XML & HTML parser in Swift with XPath & CSS support
Stars: ✭ 894 (+562.22%)
Mutual labels:  parser, xpath, html-parser
Meeseeks
An Elixir library for parsing and extracting data from HTML and XML with CSS or XPath selectors.
Stars: ✭ 252 (+86.67%)
Mutual labels:  parser, xpath
Posthtml
PostHTML is a tool to transform HTML/XML with JS plugins
Stars: ✭ 2,737 (+1927.41%)
Mutual labels:  parser, html-parser
Htmlquery
htmlquery is golang XPath package for HTML query.
Stars: ✭ 338 (+150.37%)
Mutual labels:  xpath, html-parser
Didom
Simple and fast HTML and XML parser
Stars: ✭ 1,939 (+1336.3%)
Mutual labels:  html-parser, xpath
Nokogiri
HTML parser for PHP
Stars: ✭ 214 (+58.52%)
Mutual labels:  xpath, html-parser
Jsoupxpath
An HTML parser implemented in pure Java that supports the W3C XPath 1.0 standard syntax, built on Jsoup and Antlr4. Maybe it is the best in Java, ha ha. Just try it.
Stars: ✭ 331 (+145.19%)
Mutual labels:  xpath, html-parser
Html Agility Pack
Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
Stars: ✭ 2,014 (+1391.85%)
Mutual labels:  xpath, html-parser
Oga
Read-only mirror of https://gitlab.com/yorickpeterse/oga
Stars: ✭ 1,147 (+749.63%)
Mutual labels:  parser, html-parser
Internettools
XPath/XQuery 3.1 interpreter for Pascal with compatibility modes for XPath 2.0/XQuery 1.0/3.0, custom and JSONiq extensions, XML/HTML parsers and classes for HTTP/S requests
Stars: ✭ 82 (-39.26%)
Mutual labels:  parser, xpath
Hquery.php
An extremely fast web scraper that parses megabytes of invalid HTML in a blink of an eye. PHP5.3+, no dependencies.
Stars: ✭ 295 (+118.52%)
Mutual labels:  parser, html-parser
Save For Offline
Android app for saving webpages for offline reading.
Stars: ✭ 114 (-15.56%)
Mutual labels:  parser, html-parser
Html Parser
PHP HTML parser, similar to PHP Simple HTML DOM Parser but several times faster
Stars: ✭ 510 (+277.78%)
Mutual labels:  parser, html-parser
Sax Wasm
The first streamable, fixed memory XML, HTML, and JSX parser for WebAssembly.
Stars: ✭ 89 (-34.07%)
Mutual labels:  parser, html-parser
Lua Gumbo
Moved to https://gitlab.com/craigbarnes/lua-gumbo
Stars: ✭ 116 (-14.07%)
Mutual labels:  parser, html-parser
Typin
Declarative framework for interactive CLI applications
Stars: ✭ 126 (-6.67%)
Mutual labels:  parser
Csly
a C# embeddable lexer and parser generator (.Net core)
Stars: ✭ 129 (-4.44%)
Mutual labels:  parser
Gofeed
Parse RSS, Atom and JSON feeds in Go
Stars: ✭ 1,762 (+1205.19%)
Mutual labels:  parser
Prowide Core
Model and parsers for all SWIFT MT (FIN) messages
Stars: ✭ 125 (-7.41%)
Mutual labels:  parser
Babylon
PSA: moved into babel/babel as @babel/parser
Stars: ✭ 1,692 (+1153.33%)
Mutual labels:  parser

Harser

Harser is a library for easily extracting data from HTML and building XPath expressions.

Installation

pip install harser

Examples

>>> from harser import Harser

>>> HTML = '''
    <html><body>
    <div class="header" id="id-header">
        <li class="nav-item" data-nav="first-item" href="/nav1">First item</li>
        <li class="nav-item" data-nav="second-item" href="/nav2">Second item</li>
        <li class="nav-item" data-nav="third-item" href="/nav3">Third item</li>
    </div>
    <div>First layer
        <h3>Lorem Ipsum</h3>
        <span>Dolor sit amet</span>
    </div>
    <div>Second layer</div>
    <div>Third layer
        <span class="text">first block</span>
        <span class="text">second block</span>
        <span>third block</span>
    </div>
    <span>fourth layer</span>
    <img />
    <div class="footer" id="id-foobar" foobar="ab bc cde">
        <h3 some-attr="hey">
            <span id="foobar-span">foo ter</span>
        </h3>
    </div>
    </body></html>
'''

>>> harser = Harser(HTML)

>>> harser.find('div', class_='header').children(class_='nav-item').find('text').extract()
# Or just
# harser.find(class_='nav-item').find('text').extract()
['First item', 'Second item', 'Third item']

>>> harser.find(class_='nav-item').get_attr('href').extract()
['/nav1', '/nav2', '/nav3']

# These two calls are equivalent
>>> harser.find('div', class_='header', id='id-header')
>>> harser.find('div', attrs={'class': 'header', 'id': 'id-header'})

>>> harser.find(id__contains='bar').get_attr('class').extract()
['footer']

>>> harser.find(href__not_contains='2').find('text').extract()
['First item', 'Third item']

>>> harser.find(attrs={'data-nav__contains': 'second'}).next_siblings().find('text').extract()
['Third item']

>>> harser.find('li').parent().next_siblings(filters={'text__contains': 'Second'}).clean_extract()
['<div>Second layer</div>']

>>> harser.find('h3', filters={'[email protected]__starts_with': 'foo'}).get_attr('some-attr').extract()
['hey']
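
The `__contains`, `__not_contains` and `__starts_with` suffixes used above read like lookups that end up as XPath predicates. The following is a minimal sketch of how such a translation could work. It is not Harser's actual implementation: the helper name `lookup_to_predicate` and the exact predicate strings are assumptions made for illustration only.

```python
def lookup_to_predicate(key: str, value: str) -> str:
    """Translate an 'attr__op' key plus a value into an XPath 1.0 predicate body.

    Hypothetical helper: it only illustrates the idea behind the lookups and
    does not claim to match what Harser generates internally.
    """
    # Split 'href__not_contains' into ('href', 'not_contains').
    attr, _, op = key.partition("__")
    # 'text' addresses the node text; anything else is treated as an attribute.
    target = "text()" if attr == "text" else f"@{attr}"
    # Sketch only: assumes the value contains no double quotes.
    literal = f'"{value}"'
    if op == "":
        return f"{target}={literal}"
    if op == "contains":
        return f"contains({target}, {literal})"
    if op == "not_contains":
        return f"not(contains({target}, {literal}))"
    if op == "starts_with":
        return f"starts-with({target}, {literal})"
    raise ValueError(f"unsupported lookup: {op!r}")


print(lookup_to_predicate("id__contains", "bar"))
print(lookup_to_predicate("href__not_contains", "2"))
print(lookup_to_predicate("text__contains", "Second"))
```

Wrapping each result in square brackets after a node test, for example `//*[contains(@id, "bar")]`, gives a complete XPath 1.0 expression; `contains`, `not` and `starts-with` are standard XPath 1.0 functions.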

>>> harser.find('div').children('h3').xpath
'//descendant::div/h3'
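
To see what an expression such as `//descendant::div/h3` selects, it can help to run a comparable query without Harser. The sketch below uses only the standard library's `xml.etree.ElementTree`, whose limited XPath subset has no `descendant::` axis, so `.//div/h3` is used as the closest spelling. The trimmed document `DOC` is an assumption: a well-formed reduction of the sample HTML above, with one extra nested `h3` added to show that only direct children of a `div` match.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed variant of the sample document (illustrative only).
DOC = """
<html><body>
  <div>First layer<h3>Lorem Ipsum</h3><span>Dolor sit amet</span></div>
  <div><span><h3>nested, not a direct child of div</h3></span></div>
  <div class="footer"><h3 some-attr="hey"><span>foo ter</span></h3></div>
</body></html>
"""

root = ET.fromstring(DOC)

# './/div' walks every descendant div; '/h3' then keeps only h3 elements
# that are direct children of those divs.
direct_h3 = root.findall(".//div/h3")
attrs = [h3.attrib for h3 in direct_h3]

print(len(direct_h3))  # number of h3 elements sitting directly under a div
print(attrs)           # their attributes, in document order
```

The `h3` wrapped in a `span` is skipped because the final step `/h3` is a child step, which is the same reason `children('h3')` produces `/h3` rather than `//h3` in the generated expression.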

Support the project

Please contact Michael Sinov if you want to support the Harser project.
