chuanconggao / Html2json
Licence: mit
Lightweight library that converts an HTML webpage to JSON data using a template defined in JSON.
Stars: ✭ 18
Programming Languages
python
Convert an HTML webpage to JSON data using a template defined in JSON.
Installation
This package is available on PyPI. Install it with `pip install -U html2json`, then import it using `from html2json import collect`.
API
The method is `collect(html, template)`. `html` is the HTML of the page loaded as a string, and `template` is the JSON of the template loaded as Python objects.
Note that the HTML must contain the root node, like `<html>...</html>` or `<div>...</div>`.
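As a minimal sketch of the call described above, the snippet below parses a template from JSON text and passes it to `collect` together with an HTML string. The names `TEMPLATE_TEXT`, `HTML`, `load_template`, and `run` are hypothetical and exist only for illustration. The import sits inside `run` so that the template-loading part can be tried without the package installed.

```python
import json

# Template text as it might be stored in a .json file (illustrative only).
# JSON null becomes Python None once the text is parsed.
TEMPLATE_TEXT = '{"Title": ["h1", null, []]}'

# The HTML must contain a root node, so the fragment is wrapped in <div>.
HTML = "<div><h1>Hello</h1></div>"


def load_template(text):
    """Parse template JSON into Python objects (dicts, lists, None)."""
    return json.loads(text)


def run(html, template_text):
    """Pass the HTML string and the parsed template to collect()."""
    # Imported here so the rest of this sketch works without the package.
    from html2json import collect
    return collect(html, load_template(template_text))


template = load_template(TEMPLATE_TEXT)
```

Calling `run(HTML, TEMPLATE_TEXT)` requires `html2json` to be installed.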
Template Syntax
- The basic syntax is `keyName: [selector, attr, [listOfRegexes]]`.
  - `selector` is a CSS selector (supported by lxml).
    - When the selector is `null`, the root node itself is matched.
    - When the selector cannot be matched, `null` is returned.
  - `attr` matches the attribute value. It can be `null` to match either the inner text or the outer text when the inner text is empty.
  - The list of regexes `[listOfRegexes]` supports two forms of regex operations. The operations within the list are executed sequentially.
    - Replacement: `s/regex/replacement/g`. `g` is optional for multiple replacements.
    - Extraction: `/regex/`.
For example:
{
    "Color": ["head link:nth-of-type(1)", "href", ["/\\w+(?=\\.css)/"]]
}
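The two regex forms can be pictured with Python's `re` module. This is a rough sketch of the described behaviour, not the library's own implementation. It assumes that patterns contain no `/`, that every replacement is written with its closing `/`, that a replacement without `g` changes only the first match, and that extraction keeps the whole first match. The helper name `apply_operations` is hypothetical.

```python
import re


def apply_operations(value, operations):
    """Apply each operation in order, feeding each result into the next."""
    for op in operations:
        if op.startswith("s/"):
            # Replacement form: s/regex/replacement/ with an optional g.
            _, pattern, replacement, flags = op.split("/")
            count = 0 if "g" in flags else 1  # 0 replaces every match
            value = re.sub(pattern, replacement, value, count=count)
        else:
            # Extraction form: /regex/ keeps only the first match.
            match = re.search(op[1:-1], value)
            if match is None:
                return None
            value = match.group(0)
    return value


# The "Color" template applied to an assumed href value.
print(apply_operations("/static/blue.css", ["/\\w+(?=\\.css)/"]))  # blue
print(apply_operations("a-b-c", ["s/-/_/"]))   # a_b-c
print(apply_operations("a-b-c", ["s/-/_/g"]))  # a_b_c
```

Because the operations run sequentially, an extraction followed by a replacement rewrites only the extracted text.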
- Because the template is JSON, nested structures can be constructed easily.
{
    "Cover": {
        "URL": [".cover img", "src", []],
        "Number of Favorites": [".cover .favorites", "value", []]
    }
}
- An alternative simplified syntax `keyName: [subRoot, subTemplate]` can be used.
  - `subRoot` is a CSS selector of the new root for each sub-entry.
  - `subTemplate` is a sub-template for each entry, recursively.
For example, the previous example can be simplified as follows.
{
    "Cover": [".cover", {
        "URL": ["img", "src", []],
        "Number of Favorites": [".favorites", "value", []]
    }]
}
- To extract a list of sub-entries following the same sub-template, the list syntax is `keyName: [[subRoot, subTemplate]]`. Please note the difference from the previous syntax: the extra surrounding `[` and `]`.
  - `subRoot` is the CSS selector of the new root for each sub-entry.
  - `subTemplate` is the sub-template for each entry, recursively.
For example:
{
    "Comments": [[".comments", {
        "From": [".from", null, []],
        "Content": [".content", null, []],
        "Photos": [["img", {
            "URL": ["", "src", []]
        }]]
    }]]
}
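The forms above differ only in the shape of the value attached to each key. As an illustrative sketch (the helper `template_form` is hypothetical and not part of the package), the shape of a parsed template value is enough to tell them apart:

```python
import json


def template_form(value):
    """Name the template form of one value, based on its shape alone."""
    if isinstance(value, dict):
        return "nested"            # {"key": ...}
    if isinstance(value, list):
        if len(value) == 3 and isinstance(value[2], list):
            return "basic"         # [selector, attr, [listOfRegexes]]
        if len(value) == 2 and isinstance(value[1], dict):
            return "sub-template"  # [subRoot, subTemplate]
        if len(value) == 1 and isinstance(value[0], list):
            return "list"          # [[subRoot, subTemplate]]
    raise ValueError(f"unrecognised template value: {value!r}")


template = json.loads("""
{
    "Comments": [[".comments", {
        "From": [".from", null, []],
        "Photos": [["img", {"URL": ["", "src", []]}]]
    }]]
}
""")

entry_template = template["Comments"][0][1]
print(template_form(template["Comments"]))       # list
print(template_form(entry_template["From"]))     # basic
print(template_form(entry_template["Photos"]))   # list
```

A check like this can catch a missing pair of brackets before the template is used, since `[subRoot, subTemplate]` and `[[subRoot, subTemplate]]` are easy to confuse.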