luxcem / Apifier
Licence: lgpl-3.0
Apifier is a very simple HTML parser written in Python based on CSS selectors
Projects that are alternatives of or similar to Apifier
Html Agility Pack
Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
Stars: ✭ 2,014 (+40180%)
Mutual labels: parse, html-parser
Modest
Modest is a fast HTML renderer implemented as a pure C99 library with no outside dependencies.
Stars: ✭ 572 (+11340%)
Mutual labels: html-parser, css-selector
Skrape.it
A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
Stars: ✭ 231 (+4520%)
Mutual labels: parse, html-parser
Floki
Floki is a simple HTML parser that enables search for nodes using CSS selectors.
Stars: ✭ 1,642 (+32740%)
Mutual labels: html-parser, css-selector
modest ex
Elixir library to do pipeable transformations on html strings (with CSS selectors)
Stars: ✭ 31 (+520%)
Mutual labels: css-selector, html-parser
Nginxparser
Parses nginx configuration with Pyparsing — Used in Letsencrypt
Stars: ✭ 489 (+9680%)
Mutual labels: parse
Remarkable
Markdown parser, done right. Commonmark support, extensions, syntax plugins, high speed - all in one. Gulp and metalsmith plugins available. Used by Facebook, Docusaurus and many others! Use https://github.com/breakdance/breakdance for HTML-to-markdown conversion. Use https://github.com/jonschlinkert/markdown-toc to generate a table of contents.
Stars: ✭ 5,252 (+104940%)
Mutual labels: parse
Fullstack Javascript
Source code for the Fullstack JavaScript book
Stars: ✭ 456 (+9020%)
Mutual labels: parse
Micromark
the smallest commonmark compliant markdown parser that exists; new basis for @unifiedjs (hundreds of projects w/ billions of downloads for dealing w/ content)
Stars: ✭ 793 (+15760%)
Mutual labels: parse
Surgeon
Declarative DOM extraction expression evaluator. 👨‍⚕️
Stars: ✭ 653 (+12960%)
Mutual labels: css-selector
Leasot
Parse and output TODOs and FIXMEs from comments in your files
Stars: ✭ 729 (+14480%)
Mutual labels: parse
Scrapple
A framework for creating semi-automatic web content extractors
Stars: ✭ 464 (+9180%)
Mutual labels: css-selector
Unitsnet
Makes life working with units of measurement just a little bit better.
Stars: ✭ 641 (+12720%)
Mutual labels: parse
Html Parser
PHP HTML parser, similar to PHP Simple HTML DOM Parser but several times faster
Stars: ✭ 510 (+10100%)
Mutual labels: html-parser
Parsepy
A relatively up-to-date fork of ParsePy, the Python wrapper for the Parse.com API. Originally maintained by @dgrtwo
Stars: ✭ 509 (+10080%)
Mutual labels: parse
Apifier
Apifier is a very simple HTML parser written in Python.
It aims to parse HTML documents declaratively, using CSS or XPath selectors. Its main purpose is to parse tabular and/or paginated data.
Install
Apifier is available for Python 3:
pip install apifier
Example
Getting all comments from an article on LeFigaro.fr:
from apifier import Apifier

config = {
    "name": "FigaroBot article comments",
    "encoding": "latin-1",
    "url": "http://www.lefigaro.fr/politique/le-scan/2016/07/21/25001-20160721ARTFIG00062-attentat-de-nice-la-droite-demande-une-enquete-independante.php",
    "foreach": "#fig-pagination-nav > li > a",
    "context": "page",
    "xpath": False,
    "prefix": "#reagir > div > div > div.fig-col.fig-col--comments > div:nth-child(3) > ul > li > article >",
    "description": {
        "author": "div.fig-comment-header a",
        "comment": "div.fig-comment-msg p"
    }
}

api = Apifier(config=config)
data = api.load()
Config
- name : name of the current configuration.
- encoding : the encoding used by the page. Data is converted from this encoding to UTF-8.
- url : URL of the page, or of the first page when the data is paginated.
- xpath : boolean; set to True if the selectors are XPath instead of CSS.
- next : selector for a "next" link. Apifier crawls pages by following this link until none is found.
- foreach : selector for the pagination links. In this example the pagination looks like:
    <ul id="fig-pagination-nav">
      <li class="fig-pagination-current"><a href="…"> 1 </a></li>
      <li><a href="…"> 2 </a></li>
      <li><a href="…"> 3 </a></li>
    </ul>
- context : each row of data is associated with a variable named by this option, whose value is the content of the pagination link. In this example that content is just the page number, but the same mechanism can serve other purposes, such as categories.
- prefix : selector prepended to every descriptor.
- description : descriptors for the content to parse; in this example, the comment text and the author name.
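To make the foreach and context options more concrete, the sketch below pulls the link text out of the pagination markup shown above, using only Python's standard html.parser module. It is an illustration only, not Apifier's implementation: the PaginationLinks class is a hypothetical helper, it collects every <a> element rather than evaluating the CSS selector, and the href values stay elided as in the example.

```python
from html.parser import HTMLParser

# Pagination markup from the example above (href values are elided in the source).
PAGINATION = (
    '<ul id="fig-pagination-nav">'
    '<li class="fig-pagination-current"><a href="…"> 1 </a></li>'
    '<li><a href="…"> 2 </a></li>'
    '<li><a href="…"> 3 </a></li>'
    '</ul>'
)


class PaginationLinks(HTMLParser):
    """Hypothetical helper: collects (href, text) for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


parser = PaginationLinks()
parser.feed(PAGINATION)
parser.close()

# With "context": "page", the text of each link would become the value of
# the "page" key attached to the rows parsed from that link's target.
contexts = [{"page": text} for _href, text in parser.links]
print(contexts)
```

Each dictionary here stands for the context that would be merged into every row parsed from the corresponding page.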
To use XPath selectors instead of CSS, prefix them with a $.
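The example relies on foreach; for a site that only exposes a "next" link, a configuration might look like the sketch below. Only the keys come from the option list above. The URL and every selector are hypothetical placeholders, and treating next as a replacement for foreach and context is an assumption drawn from the option descriptions, not something the README states.

```python
# Hypothetical configuration for a site paginated with a "next" link.
# Only the keys come from the option list; every value is a placeholder.
config = {
    "name": "Example next-link crawl",
    "encoding": "utf-8",
    "url": "http://example.com/articles?page=1",
    "xpath": False,                    # the selectors below are CSS
    "next": "nav.pagination a.next",   # followed until no match is found
    "prefix": "ul.articles > li >",
    "description": {
        "title": "h2 a",
        "summary": "p.summary",
    },
}

# Loading would then follow the same two calls as in the example:
# api = Apifier(config=config); data = api.load()
```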
The result is:
data = [
    {'comment': "…", 'author': '…', 'page': '1'},
    # etc.
]
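Because each row carries its context key, the result can be regrouped per page with plain Python. The sketch below uses invented sample rows in the shape shown above (the author and comment values are placeholders, not real output) and a hypothetical group_by_page helper.

```python
from collections import defaultdict

# Placeholder rows in the same shape as the result of api.load();
# the values are invented for illustration.
data = [
    {"comment": "first comment", "author": "alice", "page": "1"},
    {"comment": "second comment", "author": "bob", "page": "1"},
    {"comment": "third comment", "author": "carol", "page": "2"},
]


def group_by_page(rows, context="page"):
    """Hypothetical helper: bucket rows by the value of their context key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[context]].append(row)
    return dict(grouped)


by_page = group_by_page(data)
for page, rows in by_page.items():
    print(page, [row["author"] for row in rows])
```

The context argument defaults to "page" to match the example configuration; a configuration using another context name would pass that name instead.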