
luxcem / Apifier

Licence: lgpl-3.0
Apifier is a very simple HTML parser written in Python, based on CSS selectors.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Apifier

Html Agility Pack
Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
Stars: ✭ 2,014 (+40180%)
Mutual labels:  parse, html-parser
Modest
Modest is a fast HTML renderer implemented as a pure C99 library with no outside dependencies.
Stars: ✭ 572 (+11340%)
Mutual labels:  html-parser, css-selector
Skrape.it
A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
Stars: ✭ 231 (+4520%)
Mutual labels:  parse, html-parser
Floki
Floki is a simple HTML parser that enables search for nodes using CSS selectors.
Stars: ✭ 1,642 (+32740%)
Mutual labels:  html-parser, css-selector
modest ex
Elixir library to do pipeable transformations on html strings (with CSS selectors)
Stars: ✭ 31 (+520%)
Mutual labels:  css-selector, html-parser
Nginxparser
Parses nginx configuration with Pyparsing — Used in Letsencrypt
Stars: ✭ 489 (+9680%)
Mutual labels:  parse
Remarkable
Markdown parser, done right. Commonmark support, extensions, syntax plugins, high speed - all in one. Gulp and metalsmith plugins available. Used by Facebook, Docusaurus and many others! Use https://github.com/breakdance/breakdance for HTML-to-markdown conversion. Use https://github.com/jonschlinkert/markdown-toc to generate a table of contents.
Stars: ✭ 5,252 (+104940%)
Mutual labels:  parse
Pidgin
C#'s fastest parser combinator library
Stars: ✭ 469 (+9280%)
Mutual labels:  parse
Fullstack Javascript
Source code for the Fullstack JavaScript book
Stars: ✭ 456 (+9020%)
Mutual labels:  parse
Micromark
the smallest commonmark compliant markdown parser that exists; new basis for @unifiedjs (hundreds of projects w/ billions of downloads for dealing w/ content)
Stars: ✭ 793 (+15760%)
Mutual labels:  parse
Surgeon
Declarative DOM extraction expression evaluator. 👨‍⚕️
Stars: ✭ 653 (+12960%)
Mutual labels:  css-selector
Nom
Rust parser combinator framework
Stars: ✭ 5,987 (+119640%)
Mutual labels:  parse
Schm
Composable schemas for JavaScript and Node.js
Stars: ✭ 498 (+9860%)
Mutual labels:  parse
Yauaa
Yet Another UserAgent Analyzer
Stars: ✭ 472 (+9340%)
Mutual labels:  parse
Leasot
Parse and output TODOs and FIXMEs from comments in your files
Stars: ✭ 729 (+14480%)
Mutual labels:  parse
Scrapple
A framework for creating semi-automatic web content extractors
Stars: ✭ 464 (+9180%)
Mutual labels:  css-selector
Php
Parser for PHP written in Go
Stars: ✭ 516 (+10220%)
Mutual labels:  parse
Unitsnet
Makes life working with units of measurement just a little bit better.
Stars: ✭ 641 (+12720%)
Mutual labels:  parse
Html Parser
PHP HTML parser, similar to PHP Simple HTML DOM Parser but several times faster
Stars: ✭ 510 (+10100%)
Mutual labels:  html-parser
Parsepy
A relatively up-to-date fork of ParsePy, the Python wrapper for the Parse.com API. Originally maintained by @dgrtwo
Stars: ✭ 509 (+10080%)
Mutual labels:  parse

Apifier

Apifier is a very simple HTML parser written in Python.

It aims to parse HTML documents declaratively, using CSS or XPath selectors. Its main purpose is to parse tabular and/or paginated data.
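As a rough mental model of what "declarative" means here, the sketch below maps field names to selectors and applies them to every matched item. It is not Apifier's implementation: it uses the standard library's `xml.etree.ElementTree` (which only understands a small XPath subset and needs well-formed markup), and the markup, `ITEM_PATH`, `DESCRIPTION` and `extract` are all hypothetical names made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a fetched page.
PAGE = """
<html><body>
  <ul id="comments">
    <li><span class="author">alice</span><p class="msg">First!</p></li>
    <li><span class="author">bob</span><p class="msg">Nice article.</p></li>
  </ul>
</body></html>
"""

# Declarative part: where the items are, and which field lives where.
ITEM_PATH = ".//ul[@id='comments']/li"
DESCRIPTION = {
    "author": "span[@class='author']",
    "comment": "p[@class='msg']",
}


def extract(markup, item_path, description):
    """Return one dict per matched item, with one key per described field."""
    root = ET.fromstring(markup)
    rows = []
    for item in root.findall(item_path):
        row = {}
        for field, path in description.items():
            node = item.find(path)
            # Missing nodes or empty text become None rather than raising.
            row[field] = node.text.strip() if node is not None and node.text else None
        rows.append(row)
    return rows


rows = extract(PAGE, ITEM_PATH, DESCRIPTION)
print(rows)
```

The point of the design is that the extraction logic stays generic; only the selector mapping changes from one site to another.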

Install

Apifier is available for Python 3.


pip install apifier

Example

Getting all comments from an article at "LeFigaro.fr":

from apifier import Apifier

config = {
    "name": "FigaroBot article comments",
    "encoding": "latin-1",
    "url": "http://www.lefigaro.fr/politique/le-scan/2016/07/21/25001-20160721ARTFIG00062-attentat-de-nice-la-droite-demande-une-enquete-independante.php",
    "foreach": "#fig-pagination-nav > li > a",
    "context": "page",
    "xpath": False,
    "prefix": "#reagir > div > div > div.fig-col.fig-col--comments > div:nth-child(3) > ul > li > article >",
    "description": {
        "author": "div.fig-comment-header a",
        "comment": "div.fig-comment-msg p"
    }
}

api = Apifier(config=config)
data = api.load()

Config

  • name : name of the current configuration
  • encoding : the encoding the page uses; data is converted from this encoding to UTF-8
  • url : page URL, or the first page in the case of paginated data
  • xpath : boolean, set to True if the selectors are XPath instead of CSS
  • next : selector for a "next" link; Apifier crawls pages by following the next link until none is found
  • foreach : selector for the pagination links. In this example, the pagination looks like:
    <ul id="fig-pagination-nav">
      <li class="fig-pagination-current"><a href="…"> 1 </a></li>
      <li><a href="…"> 2 </a></li>
      <li><a href="…"> 3 </a></li>
    </ul>
    
  • context : each record is tagged with a special variable, named by this option, whose value is the content of the pagination link. In this example that content is just the page number, but the pagination mechanism can be used for other purposes, such as categories
  • prefix : selector prepended to every descriptor
  • description : descriptors for the content to parse; in this example, the comment content and the author name
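For sites that only expose a "next" link rather than a list of page links, a configuration might look like the sketch below. The keys are the ones listed above; the URL and every selector are hypothetical placeholders, and leaving out `foreach` and `context` when `next` is set is an assumption, not something the list above confirms.

```python
# Sketch of a "next"-driven configuration.
# The URL and all selectors are hypothetical placeholders.
config = {
    "name": "Example blog posts",
    "encoding": "utf-8",
    "url": "http://example.com/blog/page/1",
    "xpath": False,                      # selectors below are CSS
    "next": "nav.pagination a.next",     # followed until no match is found
    "prefix": "#posts > article >",      # prepended to each descriptor
    "description": {
        "title": "h2 a",
        "summary": "p.summary",
    },
}
```

Such a dictionary would be passed to `Apifier(config=config)` and loaded with `api.load()`, exactly as in the example above.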

To use XPath selectors instead of CSS, write them prefixed with a $.

The result is:

    data = [
        {'comment': "…", 'author': '…', 'page': '1'},
        # etc.
    ]
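Because every record carries its context value, the flat list can be regrouped afterwards. The sketch below groups records by page using only the standard library; the sample records and the `group_by_context` helper are hypothetical stand-ins, shaped like the result shown above.

```python
from collections import defaultdict

# Hypothetical records with the same shape as the result of api.load().
data = [
    {"comment": "First comment", "author": "alice", "page": "1"},
    {"comment": "Second comment", "author": "bob", "page": "1"},
    {"comment": "Third comment", "author": "carol", "page": "2"},
]


def group_by_context(rows, context_key="page"):
    """Group records by the value stored under the context key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[context_key]].append(row)
    return dict(grouped)


by_page = group_by_context(data)
# Page values are strings here, so sort them numerically for display.
for page, rows in sorted(by_page.items(), key=lambda kv: int(kv[0])):
    print(page, [r["author"] for r in rows])
```

If the context holds category names rather than page numbers, the numeric sort key should be dropped.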