rushter / Selectolax

License: MIT
Python binding to Modest engine (fast HTML5 parser with CSS selectors).

.. image:: docs/logo.png
   :alt: selectolax logo

.. image:: https://img.shields.io/pypi/v/selectolax.svg
   :target: https://pypi.python.org/pypi/selectolax

A fast HTML5 parser with CSS selectors using the `Modest engine <https://github.com/lexborisov/Modest/>`_.

Installation

From PyPI using pip:

.. code-block:: bash

    pip install selectolax 

Development version from GitHub:

.. code-block:: bash

    git clone --recursive  https://github.com/rushter/selectolax
    cd selectolax
    pip install -r requirements_dev.txt
    python setup.py install

How to compile selectolax while developing:

.. code-block:: bash

    make clean
    make dev

Basic examples

.. code:: python

    In [1]: from selectolax.parser import HTMLParser
       ...:
       ...: html = """
       ...: <h1 id="title" data-updated="20201101">Hi there</h1>
       ...: <div class="post">Lorem Ipsum is simply dummy text of the printing and typesetting industry. </div>
       ...: <div class="post">Lorem ipsum dolor sit amet, consectetur adipiscing elit.</div>
       ...: """
       ...: tree = HTMLParser(html)

    In [2]: tree.css_first('h1#title').text()
    Out[2]: 'Hi there'

    In [3]: tree.css_first('h1#title').attributes
    Out[3]: {'id': 'title', 'data-updated': '20201101'}

    In [4]: [node.text() for node in tree.css('.post')]
    Out[4]:
    ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. ',
     'Lorem ipsum dolor sit amet, consectetur adipiscing elit.']

.. code:: python

    In [1]: html = "<div><p id=p1><p id=p2><p id=p3><a>link</a><p id=p4><p id=p5>text<p id=p6></div>"
       ...: selector = "div > :nth-child(2n+1):not(:has(a))"

    In [2]: for node in HTMLParser(html).css(selector):
       ...:     print(node.attributes, node.text(), node.tag)
       ...:     print(node.parent.tag)
       ...:     print(node.html)
       ...:
    {'id': 'p1'}  p
    div
    <p id="p1"></p>
    {'id': 'p5'} text p
    div
    <p id="p5">text</p>

  • `Detailed overview <https://github.com/rushter/selectolax/blob/master/examples/walkthrough.ipynb>`_
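
The examples above call ``.text()`` directly on the result of ``css_first``. When a selector matches nothing, ``css_first`` returns ``None``, so that call would raise ``AttributeError``. A minimal sketch of a guard follows; the helper name ``first_text`` and its ``default`` parameter are illustrative and not part of selectolax:

.. code:: python

    from selectolax.parser import HTMLParser


    def first_text(html, selector, default=""):
        """Return the stripped text of the first match, or `default` if none.

        Hypothetical helper for illustration; not part of the selectolax API.
        """
        node = HTMLParser(html).css_first(selector)
        # css_first returns None when the selector matches nothing.
        if node is None:
            return default
        return node.text(strip=True)


    html = '<h1 id="title">Hi there</h1><div class="post"> Lorem ipsum </div>'
    print(first_text(html, "h1#title"))                    # Hi there
    print(first_text(html, "div.post"))                    # Lorem ipsum
    print(first_text(html, "h2.missing", default="n/a"))   # n/a

The helper parses the document on every call to keep the sketch short. For repeated lookups it would be cheaper to build one ``HTMLParser`` tree and reuse it.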

Simple Benchmark

  • Average of 10 experiments to parse and retrieve URLs from 800 Google SERP pages.

+------------+------------+--------------+
| Package    | Time       | Memory (peak)|
+============+============+==============+
| selectolax | 2.38 sec.  | 768.11 MB    |
+------------+------------+--------------+
| lxml       | 18.67 sec. | 769.21 MB    |
+------------+------------+--------------+
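
The benchmark task, parsing pages and retrieving URLs, might look roughly like the sketch below. This is not the script that produced the numbers above: the function names, the synthetic ``pages`` list, and the timing loop are assumptions made for illustration, and peak memory is not measured.

.. code:: python

    import time

    from selectolax.parser import HTMLParser


    def extract_urls(html):
        """Collect the href value of every <a> element that has one."""
        tree = HTMLParser(html)
        return [
            node.attributes["href"]
            for node in tree.css("a")
            if node.attributes.get("href")
        ]


    def time_extraction(pages, repeats=10):
        """Average wall-clock seconds to process all pages over `repeats` runs."""
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            for page in pages:
                extract_urls(page)
            timings.append(time.perf_counter() - start)
        return sum(timings) / len(timings)


    # Hypothetical stand-in for the saved SERP pages used in the benchmark.
    pages = [
        '<div><a href="https://example.com/%d">result</a><a>no href</a></div>' % i
        for i in range(100)
    ]
    print(extract_urls(pages[0]))  # ['https://example.com/0']
    print("%.4f sec." % time_extraction(pages))

Timings from tiny synthetic pages like these say little about real-world performance; the pages should be replaced with representative HTML before drawing conclusions.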

Links

  • `selectolax API reference <http://selectolax.readthedocs.io/en/latest/parser.html>`_
  • `Detailed overview <https://github.com/rushter/selectolax/blob/master/examples/walkthrough.ipynb>`_
  • `Modest introduction <https://lexborisov.github.io/Modest/>`_
  • `Modest benchmark <http://lexborisov.github.io/benchmark-html-persers/>`_
  • `Python benchmark <https://rushter.com/blog/python-fast-html-parser/>`_
  • `Another Python benchmark <https://www.peterbe.com/plog/selectolax-or-pyquery>`_

License

  • Modest engine - `LGPL2.1 <https://github.com/lexborisov/Modest/blob/master/LICENSE>`_
  • selectolax - `MIT <https://github.com/rushter/selectolax/blob/master/LICENSE>`_