acrazing / html5parser

Licence: MIT license
A super tiny and fast html5 AST parser.

Projects that are alternatives of or similar to html5parser

AdvancedHTMLParser
Fast Indexed python HTML parser which builds a DOM node tree, providing common getElementsBy* functions for scraping, testing, modification, and formatting. Also XPath.
Stars: ✭ 90 (-41.18%)
Mutual labels:  dom, html-parser
rehype-dom
HTML processor to parse and compile with browser APIs, powered by plugins
Stars: ✭ 20 (-86.93%)
Mutual labels:  dom, ast
Skrape.it
A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
Stars: ✭ 231 (+50.98%)
Mutual labels:  dom, html-parser
Nativejsx
JSX to native DOM API transpilation. 💛 <div> ⟹ document.createElement('div')!
Stars: ✭ 145 (-5.23%)
Mutual labels:  dom, ast
Hyntax
Straightforward HTML parser for JavaScript
Stars: ✭ 84 (-45.1%)
Mutual labels:  dom, html-parser
Didom
Simple and fast HTML and XML parser
Stars: ✭ 1,939 (+1167.32%)
Mutual labels:  dom, html-parser
html-parser
A simple and general purpose html/xhtml parser, using Pest.
Stars: ✭ 56 (-63.4%)
Mutual labels:  dom, html-parser
Htmlparser2
The fast & forgiving HTML and XML parser
Stars: ✭ 3,299 (+2056.21%)
Mutual labels:  dom, html-parser
Lua Gumbo
Moved to https://gitlab.com/craigbarnes/lua-gumbo
Stars: ✭ 116 (-24.18%)
Mutual labels:  dom, html-parser
Minimize
Minimize HTML
Stars: ✭ 150 (-1.96%)
Mutual labels:  dom, html-parser
Lunasvg
A standalone c++ library to create, animate, manipulate and render SVG files.
Stars: ✭ 243 (+58.82%)
Mutual labels:  dom
hypercomponent
⚡ Fast and light component system, backed by hyperHTML
Stars: ✭ 45 (-70.59%)
Mutual labels:  dom
Respo
A virtual DOM library built with ClojureScript, inspired by React and Reagent.
Stars: ✭ 230 (+50.33%)
Mutual labels:  dom
Pugixml
Light-weight, simple and fast XML parser for C++ with XPath support
Stars: ✭ 2,809 (+1735.95%)
Mutual labels:  dom
hast-util-from-dom
utility to transform a DOM tree to hast
Stars: ✭ 20 (-86.93%)
Mutual labels:  dom
Lite Virtual List
Virtual list component library supporting waterfall flow based on vue
Stars: ✭ 223 (+45.75%)
Mutual labels:  dom
Angular Ru Interview Questions
Angular interview questions (in Russian)
Stars: ✭ 224 (+46.41%)
Mutual labels:  dom
InDiv
An Angular-like web MVVM framework. https://dimalilongji.github.io/InDiv
Stars: ✭ 88 (-42.48%)
Mutual labels:  dom
string-dom
Create HTML strings using JSX (or functions).
Stars: ✭ 13 (-91.5%)
Mutual labels:  dom
React Scroll Sync
Synced scroll position across multiple scrollable elements
Stars: ✭ 252 (+64.71%)
Mutual labels:  dom

html5parser

html5parser is a super fast and tiny HTML5 parser.

Highlights

  • Fast: possibly the fastest one you can find on GitHub (see Benchmark).
  • Tiny: the fully packaged bundle is smaller than 5 KB.
  • Cross platform: works in modern browsers and Node.js.
  • HTML5 only: anything not in the specification is ignored.
  • Accurate: every token can be located in the source file.

Installation

  1. Package manager

    npm i -S html5parser
    
    # or via yarn
    yarn add html5parser
  2. CDN

    <script src="https://unpkg.com/html5parser@latest/dist/html5parser.umd.js"></script>

Quick start

import { parse, walk, SyntaxKind } from 'html5parser';

const ast = parse('<!DOCTYPE html><html><head><title>Hello html5parser!</title></head></html>');

walk(ast, {
  enter: (node) => {
    if (node.type === SyntaxKind.Tag && node.name === 'title' && Array.isArray(node.body)) {
      const text = node.body[0];
      if (text.type !== SyntaxKind.Text) {
        return;
      }
      const div = document.createElement('div');
      div.innerHTML = `The title of the input is <strong>${text.value}</strong>`;
      document.body.appendChild(div);
    }
  },
});

API Reference

tokenize(input)

Low-level API that splits a string into tokens:

function tokenize(input: string): IToken[];
  • IToken

    interface IToken {
      start: number;
      end: number;
      value: string;
      type: TokenKind;
    }
  • TokenKind

    const enum TokenKind {
      Literal,
      OpenTag, // trim leading '<'
      OpenTagEnd, // trim trailing '>', can only be '/' or ''
      CloseTag, // trim leading '</' and trailing '>'
      Whitespace, // the whitespace between attributes
      AttrValueEq,
      AttrValueNq,
      AttrValueSq,
      AttrValueDq,
    }
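
Because every token carries `start` and `end`, a token can be mapped back to a line and column in the input. The following is a minimal sketch of that idea, not part of html5parser. It assumes the offsets are zero-based character indices and that `end` is exclusive; `Located`, `offsetToPosition` and `sourceOf` are hypothetical names, and the sample span is written by hand rather than produced by `tokenize`.

```typescript
// Minimal shape mirroring the `start` / `end` fields of IToken above.
interface Located {
  start: number;
  end: number;
}

interface Position {
  line: number; // one-based
  column: number; // one-based
}

// Convert a character offset into a one-based line and column.
// Assumption: offsets are zero-based indices into the input string.
function offsetToPosition(source: string, offset: number): Position {
  let line = 1;
  let lastBreak = -1; // index of the most recent '\n' before `offset`
  const limit = Math.min(offset, source.length);
  for (let i = 0; i < limit; i++) {
    if (source.charAt(i) === '\n') {
      line++;
      lastBreak = i;
    }
  }
  return { line, column: offset - lastBreak };
}

// Recover the raw source text covered by a token.
// Assumption: `end` is exclusive, as with String.prototype.slice.
function sourceOf(source: string, token: Located): string {
  return source.slice(token.start, token.end);
}

const input = '<p>\n  <b>hi</b>\n</p>';
const sample: Located = { start: 9, end: 11 }; // hand-written span covering "hi"
const pos = offsetToPosition(input, sample.start);
const raw = sourceOf(input, sample);
console.log(pos, raw);
```

The same helper should work for AST nodes, since `IBaseNode` exposes the same two fields.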

parse(input)

Core API that parses a string into an AST:

function parse(input: string, options?: ParseOptions): INode[];
  • ParseOptions

    interface ParseOptions {
      // build an attribute map for each tag:
      // if true, ITag.attributeMap is set
      // to a `Record<string, IAttribute>`
      setAttributeMap: boolean;
    }
  • INode

    export type INode = IText | ITag;
  • ITag

    export interface ITag extends IBaseNode {
      type: SyntaxKind.Tag;
      // original open tag, <Div id="id">
      open: IText;
      // lower case tag name, div
      name: string;
      // original case tag name, Div
      rawName: string;
      attributes: IAttribute[];
      // the attribute map; if `options.setAttributeMap` is `true`,
      // this is a Record whose key is the attribute name literal
      // and whose value is the attribute itself.
      attributeMap: Record<string, IAttribute> | undefined;
      body:
        | Array<ITag | IText> // with close tag
        | undefined // self closed
        | null; // EOF before open tag end
      // original close tag, </DIV >
      close:
        | IText // with close tag
        | undefined // self closed
        | null; // EOF before end or without close tag
    }
  • IAttribute

    export interface IAttribute extends IBaseNode {
      name: IText;
      value: IAttributeValue | undefined;
    }
  • IAttributeValue

    export interface IAttributeValue extends IBaseNode {
      value: string;
      quote: "'" | '"' | undefined;
    }
  • IText

    export interface IText extends IBaseNode {
      type: SyntaxKind.Text;
      value: string;
    }
  • IBaseNode

    export interface IBaseNode {
      start: number;
      end: number;
    }
  • SyntaxKind

    export enum SyntaxKind {
      Text = 'Text',
      Tag = 'Tag',
    }
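
To make the `attributeMap` shape concrete, here is a rough sketch of how such a map could be derived from `ITag.attributes`. It is not the library's implementation: `toAttributeMap` is a hypothetical helper, the interfaces are local copies of the ones listed above, the offsets in the sample data are illustrative placeholders, and the rule that a repeated name keeps the later attribute is an assumption.

```typescript
// Local copies of the shapes documented above, so the sketch is self-contained.
enum SyntaxKind {
  Text = 'Text',
  Tag = 'Tag',
}
interface IBaseNode {
  start: number;
  end: number;
}
interface IText extends IBaseNode {
  type: SyntaxKind.Text;
  value: string;
}
interface IAttributeValue extends IBaseNode {
  value: string;
  quote: "'" | '"' | undefined;
}
interface IAttribute extends IBaseNode {
  name: IText;
  value: IAttributeValue | undefined;
}

// Hypothetical helper: index attributes by their name literal.
// Assumption: when a name repeats, the later attribute replaces the earlier one.
function toAttributeMap(attributes: IAttribute[]): Record<string, IAttribute> {
  const map: Record<string, IAttribute> = Object.create(null);
  for (const attr of attributes) {
    map[attr.name.value] = attr;
  }
  return map;
}

// Hand-written attributes for something like <a href="/x" download>.
// The offsets are placeholders, not values produced by parse().
const attrs: IAttribute[] = [
  {
    start: 3,
    end: 12,
    name: { type: SyntaxKind.Text, start: 3, end: 7, value: 'href' },
    value: { start: 8, end: 12, value: '/x', quote: '"' },
  },
  {
    start: 13,
    end: 21,
    name: { type: SyntaxKind.Text, start: 13, end: 21, value: 'download' },
    value: undefined, // attribute without a value
  },
];

const attrMap = toAttributeMap(attrs);
const href = attrMap['href'];
const hrefValue = href && href.value ? href.value.value : undefined;
console.log(hrefValue);
```

With the real API, passing `{ setAttributeMap: true }` to `parse` should make a helper like this unnecessary, since each tag then carries its own `attributeMap`.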

walk(ast, options)

Visit all the nodes of the AST with the specified callbacks:

function walk(ast: INode[], options: WalkOptions): void;
  • WalkOptions

    export interface WalkOptions {
      enter?(node: INode, parent: INode | void, index: number): void;
      leave?(node: INode, parent: INode | void, index: number): void;
    }
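
The callback signatures suggest a depth-first traversal. The sketch below shows one way such a walker could be written and the order in which `enter` and `leave` would then fire. It is an illustration under assumptions, not the library's code: `walkSketch` is a hypothetical name, the node types are local copies trimmed to the fields the sketch needs, and whether the real `walk` calls `leave` for text nodes is not confirmed here.

```typescript
// Local, trimmed copies of the node shapes documented above.
enum SyntaxKind {
  Text = 'Text',
  Tag = 'Tag',
}
interface IText {
  type: SyntaxKind.Text;
  value: string;
}
interface ITag {
  type: SyntaxKind.Tag;
  name: string;
  body: Array<ITag | IText> | undefined | null;
}
type INode = IText | ITag;

interface WalkOptions {
  enter?(node: INode, parent: INode | void, index: number): void;
  leave?(node: INode, parent: INode | void, index: number): void;
}

// Hypothetical depth-first walker: enter a node, visit its body, then leave it.
function walkSketch(ast: INode[], options: WalkOptions, parent?: INode): void {
  ast.forEach((node, index) => {
    if (options.enter) {
      options.enter(node, parent, index);
    }
    // Only tags with an array body have children to descend into.
    if (node.type === SyntaxKind.Tag && Array.isArray(node.body)) {
      walkSketch(node.body, options, node);
    }
    if (options.leave) {
      options.leave(node, parent, index);
    }
  });
}

// Hand-built tree standing in for the AST of <p>Hi</p>.
const tree: INode[] = [
  {
    type: SyntaxKind.Tag,
    name: 'p',
    body: [{ type: SyntaxKind.Text, value: 'Hi' }],
  },
];

const label = (node: INode): string =>
  node.type === SyntaxKind.Tag ? node.name : node.value;

const trace: string[] = [];
walkSketch(tree, {
  enter: (node) => trace.push('enter:' + label(node)),
  leave: (node) => trace.push('leave:' + label(node)),
});
console.log(trace.join(','));
```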

safeHtml(input)

Parses the input into an AST, keeps only the tags and attributes that are on the whitelists, and prints the result back to a string.

function safeHtml(input: string, options?: Partial<SafeHtmlOptions>): string;

  • SafeHtmlOptions

    export interface SafeHtmlOptions {
      allowedTags: string[];
      allowedAttrs: string[];
      tagAllowedAttrs: Record<string, string[]>;
      allowedUrl: RegExp;
    }
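
One plausible reading of these fields is sketched below as an attribute check. This is an assumption about the semantics, not the library's actual filtering logic: `isAttrAllowed` is a hypothetical helper, the option values are invented for the example, and treating only `href` and `src` as URL-bearing attributes is a guess.

```typescript
// Local copy of the options shape documented above.
interface SafeHtmlOptions {
  allowedTags: string[];
  allowedAttrs: string[];
  tagAllowedAttrs: Record<string, string[]>;
  allowedUrl: RegExp;
}

// Hypothetical check: an attribute passes if it is allowed globally or for
// this tag and, for URL-bearing attributes, its value matches allowedUrl.
function isAttrAllowed(
  tag: string,
  attr: string,
  value: string | undefined,
  options: SafeHtmlOptions,
): boolean {
  if (options.allowedTags.indexOf(tag) === -1) {
    return false;
  }
  // Guard with hasOwnProperty so inherited keys such as "constructor" are ignored.
  const hasPerTag = Object.prototype.hasOwnProperty.call(options.tagAllowedAttrs, tag);
  const perTag = hasPerTag ? options.tagAllowedAttrs[tag] : [];
  if (options.allowedAttrs.indexOf(attr) === -1 && perTag.indexOf(attr) === -1) {
    return false;
  }
  // Assumption: only href and src carry URLs that need checking.
  if ((attr === 'href' || attr === 'src') && value !== undefined) {
    return options.allowedUrl.test(value);
  }
  return true;
}

// Example values chosen for illustration; these are not the library defaults.
const exampleOptions: SafeHtmlOptions = {
  allowedTags: ['a', 'p'],
  allowedAttrs: ['title'],
  tagAllowedAttrs: { a: ['href'] },
  allowedUrl: /^https?:\/\//i,
};

console.log(isAttrAllowed('a', 'href', 'https://example.com', exampleOptions));
console.log(isAttrAllowed('a', 'href', 'javascript:alert(1)', exampleOptions));
```

Note that `allowedUrl` should not use the `g` flag in a check like this, because a global RegExp keeps state between `test` calls.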

safeHtmlDefaultOptions

The default options of safeHtml. You can modify this object, but the change takes effect globally.

const safeHtmlDefaultOptions: SafeHtmlOptions;

Warnings

This parser targets HTML5, which means:

  1. All tags like <? ... ?> and <! ... > (except <!doctype ...>, case insensitive) are treated as comments, so a CDATA section is also treated as a comment.
  2. Special tag names:
     • "!doctype" (case insensitive): the doctype declaration
     • "!": short comment
     • "!--": normal comment
     • "" (empty string): short comment for <? ... >; the leading ? is treated as comment content
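
When processing the AST, the special tag names listed above can be separated from ordinary elements with a small helper. This is a sketch based only on those names: `classifyTagName` and `TagClass` are hypothetical and not part of html5parser.

```typescript
type TagClass = 'doctype' | 'comment' | 'element';

// Classify a tag name using the special names listed in the warnings.
// The name is lower-cased first because "!doctype" is case insensitive.
function classifyTagName(tagName: string): TagClass {
  const lower = tagName.toLowerCase();
  if (lower === '!doctype') {
    return 'doctype';
  }
  if (lower === '!' || lower === '!--' || lower === '') {
    return 'comment';
  }
  return 'element';
}

console.log(classifyTagName('!DOCTYPE'));
console.log(classifyTagName('!--'));
console.log(classifyTagName('div'));
```

A check like this is useful when collecting text content, so that comment bodies are not mistaken for visible text.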

Benchmark

Thanks to htmlparser-benchmark. I created a pull request at pulls/7, and its result on my MacBook Pro is:

$ npm test

> [email protected] test ~/htmlparser-benchmark
> node execute.js

gumbo-parser failed (exit code 1)
high5 failed (exit code 1)
html-parser        : 28.6524 ms/file ± 21.4282
html5              : 130.423 ms/file ± 161.478
html5parser        : 2.37975 ms/file ± 3.30717
htmlparser         : 16.6576 ms/file ± 109.840
htmlparser2-dom    : 3.45602 ms/file ± 5.05830
htmlparser2        : 2.61135 ms/file ± 4.33535
hubbub failed (exit code 1)
libxmljs failed (exit code 1)
neutron-html5parser: 2.89331 ms/file ± 2.94316
parse5 failed (exit code 1)
sax                : 10.2110 ms/file ± 13.5204

License

The MIT License (MIT)

Copyright (c) 2020 acrazing

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.