A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.

Stars: ✭ 231 (+175%)

Mutual labels: html-parser, dom

html-parser

A simple and general purpose html/xhtml parser, using Pest.

Stars: ✭ 56 (-33.33%)

Mutual labels: dom, html-parser

AdvancedHTMLParser

Fast Indexed python HTML parser which builds a DOM node tree, providing common getElementsBy* functions for scraping, testing, modification, and formatting. Also XPath.

Stars: ✭ 90 (+7.14%)

Mutual labels: dom, html-parser

html5parser

A super tiny and fast html5 AST parser.

Stars: ✭ 153 (+82.14%)

Mutual labels: dom, html-parser

Htmlparser2

The fast & forgiving HTML and XML parser

Stars: ✭ 3,299 (+3827.38%)

Mutual labels: html-parser, dom

Nito

A jQuery library for building user interfaces

Stars: ✭ 56 (-33.33%)

Mutual labels: dom

Oga

Read-only mirror of https://gitlab.com/yorickpeterse/oga

Stars: ✭ 1,147 (+1265.48%)

Mutual labels: html-parser

Browser Monkey

Reliable DOM testing

Stars: ✭ 53 (-36.9%)

Mutual labels: dom

Tokamak

SwiftUI-compatible framework for building browser apps with WebAssembly and native apps for other platforms

Stars: ✭ 1,083 (+1189.29%)

Mutual labels: dom

Anglesharp.js

👼 Extends AngleSharp with a .NET-based JavaScript engine.

Stars: ✭ 68 (-19.05%)

Mutual labels: dom

Scalajs Bootstrap

Scala.js bootstrap components

Stars: ✭ 55 (-34.52%)

Mutual labels: dom

React Faux Dom

DOM like structure that renders to React (unmaintained, archived)

Stars: ✭ 1,226 (+1359.52%)

Mutual labels: dom

Monoapp

choo architecture without a renderer

Stars: ✭ 52 (-38.1%)

Mutual labels: dom

Ng Focus On

A directive to make angular elements focusable

Stars: ✭ 51 (-39.29%)

Mutual labels: dom

Canvaskeyframes

The simplest canvas plugin for frame-sequence (sprite) animation

Stars: ✭ 83 (-1.19%)

Mutual labels: dom

Sauron

Sauron is an html web framework for building web-apps. It is heavily inspired by elm.

Stars: ✭ 1,217 (+1348.81%)

Mutual labels: dom

Web Template

web-template.js is a template engine based on parsing ES6 template strings.

Stars: ✭ 67 (-20.24%)

Mutual labels: dom

Hyntax

Straightforward HTML parser for JavaScript. Live Demo.

Simple. API is straightforward, output is clear.
Forgiving. Like a browser, it parses invalid HTML gracefully.
Supports streaming. Can process HTML while it's still being loaded.
No dependencies.

Usage
TypeScript Typings
Streaming
Tokens
AST Format
API Reference
Types Reference

Usage

npm install hyntax

const { tokenize, constructTree } = require('hyntax')
const util = require('util')

const inputHTML = `
<html>
  <body>
      <input type="text" placeholder="Don't type">
      <button>Don't press</button>
  </body>
</html>
`

const { tokens } = tokenize(inputHTML)
const { ast } = constructTree(tokens)

console.log(JSON.stringify(tokens, null, 2))
console.log(util.inspect(ast, { showHidden: false, depth: null }))

TypeScript Typings

Hyntax is written in JavaScript but ships with integrated TypeScript typings to help you navigate its data structures. There is also a Types Reference that covers the most common types.

Streaming

Use the StreamTokenizer and StreamTreeConstructor classes to parse HTML chunk by chunk while it is still being loaded from the network or read from disk.

const { StreamTokenizer, StreamTreeConstructor } = require('hyntax')
const http = require('http')
const util = require('util')

http.get('http://info.cern.ch', (res) => {
  const streamTokenizer = new StreamTokenizer()
  const streamTreeConstructor = new StreamTreeConstructor()

  let resultTokens = []
  let resultAst

  res.pipe(streamTokenizer).pipe(streamTreeConstructor)

  streamTokenizer
    .on('data', (tokens) => {
      resultTokens = resultTokens.concat(tokens)
    })
    .on('end', () => {
      console.log(JSON.stringify(resultTokens, null, 2))
    })

  streamTreeConstructor
    .on('data', (ast) => {
      resultAst = ast
    })
    .on('end', () => {
      console.log(util.inspect(resultAst, { showHidden: false, depth: null }))
    })
}).on('error', (err) => {
  throw err;
})

Tokens

The kinds of tokens Hyntax extracts from an HTML string are listed under Tokenizer.TokenTypes.AnyTokenType in the Types Reference.

Each token conforms to the Tokenizer.Token interface.

AST Format

The resulting syntax tree always has one top-level Document node, with optional child nodes nested inside it.

{
  nodeType: TreeConstructor.NodeTypes.Document,
  content: {
    children: [
      {
        nodeType: TreeConstructor.NodeTypes.AnyNodeType,
        content: {…}
      },
      {
        nodeType: TreeConstructor.NodeTypes.AnyNodeType,
        content: {…}
      }
    ]
  }
}

The content of each node is specific to the node's type. All of them are described in the AST Node Types reference.
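As a rough sketch of how a tree of this shape might be walked, the snippet below recursively visits content.children and collects the name of every node that has one. The tree is written by hand to mirror the structure above rather than produced by Hyntax. The nodeType strings ('document', 'tag', 'text') are placeholders, not confirmed library values, and collectTagNames is a hypothetical helper.

```javascript
// Hand-written tree mirroring the documented shape; NOT real Hyntax output.
// The nodeType strings are placeholders assumed for illustration only.
const sampleAst = {
  nodeType: 'document',
  content: {
    children: [
      {
        nodeType: 'tag',
        content: {
          name: 'body',
          children: [
            { nodeType: 'tag', content: { name: 'input' } },
            {
              nodeType: 'tag',
              content: {
                name: 'button',
                children: [
                  { nodeType: 'text', content: { value: { content: "Don't press" } } }
                ]
              }
            }
          ]
        }
      }
    ]
  }
}

// Hypothetical helper: depth-first walk that records every `content.name`.
function collectTagNames (node, names = []) {
  if (typeof node.content.name === 'string') {
    names.push(node.content.name)
  }
  // `children` is optional, so fall back to an empty list.
  for (const child of node.content.children || []) {
    collectTagNames(child, names)
  }
  return names
}

console.log(collectTagNames(sampleAst)) // [ 'body', 'input', 'button' ]
```

The walk relies only on the content.name and content.children fields, so it does not need to know the concrete nodeType values.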

API Reference

Tokenizer

Hyntax has its tokenizer as a separate module. You can use generated tokens on their own or pass them further to a tree constructor to build an AST.

Interface

tokenize(html: string): Tokenizer.Result

Arguments

html
HTML string to process
Required.
Type: string.

Returns Tokenizer.Result

Tree Constructor

Once you have an array of tokens, you can pass it to the tree constructor to build an AST.

Interface

constructTree(tokens: Tokenizer.AnyToken[]): TreeConstructor.Result

Arguments

tokens
Array of tokens received from the tokenizer.
Required.
Type: Tokenizer.AnyToken[]

Returns TreeConstructor.Result

Types Reference

Tokenizer.Result

interface Result {
  state: Tokenizer.State
  tokens: Tokenizer.AnyToken[]
}

state
The current state of the tokenizer. It can be persisted and passed to the next tokenizer call if the input arrives in chunks.
tokens
Array of resulting tokens.
Type: Tokenizer.AnyToken[]

TreeConstructor.Result

interface Result {
  state: State
  ast: AST
}

state
The current state of the tree constructor. It can be persisted and passed to the next tree constructor call when tokens arrive in chunks.
ast
Resulting AST.
Type: TreeConstructor.AST

Tokenizer.Token

Generic Token. Other interfaces use it to create a specific Token type.

interface Token<T extends TokenTypes.AnyTokenType> {
  type: T
  content: string
  startPosition: number
  endPosition: number
}

type
One of the Token types.
content
Piece of original HTML string which was recognized as a token.
startPosition
Index of a character in the input HTML string where the token starts.
endPosition
Index of a character in the input HTML string where the token ends.
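To illustrate how the position fields relate to the input, here is a minimal sketch that cuts a token's text back out of the original string. It assumes endPosition is inclusive (the index of the token's last character), which the description above does not state explicitly. The token is written by hand, its type string is a placeholder, and sourceOf is a hypothetical helper.

```javascript
// Assumption: endPosition is inclusive (index of the token's last character).
const html = '<b>hi</b>'

// Hand-written token in the documented shape; the type string is a placeholder.
const token = { type: 'text', content: 'hi', startPosition: 3, endPosition: 4 }

// Hypothetical helper: slice the token's source text out of the input string.
function sourceOf (input, { startPosition, endPosition }) {
  // slice() excludes its end index, hence the +1 under the inclusive assumption.
  return input.slice(startPosition, endPosition + 1)
}

console.log(sourceOf(html, token) === token.content) // true
```

If endPosition turns out to be exclusive in your version of the library, drop the + 1.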

Tokenizer.TokenTypes.AnyTokenType

Shortcut type of all possible tokens.

type AnyTokenType =
  | Text
  | OpenTagStart
  | AttributeKey
  | AttributeAssigment
  | AttributeValueWrapperStart
  | AttributeValue
  | AttributeValueWrapperEnd
  | OpenTagEnd
  | CloseTag
  | OpenTagStartScript
  | ScriptTagContent
  | OpenTagEndScript
  | CloseTagScript
  | OpenTagStartStyle
  | StyleTagContent
  | OpenTagEndStyle
  | CloseTagStyle
  | DoctypeStart
  | DoctypeEnd
  | DoctypeAttributeWrapperStart
  | DoctypeAttribute
  | DoctypeAttributeWrapperEnd
  | CommentStart
  | CommentContent
  | CommentEnd

Tokenizer.AnyToken

Shortcut to reference any possible token.

type AnyToken = Token<TokenTypes.AnyTokenType>
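Since every token carries a type field, a flat token array can be summarised without building a tree. The sketch below counts tokens per type. The tokens are written by hand, their type strings are placeholders rather than the library's actual values, and countByType is a hypothetical helper.

```javascript
// Hand-written tokens; the `type` strings are placeholders, not library values.
const sampleTokens = [
  { type: 'open-tag-start', content: '<p', startPosition: 0, endPosition: 1 },
  { type: 'open-tag-end', content: '>', startPosition: 2, endPosition: 2 },
  { type: 'text', content: 'a', startPosition: 3, endPosition: 3 },
  { type: 'close-tag', content: '</p>', startPosition: 4, endPosition: 7 },
  { type: 'text', content: '\n', startPosition: 8, endPosition: 8 }
]

// Hypothetical helper: tally how many tokens of each type appear in the list.
function countByType (list) {
  const counts = {}
  for (const { type } of list) {
    counts[type] = (counts[type] || 0) + 1
  }
  return counts
}

console.log(countByType(sampleTokens))
```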

TreeConstructor.AST

An alias for DocumentNode. The AST always has one top-level DocumentNode. See AST Node Types.

type AST = TreeConstructor.DocumentNode

AST Node Types

There are seven possible Node types, each with its own content type.

type DocumentNode = Node<NodeTypes.Document, NodeContents.Document>

type DoctypeNode = Node<NodeTypes.Doctype, NodeContents.Doctype>

type TextNode = Node<NodeTypes.Text, NodeContents.Text>

type TagNode = Node<NodeTypes.Tag, NodeContents.Tag>

type CommentNode = Node<NodeTypes.Comment, NodeContents.Comment>

type ScriptNode = Node<NodeTypes.Script, NodeContents.Script>

type StyleNode = Node<NodeTypes.Style, NodeContents.Style>

Interfaces for each content type:

Document
Doctype
Text
Tag
Comment
Script
Style

TreeConstructor.Node

Generic Node. Other interfaces use it to create specific Nodes by providing the type of the Node and the type of the content inside it.

interface Node<T extends NodeTypes.AnyNodeType, C extends NodeContents.AnyNodeContent> {
  nodeType: T
  content: C
}

TreeConstructor.NodeTypes.AnyNodeType

Shortcut type of all possible Node types.

type AnyNodeType =
  | Document
  | Doctype
  | Tag
  | Text
  | Comment
  | Script
  | Style

Node Content Types

TreeConstructor.NodeContents.AnyNodeContent

Shortcut type of all possible types of content inside a Node.

type AnyNodeContent =
  | Document
  | Doctype
  | Text
  | Tag
  | Comment
  | Script
  | Style

TreeConstructor.NodeContents.Document

interface Document {
  children: AnyNode[]
}

TreeConstructor.NodeContents.Doctype

interface Doctype {
  start: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeStart>
  attributes?: DoctypeAttribute[]
  end: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeEnd>
}

TreeConstructor.NodeContents.Text

interface Text {
  value: Tokenizer.Token<Tokenizer.TokenTypes.Text>
}

TreeConstructor.NodeContents.Tag

interface Tag {
  name: string
  selfClosing: boolean
  openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStart>
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEnd>
  children?: AnyNode[]
  close?: Tokenizer.Token<Tokenizer.TokenTypes.CloseTag>
}

TreeConstructor.NodeContents.Comment

interface Comment {
  start: Tokenizer.Token<Tokenizer.TokenTypes.CommentStart>
  value: Tokenizer.Token<Tokenizer.TokenTypes.CommentContent>
  end: Tokenizer.Token<Tokenizer.TokenTypes.CommentEnd>
}

TreeConstructor.NodeContents.Script

interface Script {
  openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStartScript>
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEndScript>
  value: Tokenizer.Token<Tokenizer.TokenTypes.ScriptTagContent>
  close: Tokenizer.Token<Tokenizer.TokenTypes.CloseTagScript>
}

TreeConstructor.NodeContents.Style

interface Style {
  openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStartStyle>
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEndStyle>
  value: Tokenizer.Token<Tokenizer.TokenTypes.StyleTagContent>
  close: Tokenizer.Token<Tokenizer.TokenTypes.CloseTagStyle>
}

TreeConstructor.DoctypeAttribute

interface DoctypeAttribute {
  startWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttributeWrapperStart>
  value: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttribute>
  endWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttributeWrapperEnd>
}

TreeConstructor.TagAttribute

interface TagAttribute {
  key?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeKey>
  startWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValueWrapperStart>
  value?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValue>
  endWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValueWrapperEnd>
}
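Because key and value are both optional tokens, code that reads attributes has to handle missing fields. The following sketch flattens a TagAttribute list into a plain object. The attribute list is written by hand with only the content field of each token filled in, attributesToObject is a hypothetical helper, and mapping a valueless attribute to true is a design choice made here, not library behaviour.

```javascript
// Hand-written list in the TagAttribute shape; only `content` is filled in.
const sampleAttributes = [
  { key: { content: 'type' }, value: { content: 'text' } },
  { key: { content: 'disabled' } }, // attribute without a value
  { key: { content: 'placeholder' }, value: { content: "Don't type" } }
]

// Hypothetical helper: turn TagAttribute[] into a { name: value } object.
function attributesToObject (list = []) {
  const result = {}
  for (const attr of list) {
    if (!attr.key) continue // `key` is optional, so skip entries without one
    // Choice made for this sketch: valueless attributes become `true`.
    result[attr.key.content] = attr.value ? attr.value.content : true
  }
  return result
}

console.log(attributesToObject(sampleAttributes))
// { type: 'text', disabled: true, placeholder: "Don't type" }
```

The default parameter covers tags whose attributes field is absent, since attributes is optional on Tag, Script, and Style contents.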

Stars: ✭ 84

mykolaharmash / Hyntax
