syntax-tree / hast-util-to-mdast

Licence: MIT license
utility to transform hast (HTML) to mdast (markdown)

Projects that are alternatives to or similar to hast-util-to-mdast

mdast-util-to-hast
utility to transform mdast to hast
Stars: ✭ 53 (+103.85%)
Mutual labels:  unist, mdast, hast, hast-util, mdast-util
mdast-util-to-string
utility to get the plain text content of an mdast node
Stars: ✭ 27 (+3.85%)
Mutual labels:  unist, mdast, mdast-util
hast-util-to-html
utility to serialize hast to HTML
Stars: ✭ 47 (+80.77%)
Mutual labels:  unist, hast, hast-util
hast-util-sanitize
utility to sanitize hast nodes
Stars: ✭ 34 (+30.77%)
Mutual labels:  unist, hast, hast-util
hast-util-reading-time
utility to estimate the reading time
Stars: ✭ 55 (+111.54%)
Mutual labels:  unist, hast, hast-util
hast-util-from-dom
utility to transform a DOM tree to hast
Stars: ✭ 20 (-23.08%)
Mutual labels:  unist, hast, hast-util
unist-util-map
utility to create a new tree by mapping all nodes
Stars: ✭ 30 (+15.38%)
Mutual labels:  unist, unist-util
MarkdownSyntax
☄️ A Type-safe Markdown parser in Swift.
Stars: ✭ 65 (+150%)
Mutual labels:  unist, mdast
unist-util-select
utility to select unist nodes with CSS-like selectors
Stars: ✭ 41 (+57.69%)
Mutual labels:  unist, unist-util
remark-rehype
plugin that turns markdown into HTML to support rehype
Stars: ✭ 118 (+353.85%)
Mutual labels:  mdast, hast
hast-util-select
utility to add `querySelector`, `querySelectorAll`, and `matches` support for hast
Stars: ✭ 20 (-23.08%)
Mutual labels:  hast, hast-util
unist-util-visit-parents
utility to recursively walk over unist nodes, with ancestral information
Stars: ✭ 25 (-3.85%)
Mutual labels:  unist, unist-util
unist-util-inspect
utility to inspect nodes
Stars: ✭ 16 (-38.46%)
Mutual labels:  unist, unist-util
unist-builder
utility to create new trees with a nice syntax
Stars: ✭ 52 (+100%)
Mutual labels:  unist, unist-util
Unified
☔️ interface for parsing, inspecting, transforming, and serializing content through syntax trees
Stars: ✭ 3,036 (+11576.92%)
Mutual labels:  unist
jsdast
JSDoc Abstract Syntax Tree
Stars: ✭ 20 (-23.08%)
Mutual labels:  unist
sast
Parse CSS, Sass, SCSS, and Less into a unist syntax tree
Stars: ✭ 51 (+96.15%)
Mutual labels:  unist
remark-retext
plugin to transform from remark (Markdown) to retext (natural language)
Stars: ✭ 18 (-30.77%)
Mutual labels:  mdast
xast
Extensible Abstract Syntax Tree
Stars: ✭ 32 (+23.08%)
Mutual labels:  unist
remark-slate-transformer
remark plugin to transform remark syntax tree (mdast) to Slate document tree, and vice versa. Made for WYSIWYG markdown editor.
Stars: ✭ 62 (+138.46%)
Mutual labels:  mdast

hast-util-to-mdast

hast utility to transform to mdast.

What is this?

This package is a utility that takes a hast (HTML) syntax tree as input and turns it into an mdast (markdown) syntax tree.

When should I use this?

This project is useful when you want to turn HTML to markdown.

The mdast utility mdast-util-to-hast does the inverse of this utility. It turns markdown into HTML.

The rehype plugin rehype-remark wraps this utility to also turn HTML to markdown at a higher-level (easier) abstraction.

Install

This package is ESM only. In Node.js (version 14.14+ and 16.0+), install with npm:

npm install hast-util-to-mdast

In Deno with esm.sh:

import {toMdast} from 'https://esm.sh/hast-util-to-mdast@9'

In browsers with esm.sh:

<script type="module">
  import {toMdast} from 'https://esm.sh/hast-util-to-mdast@9?bundle'
</script>

Use

Say we have the following example.html:

<h2>Hello <strong>world!</strong></h2>

…and next to it a module example.js:

import fs from 'node:fs/promises'
import {fromHtml} from 'hast-util-from-html'
import {toMdast} from 'hast-util-to-mdast'
import {toMarkdown} from 'mdast-util-to-markdown'

const html = String(await fs.readFile('example.html'))
const hast = fromHtml(html, {fragment: true})
const mdast = toMdast(hast)
const markdown = toMarkdown(mdast)

console.log(markdown)

…now running node example.js yields:

## Hello **world!**
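
To make the two tree formats more concrete, here is a rough sketch of what the hast element and the resulting mdast node for the example above look like as plain objects. Positional info and the surrounding root nodes are left out, so treat this as an illustration of the shapes rather than exact output:

```javascript
// hast for `<h2>Hello <strong>world!</strong></h2>` (positions omitted).
const hastHeading = {
  type: 'element',
  tagName: 'h2',
  properties: {},
  children: [
    {type: 'text', value: 'Hello '},
    {
      type: 'element',
      tagName: 'strong',
      properties: {},
      children: [{type: 'text', value: 'world!'}]
    }
  ]
}

// The matching mdast: the tag name `h2` becomes a `heading` with `depth: 2`.
const mdastHeading = {
  type: 'heading',
  depth: 2,
  children: [
    {type: 'text', value: 'Hello '},
    {type: 'strong', children: [{type: 'text', value: 'world!'}]}
  ]
}

console.log(JSON.stringify(mdastHeading, null, 2))
```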

API

This package exports the identifiers defaultHandlers, defaultNodeHandlers, and toMdast. There is no default export.

toMdast(tree[, options])

Transform hast to mdast.

Parameters
  • tree (HastNode) — hast tree to transform
  • options (Options, optional) — configuration
Returns

mdast tree (MdastNode).

defaultHandlers

Default handlers for elements (Record<string, Handle>).

Each key is an element name, each value is a Handle.

defaultNodeHandlers

Default handlers for nodes (Record<string, NodeHandle>).

Each key is a node type, each value is a NodeHandle.

Handle

Handle a particular element (TypeScript type).

Parameters
  • state (State) — info passed around about the current state
  • element (Element) — element to transform
  • parent (HastParent) — parent of element
Returns

mdast node or nodes (MdastNode | Array<MdastNode> | void).
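
As a sketch of what a Handle can look like, the function below turns a mark element into mdast emphasis. The choice of mark and emphasis is only an example, and stubState is a hypothetical stand-in so the handler can be called in isolation; in real use the state is created and passed in by toMdast:

```javascript
// Example handler: map `<mark>` to an mdast `emphasis` node.
function mark(state, element) {
  const result = {type: 'emphasis', children: state.all(element)}
  // Copy positional info from the hast element, if there is any.
  state.patch(element, result)
  return result
}

// Hypothetical stub, for illustration only: it handles text children and
// nothing else. The real `state` is much more capable.
const stubState = {
  all(parent) {
    return parent.children
      .filter((child) => child.type === 'text')
      .map((child) => ({type: 'text', value: child.value}))
  },
  patch(from, to) {
    if (from.position) to.position = from.position
  }
}

const markElement = {
  type: 'element',
  tagName: 'mark',
  properties: {},
  children: [{type: 'text', value: 'important'}]
}

console.log(JSON.stringify(mark(stubState, markElement)))
```

Such a function would be passed as handlers: {mark} in options.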

NodeHandle

Handle a particular node (TypeScript type).

Parameters
  • state (State) — info passed around about the current state
  • node (any) — node to transform
  • parent (HastParent) — parent of node
Returns

mdast node or nodes (MdastNode | Array<MdastNode> | void).

Options

Configuration (TypeScript type).

Fields

newlines

Keep line endings when collapsing whitespace (boolean, default: false).

The default collapses to a single space.
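
The difference between the two settings can be sketched with a small standalone function. collapseWhitespace is a hypothetical helper written for this illustration and is not how the package implements it; it only approximates the described behavior:

```javascript
// Hypothetical helper: collapse each run of HTML whitespace.
// With `newlines: true`, a run that contains a line ending becomes `\n`;
// every other run becomes a single space.
function collapseWhitespace(value, newlines = false) {
  return value.replace(/[\t\n\f\r ]+/g, (run) =>
    newlines && /[\n\r]/.test(run) ? '\n' : ' '
  )
}

console.log(JSON.stringify(collapseWhitespace('one \n  two'))) // "one two"
console.log(JSON.stringify(collapseWhitespace('one \n  two', true))) // "one\ntwo"
```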

checked

Value to use for a checked checkbox or radio input (string, default: '[x]').

unchecked

Value to use for an unchecked checkbox or radio input (string, default: '[ ]').
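
How checked and unchecked are chosen can be sketched as follows. checkboxText is a hypothetical helper used only for illustration, assuming the input is a hast element with a boolean checked property; the package's own handling of input elements covers more cases:

```javascript
// Hypothetical helper: pick the text that stands in for a checkbox or radio.
function checkboxText(input, options = {}) {
  const checked = options.checked ?? '[x]'
  const unchecked = options.unchecked ?? '[ ]'
  return input.properties.checked ? checked : unchecked
}

const checkbox = {
  type: 'element',
  tagName: 'input',
  properties: {type: 'checkbox', checked: true},
  children: []
}

console.log(checkboxText(checkbox)) // [x]
console.log(checkboxText(checkbox, {checked: '☑', unchecked: '☐'})) // ☑
```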

quotes

List of quotes to use (Array<string>, default: ['"']).

Each value can be one or two characters. When two, the first character determines the opening quote and the second the closing quote at that level. When one, both the opening and closing quote are that character.

The order in which the preferred quotes appear determines which quotes to use at which level of nesting. So, to prefer ‘’ at the first level of nesting, and “” at the second, pass ['‘’', '“”']. If <q>s are nested deeper than the given amount of quotes, the markers wrap around: a third level of nesting when using ['«»', '‹›'] should have double guillemets, a fourth single, a fifth double again, etc.
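
The wrap-around rule can be sketched with a modulo over the list. pickQuotes is a hypothetical helper, and depth is assumed to be zero-based here (0 for the outermost q):

```javascript
// Hypothetical helper: choose the opening and closing quote for a nesting depth.
function pickQuotes(quotes, depth) {
  // Wrap around when nesting goes deeper than the number of given quotes.
  const pair = quotes[depth % quotes.length]
  const open = pair.charAt(0)
  // A one-character value is used for both the opening and the closing quote.
  const close = pair.length > 1 ? pair.charAt(1) : open
  return {open, close}
}

const guillemets = ['«»', '‹›']

console.log(pickQuotes(guillemets, 0)) // { open: '«', close: '»' }
console.log(pickQuotes(guillemets, 1)) // { open: '‹', close: '›' }
console.log(pickQuotes(guillemets, 2)) // { open: '«', close: '»' }
console.log(pickQuotes(['"'], 3)) // { open: '"', close: '"' }
```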

document

Whether the given tree represents a complete document (boolean?, default: undefined).

Applies when the tree is a root node. When the tree represents a complete document, then things are wrapped in paragraphs when needed, and otherwise they’re left as-is. The default checks for whether there’s mixed content: some phrasing nodes and some non-phrasing nodes.

handlers

Object mapping tag names to functions handling the corresponding elements (Record<string, Handle>).

Merged into the defaults. See Handle.

nodeHandlers

Object mapping node types to functions handling the corresponding nodes (Record<string, NodeHandle>).

Merged into the defaults. See NodeHandle.
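
"Merged into the defaults" can be pictured as an object spread in which user entries win over default entries with the same key. The sketch below uses small stand-in objects instead of the package's real defaultHandlers, so the names standInDefaults and userHandlers are assumptions for illustration:

```javascript
// Stand-ins for the package's defaults (hypothetical and heavily simplified).
const standInDefaults = {
  em: () => ({type: 'emphasis', children: []}),
  strong: () => ({type: 'strong', children: []})
}

// What a user might pass as `options.handlers`: a handler that returns
// nothing, which drops `em` elements.
const userHandlers = {
  em() {}
}

// User handlers override defaults with the same key; other defaults stay.
const merged = {...standInDefaults, ...userHandlers}

console.log(merged.em === userHandlers.em) // true
console.log(merged.strong === standInDefaults.strong) // true
console.log(merged.em()) // undefined
```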

State

Info passed around about the current state (TypeScript type).

Fields
  • patch ((from: HastNode, to: MdastNode) => void) — copy a node’s positional info
  • one ((node: HastNode, parent: HastParent | undefined) => MdastNode | Array<MdastNode> | void) — transform a hast node to mdast
  • all ((parent: HastParent) => Array<MdastContent>) — transform the children of a hast parent to mdast
  • toFlow ((nodes: Array<MdastContent>) => Array<MdastFlowContent>) — transform a list of mdast nodes to flow
  • toSpecificContent (<ParentType>(nodes: Array<MdastContent>, build: (() => ParentType)) => Array<ParentType>) — turn arbitrary content into a list of a particular node type
  • resolve ((url: string | null | undefined) => string) — resolve a URL relative to a base
  • options (Options) — user configuration
  • elementById (Map<string, Element>) — elements by their id
  • handlers (Record<string, Handle>) — applied element handlers (see Handle)
  • nodeHandlers (Record<string, NodeHandle>) — applied node handlers (see NodeHandle)
  • baseFound (boolean) — whether a <base> element was seen
  • frozenBaseUrl (string | undefined) — href of <base>, if any
  • inTable (boolean) — whether we’re in a table
  • qNesting (number) — how deep we’re in <q>s
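
As an illustration of how resolve and frozenBaseUrl relate, the sketch below resolves a URL against a base with the standard URL class. It is written as a standalone function that takes the base as a second argument, which is an assumption made for this example; on the real state, resolve takes only the URL:

```javascript
// Standalone sketch: resolve `url` against the `href` of `<base>`, if any.
function resolveUrl(url, frozenBaseUrl) {
  if (url === null || url === undefined) return ''
  if (frozenBaseUrl) return String(new URL(url, frozenBaseUrl))
  return url
}

console.log(resolveUrl('img/a.png', 'https://example.com/docs/'))
// https://example.com/docs/img/a.png
console.log(resolveUrl('img/a.png')) // img/a.png
console.log(JSON.stringify(resolveUrl(undefined))) // ""
```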

Examples

Example: ignoring things

It’s possible to exclude something from within HTML when turning it into markdown, by wrapping it in an element with a data-mdast attribute set to 'ignore'. For example:

<p><strong>Strong</strong> and <em data-mdast="ignore">emphasis</em>.</p>

Yields:

**Strong** and .

It’s also possible to pass a handler to ignore nodes. For example, to ignore em elements, pass handlers: {'em': function () {}}:

<p><strong>Strong</strong> and <em>emphasis</em>.</p>

Yields:

**Strong** and .

Example: keeping some HTML

The goal of this project is to map HTML to plain and readable markdown. That means that certain elements are ignored (such as <svg>) or “downgraded” (such as <video> to links). You can change this by passing handlers.

Say we have the following file example.html:

<p>
  Some text with
  <svg viewBox="0 0 1 1" width="1" height="1"><rect fill="black" x="0" y="0" width="1" height="1" /></svg>
  a graphic… Wait is that a dead pixel?
</p>

To keep the <svg> as HTML in the markdown instead of dropping it, example.js could look like so:

/**
 * @typedef {import('mdast').HTML} HTML
 */

import fs from 'node:fs/promises'
import {fromHtml} from 'hast-util-from-html'
import {toMdast} from 'hast-util-to-mdast'
import {toHtml} from 'hast-util-to-html'
import {toMarkdown} from 'mdast-util-to-markdown'

const html = String(await fs.readFile('example.html'))
const hast = fromHtml(html, {fragment: true})
const mdast = toMdast(hast, {
  handlers: {
    svg(state, node) {
      /** @type {HTML} */
      const result = {type: 'html', value: toHtml(node, {space: 'svg'})}
      state.patch(node, result)
      return result
    }
  }
})
const markdown = toMarkdown(mdast)

console.log(markdown)

Yields:

Some text with <svg viewBox="0 0 1 1" width="1" height="1"><rect fill="black" x="0" y="0" width="1" height="1"></rect></svg> a graphic… Wait is that a dead pixel?

Algorithm

The algorithm used in this project supports all HTML elements, including ancient ones (xmp) and obscure ones (base). It handles forms, media, and implicit and explicit paragraphs particularly well (see HTML Standard, A. van Kesteren; et al. WHATWG § 3.2.5.4 Paragraphs), such as:

<article>
  An implicit paragraph.
  <h1>An explicit paragraph.</h1>
</article>

Yields:

An implicit paragraph.

# An explicit paragraph.
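
The idea of implicit paragraphs can be sketched as grouping runs of phrasing content into paragraph nodes while leaving other nodes alone. wrapPhrasing and the phrasingTypes set below are hypothetical and much simpler than what this package does (for example, whitespace-only runs are not dropped here):

```javascript
// Simplified, assumed set of mdast phrasing node types.
const phrasingTypes = new Set([
  'text', 'emphasis', 'strong', 'inlineCode', 'link', 'image', 'break', 'delete'
])

// Hypothetical helper: wrap each run of phrasing nodes in a paragraph.
function wrapPhrasing(nodes) {
  const result = []
  let run = []

  const flush = () => {
    if (run.length > 0) {
      result.push({type: 'paragraph', children: run})
      run = []
    }
  }

  for (const node of nodes) {
    if (phrasingTypes.has(node.type)) {
      run.push(node)
    } else {
      flush()
      result.push(node)
    }
  }

  flush()
  return result
}

const mixed = [
  {type: 'text', value: 'An implicit paragraph.'},
  {type: 'heading', depth: 1, children: [{type: 'text', value: 'An explicit paragraph.'}]}
]

console.log(wrapPhrasing(mixed).map((node) => node.type)) // [ 'paragraph', 'heading' ]
```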

Syntax

HTML is handled according to WHATWG HTML (the living standard), which is also followed by browsers such as Chrome and Firefox.

This project creates markdown according to GFM, which is a standard that’s based on CommonMark but adds strikethrough (~like so~) and tables (| Table header | …), amongst some alternative syntaxes.

Syntax tree

The input syntax tree format is hast. Any HTML that can be represented in hast is accepted as input. The output syntax tree format is mdast.

When <table> elements or <del>, <s>, and <strike> exist in the HTML, then the GFM nodes table and delete are used. This utility does not generate definitions or references, or syntax extensions such as footnotes, frontmatter, or math.

Types

This package is fully typed with TypeScript. It exports the additional types Handle, NodeHandle, Options, and State.

Compatibility

Projects maintained by the unified collective are compatible with all maintained versions of Node.js. As of now, that is Node.js 14.14+ and 16.0+. Our projects sometimes work with older versions, but this is not guaranteed.

Security

Use of hast-util-to-mdast is safe by default.

Related

Contribute

See contributing.md in syntax-tree/.github for ways to get started. See support.md for ways to get help.

This project has a code of conduct. By interacting with this repository, organization, or community you agree to abide by its terms.

License

MIT © Titus Wormer
