aantron / Markup.ml

License: MIT
Error-recovering streaming HTML5 and XML parsers

Programming Languages

ocaml
1615 projects

Projects that are alternatives of or similar to Markup.ml

Myflix
Myflix, a Netflix clone!
Stars: ✭ 260 (+113.11%)
Mutual labels:  streaming, html5
Rx Player
DASH/Smooth HTML5 Video Player
Stars: ✭ 600 (+391.8%)
Mutual labels:  streaming, html5
Loaders.gl
Loaders for big data visualization. Website:
Stars: ✭ 272 (+122.95%)
Mutual labels:  xml, streaming
Prettydiff
Beautifier and language aware code comparison tool for many languages. It also minifies and a few other things.
Stars: ✭ 1,635 (+1240.16%)
Mutual labels:  xml, html5
Soupsieve
A modern CSS selector implementation for BeautifulSoup
Stars: ✭ 95 (-22.13%)
Mutual labels:  xml, html5
Mmark
Mmark: a powerful markdown processor in Go geared towards the IETF
Stars: ✭ 313 (+156.56%)
Mutual labels:  xml, html5
Stream Parser
⚡ PHP7 / Laravel Multi-format Streaming Parser
Stars: ✭ 391 (+220.49%)
Mutual labels:  xml, streaming
Sheetjs
📗 SheetJS Community Edition -- Spreadsheet Data Toolkit
Stars: ✭ 28,479 (+23243.44%)
Mutual labels:  xml, html5
Mediaelement
HTML5 <audio> or <video> player with support for MP4, WebM, and MP3 as well as HLS, Dash, YouTube, Facebook, SoundCloud and others with a common HTML5 MediaElement API, enabling a consistent UI in all browsers.
Stars: ✭ 7,767 (+6266.39%)
Mutual labels:  streaming, html5
Macsvg
macSVG - An open-source macOS app for designing HTML5 SVG (Scalable Vector Graphics) art and animation with a WebKit web view ➤➤➤
Stars: ✭ 789 (+546.72%)
Mutual labels:  xml, html5
Streaming Room
Streaming room in Node.js, rtmp, hsl, html5 videojs player
Stars: ✭ 106 (-13.11%)
Mutual labels:  streaming, html5
Hls.js
HLS.js is a JavaScript library that plays HLS in browsers with support for MSE.
Stars: ✭ 10,791 (+8745.08%)
Mutual labels:  streaming, html5
Twital
Twital is a "plugin" for Twig that adds some sugar syntax, which makes its templates similar to PHPTal or VueJS.
Stars: ✭ 116 (-4.92%)
Mutual labels:  xml, html5
Phaser Kinetic Scrolling Plugin
Kinetic Scrolling plugin for Canvas using Phaser Framework
Stars: ✭ 117 (-4.1%)
Mutual labels:  html5
Rtp
A Go implementation of RTP
Stars: ✭ 120 (-1.64%)
Mutual labels:  streaming
Flexlib
FlexLib是一个基于flexbox模型,使用xml文件进行界面布局的框架,融合了web快速布局的能力,让iOS界面开发像写网页一样简单快速
Stars: ✭ 1,569 (+1186.07%)
Mutual labels:  xml
React Form With Constraints
Simple form validation for React
Stars: ✭ 117 (-4.1%)
Mutual labels:  html5
Persistentstreamplayer
Stream audio over http, and persist the data to a local file while buffering
Stars: ✭ 120 (-1.64%)
Mutual labels:  streaming
Saxerator
A SAX-based XML parser for parsing large files into manageable chunks
Stars: ✭ 119 (-2.46%)
Mutual labels:  xml
Lemminx
XML Language Server
Stars: ✭ 117 (-4.1%)
Mutual labels:  xml

Markup.ml

Markup.ml is a pair of parsers implementing the HTML5 and XML specifications, including error recovery. Usage is simple, because each parser is a function from byte streams to parsing signal streams:

Usage example
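As a rough textual counterpart to the usage example, the sketch below runs one HTML string through the pipeline described above and renders each parsing signal as a line of text. The helper name signal_lines and the sample input are illustrative assumptions. The Markup functions used (string, parse_html, signals, map, signal_to_string, to_list) come from the library's synchronous interface.

```ocaml
(* Build with the markup package, for example:
   ocamlfind opt -linkpkg -package markup example.ml *)

(* Hypothetical helper: parse an HTML string and render each signal as text. *)
let signal_lines (html : string) : string list =
  let open Markup in
  string html                  (* string -> byte stream *)
  |> parse_html                (* byte stream -> parser *)
  |> signals                   (* parser -> signal stream *)
  |> map signal_to_string      (* signal stream -> stream of strings *)
  |> to_list                   (* forces the parse and collects the result *)

let () =
  signal_lines "<p>Hello, <em>world</em>!</p>"
  |> List.iter print_endline
```

Nothing is parsed until to_list pulls on the stream. Everything before it only describes the pipeline.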

In addition to being error-correcting, the parsers are:

  • streaming: parsing partial input and emitting signals while more input is still being received;
  • lazy: not parsing input unless you have requested the next parsing signal, so you can easily stop parsing partway through a document;
  • non-blocking: they can be used with Lwt, but still provide a straightforward synchronous interface for simple usage; and
  • one-pass: memory consumption is limited since the parsers don't build up a document representation, nor buffer input beyond a small amount of lookahead.
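The lazy, one-pass behavior can be illustrated with a small sketch that pulls signals one at a time using Markup.next and stops at the first start tag. The function name first_element_name is an assumption made for this example, not part of the library.

```ocaml
(* Hypothetical helper: return the local name of the first element in the
   input. Signals are requested one by one, so the parser does no work
   beyond the point where the loop stops. *)
let first_element_name (html : string) : string option =
  let stream = Markup.(string html |> parse_html |> signals) in
  let rec loop () =
    match Markup.next stream with
    | Some (`Start_element ((_ns, name), _attrs)) -> Some name
    | Some _ -> loop ()          (* skip doctype, text, comments, ... *)
    | None -> None               (* end of input, no element found *)
  in
  loop ()

let () =
  match first_element_name "<!DOCTYPE html><html><body><p>hi</p></body></html>" with
  | Some name -> print_endline name
  | None -> print_endline "(no elements)"
```

Because the loop returns as soon as it sees a start tag, the rest of the document is never parsed.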

The parsers detect character encodings automatically, and emit everything in UTF-8. The HTML parser understands SVG and MathML, in addition to HTML5.

Here is a breakdown showing the signal stream and errors emitted during the parsing and pretty-printing of bad_html:

string bad_html         "<body><p><em>Markup.ml<p>rocks!"

|> parse_html           `Start_element "body"
|> signals              `Start_element "p"
                        `Start_element "em"
                        `Text ["Markup.ml"]
                        ~report (1, 10) (`Unmatched_start_tag "em")
                        `End_element                   (* </em>: recovery *)
                        `End_element                   (* </p>: not an error *)
                        `Start_element "p"
                        `Start_element "em"            (* recovery *)
                        `Text ["rocks!"]
                        `End_element                   (* </em> *)
                        `End_element                   (* </p> *)
                        `End_element                   (* </body> *)

|> pretty_print         (* adjusts the `Text signals *)

|> write_html
|> to_channel stdout;;  "...shown above..."            (* valid HTML *)
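The breakdown above can be approximated by a compilable sketch that collects the reported errors in a list and returns the pretty-printed HTML as a string instead of writing to stdout. The name parse_with_errors is hypothetical. The exact error messages and output depend on the library version, so none are shown here.

```ocaml
(* Hypothetical helper: parse possibly malformed HTML, returning the error
   messages reported during parsing together with the serialized output. *)
let parse_with_errors (html : string) : string list * string =
  let errors = ref [] in
  (* Called by the parser each time it recovers from an error. *)
  let report location error =
    errors := Markup.Error.to_string ~location error :: !errors
  in
  let output =
    Markup.(
      string html
      |> parse_html ~report
      |> signals
      |> pretty_print          (* adjusts the `Text signals *)
      |> write_html
      |> to_string)            (* forces the whole pipeline *)
  in
  (List.rev !errors, output)

let () =
  let errors, output = parse_with_errors "<body><p><em>Markup.ml<p>rocks!" in
  List.iter prerr_endline errors;
  print_string output
```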

The parsers are tested thoroughly.

For a higher-level parser, see Lambda Soup, which is based on Markup.ml and can additionally search documents using CSS selectors and perform various manipulations.


Overview and basic usage

The interface is centered around four functions between byte streams and signal streams: parse_html, write_html, parse_xml, and write_xml. These have several optional arguments for fine-tuning their behavior. The rest of the functions either input or output byte streams, or transform signal streams in some interesting way.
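As a minimal sketch of how these pieces fit together, the following function goes from a byte stream to signals with parse_xml, transforms the signal stream with filter, and goes back to bytes with write_xml. The name strip_comments and the sample document are assumptions chosen for illustration.

```ocaml
(* Hypothetical helper: remove all comments from an XML document. *)
let strip_comments (xml : string) : string =
  Markup.(
    string xml
    |> parse_xml                                        (* bytes -> parser *)
    |> signals                                          (* parser -> signals *)
    |> filter (function `Comment _ -> false | _ -> true) (* drop comments *)
    |> write_xml                                        (* signals -> bytes *)
    |> to_string)

let () =
  print_endline (strip_comments "<root><!-- note -->text</root>")
```

The same shape should work for HTML by swapping in parse_html and write_html, since the transformation in the middle only sees signals.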

Here is an example with an optional argument:

(* Show up to 10 XML well-formedness errors to the user. Stop after
   the 10th, without reading more input. *)
let report =
  let count = ref 0 in
  fun location error ->
    error |> Error.to_string ~location |> prerr_endline;
    count := !count + 1;
    if !count >= 10 then raise_notrace Exit

file "some.xml" |> fst |> parse_xml ~report |> signals |> drain

Advanced: Cohttp + Markup.ml + Lambda Soup + Lwt

This program requests a Google search, then does a streaming scrape of result titles. It exits when it finds a GitHub link, without reading more input. Only one h3 element is converted into an in-memory tree at a time.

let () =
  Lwt_main.run begin
    (* Send request. Assume success. *)
    let url = "https://www.google.com/search?q=markup.ml" in
    let%lwt _, body = Cohttp_lwt_unix.Client.get (Uri.of_string url) in

    (* Adapt response to a Markup.ml stream. *)
    let body = body |> Cohttp_lwt.Body.to_stream |> Markup_lwt.lwt_stream in

    (* Set up a lazy stream of h3 elements. *)
    let h3s = Markup.(body
      |> strings_to_bytes |> parse_html |> signals
      |> elements (fun (_ns, name) _attrs -> name = "h3"))
    in

    (* Find the GitHub link. .iter and .load cause actual reading of data. *)
    h3s |> Markup_lwt.iter (fun h3 ->
      let%lwt h3 = Markup_lwt.load h3 in
      match Soup.(from_signals h3 $? "a[href*=github]") with
      | None -> Lwt.return_unit
      | Some anchor ->
        print_endline (String.concat "" (Soup.texts anchor));
        exit 0)
  end

This prints GitHub - aantron/markup.ml: Error-recovering streaming HTML5 and .... To run it, do:

ocamlfind opt -linkpkg -package lwt.ppx,cohttp.lwt,markup.lwt,lambdasoup \
    scrape.ml && ./a.out

You can install all the necessary packages by running:

opam install lwt_ssl
opam install cohttp-lwt-unix lambdasoup markup

Installing

opam install markup

Documentation

The interface of Markup.ml is three modules: Markup, Markup_lwt, and Markup_lwt_unix. The last two are available only if you have Lwt installed (OPAM package lwt).

The documentation includes a summary of the conformance status of Markup.ml.


Depending

Markup.ml uses semantic versioning, but is currently in 0.x.x. The minor version number will be incremented on breaking changes.


Contributing

Contributions are very much welcome. Please see CONTRIBUTING for instructions, suggestions, and an overview of the code. There is also a list of easy issues.


License

Markup.ml is distributed under the MIT license. The Markup.ml source distribution includes a copy of the HTML5 entity list, which is distributed under the W3C document license.
