Adding UTF-X decoding support to the OCaml Stdlib
This has been merged upstream and will be part of OCaml 4.14
This plays around with improving this old proposal, especially with respect to Alain's comment about avoiding exceptions.
For now we simply focus on providing an API in `Bytes`; the rest can follow easily and/or later.
We introduce the abstract type `Uchar.utf_decode` for decode results. It is just an `int` value that has the decoded `Uchar.t` (or `Uchar.rep` in case of decode error) in the lower bits and information about the decode in the bits above `Uchar.max`. The scheme is compatible with both 32-bit and 64-bit platforms, see the implementation for more details. No exceptions have to be introduced and the API does not allocate on decodes.
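To make the bit-packing idea concrete, here is a minimal sketch of such a scheme (illustrative only; the actual implementations in this repository may use a different layout): the low 24 bits hold the decoded character, three bits above them hold the number of bytes consumed, and one more bit flags validity.

```ocaml
(* Sketch of a [utf_decode] layout. A value is an int [0xDUUUUUU]:
   - [D]: four bits of decode information. The top bit is set iff the
     decode is valid; the lower three give the number of bytes consumed.
   - [UUUUUU]: the decoded character, or U+FFFD on invalid decodes.
   Since Uchar.max is U+10FFFF (21 bits), this fits easily in an OCaml
   int on both 32-bit and 64-bit platforms. *)
type utf_decode = int

let valid_bit = 27
let decode_bits = 24

let utf_decode_is_valid d = (d lsr valid_bit) = 1
let utf_decode_length d = (d lsr decode_bits) land 0b111
let utf_decode_uchar d = Uchar.of_int (d land 0xFFFFFF)
let utf_decode n u = ((8 lor n) lsl decode_bits) lor (Uchar.to_int u)
let utf_decode_invalid n = (n lsl decode_bits) lor 0xFFFD (* Uchar.rep *)
```

Note that constructing and querying decodes is pure integer arithmetic, which is why no allocation is needed.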
The resulting API seems nice to use. `examples.ml` shows how `Uutf`-like folders and `Seq`-based iterators can easily be derived in a few lines.
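For instance, assuming the `Bytes.get_utf_8_uchar` decoding function as merged upstream, a `Seq`-based iterator takes only a few lines (a sketch in the spirit of `examples.ml`, not its exact code):

```ocaml
(* Sketch: the sequence of characters of the UTF-8 encoded bytes [b].
   Decode errors show up as Uchar.rep in the sequence. *)
let utf_8_uchars (b : Bytes.t) : Uchar.t Seq.t =
  let rec uchars i () =
    if i >= Bytes.length b then Seq.Nil else
    let d = Bytes.get_utf_8_uchar b i in
    Seq.Cons (Uchar.utf_decode_uchar d, uchars (i + Uchar.utf_decode_length d))
  in
  uchars 0
```

The decode's length information is all that is needed to advance the index, whether the decode succeeded or not.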
It is also foolproof w.r.t. best-effort decodes: `Uchar.rep` will be appropriately added on decoding errors, as it should be for security reasons. See the `Seq` iterator example and more information on this below.
A few implementations are provided which differ only in their UTF-8 treatment:

- `utf_x_adhoc.ml`, uses a 256 byte string to index the cases mentioned in TUS Table 3.7 and then performs some ugly ad-hoc decoding. This ugly code is mine :-)
- `utf_x_if.ml`, is similar to adhoc, but uses `if` branches instead of a table; the result is a bit harder to comprehend.
- `utf_x_pat.ml`, is similar to adhoc but lets my favourite compiler work out the dispatch table itself by using pattern matching on byte ranges. Yay!
- `utf_x_dfa.ml`, uses the MIT-licensed UTF-8 decoding DFA devised by Bjoern Hoehrmann here. This uses a 364 byte string for the DFA and makes the decoding code much more elegant (if I didn't have to rewrite it as an imperative loop to make it more efficient).
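To illustrate the `pat` idea with a hypothetical helper (this is not the repository's actual code): classifying the first byte with pattern matching on character ranges lets the compiler work out the dispatch table itself.

```ocaml
(* Illustrative: expected length of a UTF-8 sequence given its first
   byte, or 0 if the byte cannot start a sequence (cf. TUS Table 3.7).
   The compiler turns these range patterns into a dispatch table. *)
let utf_8_decode_length = function
| '\x00' .. '\x7F' -> 1  (* ASCII *)
| '\xC2' .. '\xDF' -> 2
| '\xE0' .. '\xEF' -> 3
| '\xF0' .. '\xF4' -> 4
| _ -> 0                 (* invalid first byte *)
```

The actual decoder additionally has to check the continuation byte ranges, which also vary with the first byte as per Table 3.7.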
Rough benchmarks made with `perf8.ml` and measured via `time` on my machine seem to indicate we have, ordered by most performant: `pat`, `if`, `adhoc` and `dfa`. The first three are quite close to each other (within 10%). Comparing the fastest to the slowest, the `pat` version takes the following percentage of the time taken by the `dfa` version:
- 62% Markus Kuhn's UTF-8 demo. A mixture of things.
- 62% Hindi wikipedia articles multistream dataset. Exercises ASCII and Devanagari decodes which are three bytes long in UTF-8.
- 58% Croatian wikipedia articles multistream dataset. Exercises ASCII and Latin Extended-A decodes which are two bytes long in UTF-8.
- 56% on this README, mainly ASCII.
So I think the best candidate for upstreaming would be `pat`.
Best-effort decodes and U+FFFD replacements
The API also tries to take into account recommended strategies for U+FFFD replacement on best-effort decoding. For a discussion about these see here.
If nothing special is done, e.g. like the `Seq` iterator in `examples.ml`, the resulting API should behave like the "state machine view". Supposedly this is what is mandated by the WHATWG Encoding standard, if you can infer it through their crazy prose code specifications – TUS has a description of it on page 126 of the version 14.0.0 text, if you can remember all the precise jargon definitions :-) I believe that a short and concise spec of all that is:

> In case of decoding error return a U+FFFD. If you error on the first byte, consume the byte; otherwise consume the bytes preceding the erroring byte.
The API provides enough information on decodes to implement other replacement modes, like one `Uchar.rep` per bogus byte.
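For example, here is a hedged sketch of a fold that emits one `Uchar.rep` per byte consumed by an invalid decode, again assuming the upstreamed `Bytes.get_utf_8_uchar` API (the function name `fold_reps` is illustrative):

```ocaml
(* Sketch: fold over the characters of [b], substituting one Uchar.rep
   for *each* byte consumed by an invalid decode, rather than a single
   one per error. Uses only the decode's validity and length data. *)
let fold_reps f acc b =
  let len = Bytes.length b in
  let rec loop acc i =
    if i >= len then acc else
    let d = Bytes.get_utf_8_uchar b i in
    let n = Uchar.utf_decode_length d in
    if Uchar.utf_decode_is_valid d
    then loop (f acc (Uchar.utf_decode_uchar d)) (i + n)
    else
      (* one replacement character per consumed bogus byte *)
      let rec reps acc k = if k = 0 then acc else reps (f acc Uchar.rep) (k - 1) in
      loop (reps acc n) (i + n)
  in
  loop acc 0
```

The point is that no re-inspection of the bytes is needed: the decode value alone carries the byte count to replace.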
The `urep-test.html` and `urep-spec.html` files taken from the above mentioned page can be used to test the WHATWG Encoding standard behaviour (note that `Uutf` does not respect this).