Adding UTF-X decoding support to the OCaml Stdlib
This has been merged upstream and will be part of OCaml 4.14
This plays around with improving this old proposal, especially with respect to Alain's comment about avoiding exceptions.
For now we simply focus on providing an API in `Bytes`; the rest can follow easily and/or later.
We introduce the abstract type `Uchar.utf_decode` for decode results. It is just an `int` value that has the decoded `Uchar.t` (or `Uchar.rep` in case of decode error) in the lower bits and information about the decode in the bits above `Uchar.max`. The scheme is compatible with both 32-bit and 64-bit platforms, see the implementation for more details. No exceptions have to be introduced and the API does not allocate on decodes.
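To make the bit-packing idea concrete, here is a minimal sketch of such a scheme (illustrative only; the actual implementations in this repository may use a different layout): the low 24 bits hold the decoded character, three bits above them hold the number of bytes consumed, and one more bit flags validity.

```ocaml
(* Sketch of a [utf_decode] layout. A value is an int [0xDUUUUUU]:
   - [D]: four bits of decode information. The top bit is set iff the
     decode is valid; the lower three give the number of bytes consumed.
   - [UUUUUU]: the decoded character, or U+FFFD on invalid decodes.
   Since Uchar.max is U+10FFFF (21 bits), this fits easily in an OCaml
   int on both 32-bit and 64-bit platforms. *)
type utf_decode = int

let valid_bit = 27
let decode_bits = 24

let utf_decode_is_valid d = (d lsr valid_bit) = 1
let utf_decode_length d = (d lsr decode_bits) land 0b111
let utf_decode_uchar d = Uchar.of_int (d land 0xFFFFFF)
let utf_decode n u = ((8 lor n) lsl decode_bits) lor (Uchar.to_int u)
let utf_decode_invalid n = (n lsl decode_bits) lor 0xFFFD (* Uchar.rep *)
```

Note that constructing and querying decodes is pure integer arithmetic, which is why no allocation is needed.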
The resulting API seems nice to use. `examples.ml` shows how `Uutf`-like folders and `Seq`-based iterators can easily be derived in a few lines.
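For instance, assuming the `Bytes.get_utf_8_uchar` decoding function as merged upstream, a `Seq`-based iterator takes only a few lines (a sketch in the spirit of `examples.ml`, not its exact code):

```ocaml
(* Sketch: the sequence of characters of the UTF-8 encoded bytes [b].
   Decode errors show up as Uchar.rep in the sequence. *)
let utf_8_uchars (b : Bytes.t) : Uchar.t Seq.t =
  let rec uchars i () =
    if i >= Bytes.length b then Seq.Nil else
    let d = Bytes.get_utf_8_uchar b i in
    Seq.Cons (Uchar.utf_decode_uchar d, uchars (i + Uchar.utf_decode_length d))
  in
  uchars 0
```

The decode's length information is all that is needed to advance the index, whether the decode succeeded or not.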
It is also foolproof w.r.t. best-effort decodes: `Uchar.rep` will be appropriately added on decoding errors, as it should be for security reasons. See the `Seq` iterator example and more information on this below.
A few implementations are provided which differ only in their UTF-8 treatment:

- `utf_x_adhoc.ml`, uses a 256 byte string to index the cases mentioned in TUS Table 3.7 and then performs some ugly ad-hoc decoding. This ugly code is mine :-)
- `utf_x_if.ml`, is similar to adhoc, but uses `if` branches instead of a table; the result is a bit harder to comprehend.
- `utf_x_pat.ml`, is similar to adhoc but lets my favourite compiler work out the dispatch table itself by using pattern matching on byte ranges. Yay!
- `utf_x_dfa.ml`, uses the MIT-licensed UTF-8 decoding DFA devised by Bjoern Hoehrmann here. This uses a 364 byte string for the DFA and makes the decoding code much more elegant (if I didn't have to rewrite it as an imperative loop to make it more efficient).
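To illustrate the `pat` idea with a hypothetical helper (this is not the repository's actual code): classifying the first byte with pattern matching on character ranges lets the compiler work out the dispatch table itself.

```ocaml
(* Illustrative: expected length of a UTF-8 sequence given its first
   byte, or 0 if the byte cannot start a sequence (cf. TUS Table 3.7).
   The compiler turns these range patterns into a dispatch table. *)
let utf_8_decode_length = function
| '\x00' .. '\x7F' -> 1  (* ASCII *)
| '\xC2' .. '\xDF' -> 2
| '\xE0' .. '\xEF' -> 3
| '\xF0' .. '\xF4' -> 4
| _ -> 0                 (* invalid first byte *)
```

The actual decoder additionally has to check the continuation byte ranges, which also vary with the first byte as per Table 3.7.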
Rough benchmarks made with `perf8.ml` and measured via `time` on my machine seem to indicate we have, ordered by most performant: `pat`, `if`, `adhoc` and `dfa`. The first three are quite close to each other (within 10%). Comparing the fastest to the slowest, the `pat` version takes the following percentage of the time taken by the `dfa` version:
- 62% Markus Kuhn's UTF-8 demo. A mixture of things.
- 62% Hindi wikipedia articles multistream dataset. Exercises ASCII and Devanagari decodes which are three bytes long in UTF-8.
- 58% Croatian wikipedia articles multistream dataset. Exercises ASCII and Latin Extended-A decodes which are two bytes long in UTF-8.
- 56% on this README, mainly ASCII.
So I think the best candidate for upstreaming would be `pat`.
Best-effort decodes and U+FFFD replacements
The API also tries to take into account recommended strategies for U+FFFD replacement on best-effort decoding. For a discussion about these see here.
If nothing special is done, e.g. like the `Seq` iterator in `examples.ml`, the resulting API should behave like the "state machine view". Supposedly this is what is mandated by the WHATWG Encoding standard, if you can infer it through their crazy prose code specifications – TUS has a description of it on page 126 of the version 14.0.0 text, if you can remember all the precise jargon definitions :-) I believe that a short and concise spec of all that is:

> In case of decoding error return a U+FFFD. If you error on the first byte, consume the byte; otherwise consume the bytes preceding the erroring byte.
The API provides enough information on decodes to implement other replacement modes, like one `Uchar.rep` per bogus byte.
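For example, here is a hedged sketch of a fold that emits one `Uchar.rep` per byte consumed by an invalid decode, again assuming the upstreamed `Bytes.get_utf_8_uchar` API (the function name `fold_reps` is illustrative):

```ocaml
(* Sketch: fold over the characters of [b], substituting one Uchar.rep
   for *each* byte consumed by an invalid decode, rather than a single
   one per error. Uses only the decode's validity and length data. *)
let fold_reps f acc b =
  let len = Bytes.length b in
  let rec loop acc i =
    if i >= len then acc else
    let d = Bytes.get_utf_8_uchar b i in
    let n = Uchar.utf_decode_length d in
    if Uchar.utf_decode_is_valid d
    then loop (f acc (Uchar.utf_decode_uchar d)) (i + n)
    else
      (* one replacement character per consumed bogus byte *)
      let rec reps acc k = if k = 0 then acc else reps (f acc Uchar.rep) (k - 1) in
      loop (reps acc n) (i + n)
  in
  loop acc 0
```

The point is that no re-inspection of the bytes is needed: the decode value alone carries the byte count to replace.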
The `urep-test.html` and `urep-spec.html` files taken from the above mentioned page can be used to test the WHATWG Encoding standard behaviour (note that `Uutf` does not respect this).