This package tokenizes words, sentences, and graphemes, based on Unicode text segmentation (UAX #29), for Unicode version 13.0.0.
## Usage

```go
import (
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/uax29/words"
)

text := "It’s not “obvious” (IMHO) what comprises a word, a sentence, or a grapheme. 👍🏼🐶!"
reader := strings.NewReader(text)

scanner := words.NewScanner(reader)

// Scan returns true until error or EOF
for scanner.Scan() {
	fmt.Printf("%q\n", scanner.Text())
}

// Gotta check the error (because we depend on I/O)
if err := scanner.Err(); err != nil {
	log.Fatal(err)
}
```
## Why tokenize?

Any time our code operates on individual words, we are tokenizing. Often, we do it ad hoc, such as splitting on spaces, which gives inconsistent results. Best to do it consistently.
## Conformance

We use the official Unicode test suites, thanks to bleve.
## Performance

`uax29` is designed to work in constant memory, regardless of input size. It buffers input and streams tokens. (For example, I am seeing a maximum resident size of 8MB when processing a 300MB file.)

Execution time is O(n) on input size. It can be I/O-bound; I/O performance is determined by the `io.Reader` you pass to `NewScanner`.

In my local benchmarking (Mac laptop), `uax29/words` processes around 25MM tokens per second, or 90MB/s, of multi-lingual prose.
## Status

- The word boundary rules have been implemented in the `words` package
- The sentence boundary rules have been implemented in the `sentences` package
- The grapheme cluster rules have been implemented in the `graphemes` package
- The official test suite passes for words, sentences, and graphemes
- We code-gen the Unicode categories relevant to UAX 29 by running `go generate` at the repository root
- There is discussion of implementing the above in Go’s `x/text` package
## Invalid inputs

Invalid UTF-8 input is undefined behavior. That said, we’ve worked to ensure that such inputs will not cause pathological outcomes, such as a panic or an infinite loop. Callers should expect “garbage in, garbage out”.

There are two tests in each package, `TestInvalidUTF8` and `TestRandomBytes`. Both pass: the invalid bytes are returned verbatim, with no guarantee as to how they will be segmented.
## See also

jargon, a text pipelines package for CLI and Go, which consumes this package.