clipperhouse / uax29

License: MIT License
A tokenizer based on Unicode text segmentation (UAX 29), for Go

Projects that are alternatives of or similar to uax29

unzalgo
Transforms ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋ into this without breaking internationalization.
Stars: ✭ 38 (+46.15%)
Mutual labels:  unicode
unicode-emoji-json
Emoji data from unicode.org as easily consumable JSON files.
Stars: ✭ 149 (+473.08%)
Mutual labels:  unicode
contour
Modern C++ Terminal Emulator
Stars: ✭ 761 (+2826.92%)
Mutual labels:  unicode
simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Stars: ✭ 32 (+23.08%)
Mutual labels:  tokenization
NetUnicodeInfo
Unicode Character Inspector & Library providing a subset of the Unicode data for .NET clients.
Stars: ✭ 42 (+61.54%)
Mutual labels:  unicode
auto-data-tokenize
Identify and tokenize sensitive data automatically using Cloud DLP and Dataflow
Stars: ✭ 21 (-19.23%)
Mutual labels:  tokenization
stringx
Drop-in replacements for base R string functions powered by stringi
Stars: ✭ 14 (-46.15%)
Mutual labels:  unicode
ngx-emoj
A simple, theme-able emoji mart/picker for angular 4+
Stars: ✭ 18 (-30.77%)
Mutual labels:  unicode
prettype
An easy to use text stylizer for your desktop!
Stars: ✭ 14 (-46.15%)
Mutual labels:  unicode
unidecode
Elixir package to transliterate Unicode to ASCII
Stars: ✭ 18 (-30.77%)
Mutual labels:  unicode
gpprofile2017
Gpprof with unicode support and new features.
Stars: ✭ 60 (+130.77%)
Mutual labels:  unicode
TTyGO
VT220 serial terminal for Arduino
Stars: ✭ 22 (-15.38%)
Mutual labels:  unicode
Wordle2Townscaper
Wordle2Townscaper is meant to convert Wordle tweets into Townscaper houses using yellow and green building blocks.
Stars: ✭ 64 (+146.15%)
Mutual labels:  unicode
bert tokenization for java
This is a java version of Chinese tokenization descried in BERT.
Stars: ✭ 39 (+50%)
Mutual labels:  tokenization
rm-emoji-picker
A modern, ES2015 emoji picker and editor.
Stars: ✭ 76 (+192.31%)
Mutual labels:  unicode
2048-rs
Rust implementation of 2048 game
Stars: ✭ 15 (-42.31%)
Mutual labels:  unicode
UniObfuscator
Java obfuscator that hides code in comment tags and Unicode garbage by making use of Java's Unicode escapes.
Stars: ✭ 40 (+53.85%)
Mutual labels:  unicode
StringConvert
A simple C++11 based helper for converting string between a various charset
Stars: ✭ 16 (-38.46%)
Mutual labels:  unicode
umoji
😄 A lib convert emoji unicode to Surrogate pairs
Stars: ✭ 68 (+161.54%)
Mutual labels:  unicode
youtokentome-ruby
High performance unsupervised text tokenization for Ruby
Stars: ✭ 17 (-34.62%)
Mutual labels:  tokenization

This package tokenizes words, sentences and graphemes, based on Unicode text segmentation (UAX 29), for Unicode version 13.0.0.

Usage

import "github.com/clipperhouse/uax29/words"

text := "It’s not “obvious” (IMHO) what comprises a word, a sentence, or a grapheme. 👍🏼🐶!"
reader := strings.NewReader(text)

scanner := words.NewScanner(reader)

// Scan returns true until error or EOF
for scanner.Scan() {
	fmt.Printf("%q\n", scanner.Text())
}

// Gotta check the error (because we depend on I/O).
if err := scanner.Err(); err != nil {
	log.Fatal(err)
}

Full API documentation is on GoDoc.

Why tokenize?

Any time our code operates on individual words, we are tokenizing. Often, we do it ad hoc, such as splitting on spaces, which gives inconsistent results. Best to do it consistently.
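
For illustration, here is a minimal sketch (not part of the package) contrasting ad hoc whitespace splitting with the words scanner from the Usage section; the example text is arbitrary:

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/uax29/words"
)

func main() {
	text := "Hello, “world”!"

	// Ad hoc: splitting on whitespace leaves punctuation attached to the words.
	fmt.Println(strings.Fields(text)) // [Hello, “world”!]

	// UAX 29: punctuation and quotes become tokens of their own.
	scanner := words.NewScanner(strings.NewReader(text))
	for scanner.Scan() {
		fmt.Printf("%q ", scanner.Text())
	}
	fmt.Println()

	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}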

Conformance

We use the official Unicode test suites, thanks to bleve. Status: passing for words, sentences, and graphemes.

Performance

uax29 is designed to work in constant memory, regardless of input size. It buffers input and streams tokens. (For example, I am showing a maximum resident size of 8MB when processing a 300MB file.)

Execution time is linear (O(n)) in input size. It can be I/O-bound; I/O performance is determined by the io.Reader you pass to NewScanner.

In my local benchmarking (Mac laptop), uax29/words processes around 25 million tokens per second, or 90 MB/s, of multilingual prose.
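
As a sketch of streaming a large input in constant memory (big.txt is a placeholder filename; any io.Reader works):

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"

	"github.com/clipperhouse/uax29/words"
)

func main() {
	// big.txt is a placeholder; memory use stays constant regardless of its size.
	f, err := os.Open("big.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Wrapping the file in a bufio.Reader reduces the number of underlying reads.
	scanner := words.NewScanner(bufio.NewReader(f))

	count := 0
	for scanner.Scan() {
		count++
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("tokens:", count)
}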

Status

  • The word boundary rules have been implemented in the words package

  • The sentence boundary rules have been implemented in the sentences package

  • The grapheme cluster rules have been implemented in the graphemes package (see the sketch after this list)

  • The official test suite passes for words, sentences, and graphemes

  • We code-gen the Unicode categories relevant to UAX 29 by running go generate at the repository root

  • There is discussion of implementing the above in Go’s x/text package
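
Here is a sketch of the sentences and graphemes packages, assuming they mirror the words scanner API shown in the Usage section; the example text is arbitrary:

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/uax29/graphemes"
	"github.com/clipperhouse/uax29/sentences"
)

func main() {
	text := "Hello, world. 👍🏼 Nice!"

	// Sentence boundaries.
	s := sentences.NewScanner(strings.NewReader(text))
	for s.Scan() {
		fmt.Printf("sentence: %q\n", s.Text())
	}
	if err := s.Err(); err != nil {
		log.Fatal(err)
	}

	// Grapheme clusters: 👍🏼 (thumbs-up + skin tone modifier) is a single cluster.
	g := graphemes.NewScanner(strings.NewReader(text))
	for g.Scan() {
		fmt.Printf("grapheme: %q\n", g.Text())
	}
	if err := g.Err(); err != nil {
		log.Fatal(err)
	}
}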

Invalid inputs

Invalid UTF-8 input is undefined behavior. That said, we’ve worked to ensure that such inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.

There are two tests in each package, TestInvalidUTF8 and TestRandomBytes. Both pass: the invalid bytes are returned verbatim, though there is no guarantee as to how they will be segmented.
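
A sketch along the same lines, assuming the words scanner API from the Usage section; the invalid bytes below are arbitrary, and how they are segmented is unspecified:

package main

import (
	"bytes"
	"fmt"

	"github.com/clipperhouse/uax29/words"
)

func main() {
	// A stray continuation byte and a truncated multi-byte sequence.
	input := []byte{'a', 0x80, 'b', 0xE2, 0x28}

	scanner := words.NewScanner(bytes.NewReader(input))

	var out []byte
	for scanner.Scan() {
		out = append(out, scanner.Text()...)
	}
	if err := scanner.Err(); err != nil {
		fmt.Println("error:", err)
	}

	// Garbage in, garbage out: the bytes come back verbatim,
	// but where the token boundaries fall is unspecified.
	fmt.Println(bytes.Equal(input, out)) // true
}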

See also

jargon, a text pipelines package for CLI and Go, which consumes this package.

Prior art

blevesearch/segment

rivo/uniseg
