This package tokenizes words, sentences, and graphemes, based on Unicode text segmentation (UAX #29), for Unicode version 13.0.0.
## Usage

```go
import (
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/uax29/words"
)

text := "It’s not “obvious” (IMHO) what comprises a word, a sentence, or a grapheme. 👍🏼🐶!"
reader := strings.NewReader(text)

scanner := words.NewScanner(reader)

// Scan returns true until error or EOF
for scanner.Scan() {
	fmt.Printf("%q\n", scanner.Text())
}

// Gotta check the error (because we depend on I/O)
if err := scanner.Err(); err != nil {
	log.Fatal(err)
}
```
## Why tokenize?

Any time our code operates on individual words, we are tokenizing. Often, we do it ad hoc, such as splitting on spaces, which gives inconsistent results. Best to do it consistently.
## Conformance

We use the official Unicode test suites, thanks to bleve.
## Performance

`uax29` is designed to work in constant memory, regardless of input size. It buffers input and streams tokens. (For example, I am seeing a maximum resident size of 8MB when processing a 300MB file.)

Execution time is O(n) on input size. It can be I/O-bound; I/O performance is determined by the `io.Reader` you pass to `NewScanner`.

In my local benchmarking (Mac laptop), `uax29/words` processes around 25MM tokens per second, or 90MB/s, of multi-lingual prose.
## Status

- The word boundary rules have been implemented in the `words` package
- The sentence boundary rules have been implemented in the `sentences` package
- The grapheme cluster rules have been implemented in the `graphemes` package
- The official test suite passes for words, sentences, and graphemes
- We code-gen the Unicode categories relevant to UAX 29 by running `go generate` at the repository root
- There is discussion of implementing the above in Go’s `x/text` package
## Invalid inputs

Invalid UTF-8 input is undefined behavior. That said, we’ve worked to ensure that such inputs will not cause pathological outcomes, such as a panic or an infinite loop. Callers should expect “garbage in, garbage out”.

There are two tests in each package, `TestInvalidUTF8` and `TestRandomBytes`. Both pass: the invalid bytes are returned verbatim, with no guarantee as to how they will be segmented.
## See also

jargon, a text pipelines package for CLI and Go, which consumes this package.