tapeinosyne / hyphenation

Licence: Apache-2.0, MIT
Text hyphenation for Rust

Projects that are alternatives of or similar to hyphenation

Tehreer-Android
Standalone text engine for Android aimed to be free from platform limitations
Stars: ✭ 61 (+41.86%)
Mutual labels:  unicode
confusables
A nodejs library for removing confusable unicode characters from strings.
Stars: ✭ 50 (+16.28%)
Mutual labels:  unicode
emoji-db
A database of Apple-supported emojis in JSON format. Used by my Alfred emoji workflow.
Stars: ✭ 32 (-25.58%)
Mutual labels:  unicode
unicode-blocks
Unicode Blocks of a Ruby String
Stars: ✭ 18 (-58.14%)
Mutual labels:  unicode
unicode-9.0.0
JavaScript-compatible Unicode data. Arrays of code points, arrays of symbols, and regular expressions for Unicode v9.0.0’s categories, scripts, blocks, bidi, and other properties.
Stars: ✭ 16 (-62.79%)
Mutual labels:  unicode
php-typography
A PHP library for improving your web typography.
Stars: ✭ 63 (+46.51%)
Mutual labels:  hyphenation
unicode.net
A Unicode library for .NET, supporting UTF8, UTF16, and UTF32. With an extra helping of emoji for good measure 🔥🌶️😁
Stars: ✭ 81 (+88.37%)
Mutual labels:  unicode
unicode
A Flask-Based Web-App for Exploring Unicode
Stars: ✭ 12 (-72.09%)
Mutual labels:  unicode
ocreval
Update of the ISRI Analytic Tools for OCR Evaluation with UTF-8 support
Stars: ✭ 48 (+11.63%)
Mutual labels:  unicode
glyphhanger
Your web font utility belt. It can subset web fonts. It can find unicode-ranges for you automatically. It makes julienne fries.
Stars: ✭ 422 (+881.4%)
Mutual labels:  unicode
elokab-terminal
Lightweight terminal emulator program that supports the Arabic language
Stars: ✭ 16 (-62.79%)
Mutual labels:  unicode
arrow-finder
These docs help you to find and use arrows you need more quickly
Stars: ✭ 24 (-44.19%)
Mutual labels:  unicode
unicode-data
Temporary holding place for my suggestions for future version of Unicode data files. Report bugs to https://www.unicode.org/reporting.html
Stars: ✭ 18 (-58.14%)
Mutual labels:  unicode
cs_string
Header-only library providing unicode aware string support for C++
Stars: ✭ 91 (+111.63%)
Mutual labels:  unicode
ruby-homograph-detector
🕵️‍♀️🕵️‍♂️ Ruby gem for determining whether a given URL is considered an IDN homograph attack
Stars: ✭ 29 (-32.56%)
Mutual labels:  unicode
jurl
Fast and simple URL parsing for Java, with UTF-8 and path resolving support
Stars: ✭ 84 (+95.35%)
Mutual labels:  unicode
unicode-lookup
The web's best unicode lookup tool!
Stars: ✭ 49 (+13.95%)
Mutual labels:  unicode
ara
ع Command line tool that displays Arabic text in terminal.
Stars: ✭ 27 (-37.21%)
Mutual labels:  unicode
unigem-objective-c
Unicode Gems, a Mac app, an iOS app, and an iOS keyboard for letter-like unicode.
Stars: ✭ 22 (-48.84%)
Mutual labels:  unicode
unicode_display_width
Displayed width of UTF-8 strings in Modern C++
Stars: ✭ 30 (-30.23%)
Mutual labels:  unicode

hyphenation

Hyphenation for UTF-8 strings in a variety of languages.

[dependencies]
hyphenation = "0.8.3"

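Two strategies are available:

  • Standard Knuth-Liang hyphenation, with dictionaries built from the bundled hyph-utf8 patterns;
  • Extended hyphenation, with extended patterns available for Catalan and Hungarian.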

Documentation

Docs.rs

Usage

Quickstart

The hyphenation library relies on hyphenation dictionaries, external files that must be loaded into memory. To start with, however, it can be more convenient to embed them in the compiled artifact.

[dependencies]
hyphenation = { version = "0.8.3", features = ["embed_all"] }

The topmost module of hyphenation offers a small prelude that can be imported to expose the most common functionality.

use hyphenation::*;

// Retrieve the embedded American English dictionary for `Standard` Knuth-Liang hyphenation.
let en_us = Standard::from_embedded(Language::EnglishUS) ?;

// Identify valid breaks in the given word.
let hyphenated = en_us.hyphenate("hyphenation");

// Word breaks are represented as byte indices into the string.
let break_indices = &hyphenated.breaks;
assert_eq!(break_indices, &[2, 6, 7]);

// The segments of a hyphenated word can be iterated over, marked or unmarked.
let marked = hyphenated.iter();
let collected : Vec<String> = marked.collect();
assert_eq!(collected, vec!["hy-", "phen-", "a-", "tion"]);

let unmarked = hyphenated.iter().segments();
let collected : Vec<&str> = unmarked.collect();
assert_eq!(collected, vec!["hy", "phen", "a", "tion"]);

// `hyphenate()` is case-insensitive.
let uppercase : Vec<_> = en_us.hyphenate("CAPITAL").into_iter().segments().collect();
assert_eq!(uppercase, vec!["CAP", "I", "TAL"]);
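Because breaks are plain byte indices, they can also be applied by hand, for instance to insert soft hyphens (U+00AD) instead of visible ones. The following is a minimal sketch using only the standard library; `mark_breaks` is a hypothetical helper, not part of the crate, and it assumes the indices are sorted in ascending order and fall on character boundaries.

```rust
/// Join the segments of `word` with `mark`, splitting at the given byte indices.
/// Hypothetical helper: `breaks` is assumed to be sorted in ascending order
/// and to fall on `char` boundaries, otherwise slicing will panic.
fn mark_breaks(word: &str, breaks: &[usize], mark: &str) -> String {
    let mut out = String::with_capacity(word.len() + breaks.len() * mark.len());
    let mut start = 0;
    for &end in breaks {
        // Copy the segment that ends at this break, then the mark.
        out.push_str(&word[start..end]);
        out.push_str(mark);
        start = end;
    }
    // Copy whatever follows the last break.
    out.push_str(&word[start..]);
    out
}
```

With the break indices shown above, `mark_breaks("hyphenation", &[2, 6, 7], "-")` yields `"hy-phen-a-tion"`; passing `"\u{ad}"` as the mark produces the same word with soft hyphens.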

Loading dictionaries at runtime

The current set of available dictionaries amounts to ~2.8MB of data. Although embedding them is an option, most applications should prefer to load individual dictionaries at runtime, like so:

let path_to_dict = "/path/to/en-us.bincode";
let english_us = Standard::from_path(Language::EnglishUS, path_to_dict) ?;

Dictionaries bundled with hyphenation can be retrieved from the build folder under target, and packaged with the final application as desired.

$ find target -name "dictionaries"
target/debug/build/hyphenation-33034db3e3b5f3ce/out/dictionaries
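If a packaging step needs to locate that folder programmatically, the `find` invocation above can be approximated with the standard library alone. This is only a sketch: `find_dictionary_dirs` is a hypothetical helper, and it matches directories only, whereas `find -name` would also match files.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collect every directory named `dictionaries` under `root`.
/// Hypothetical helper, roughly equivalent to `find <root> -name "dictionaries"`
/// restricted to directories.
fn find_dictionary_dirs(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    let mut pending = vec![root.to_path_buf()];
    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let entry = entry?;
            // `DirEntry::file_type` does not follow symlinks, so symlinked
            // directories are skipped and cannot cause loops.
            if entry.file_type()?.is_dir() {
                let path = entry.path();
                if entry.file_name() == "dictionaries" {
                    found.push(path.clone());
                }
                pending.push(path);
            }
        }
    }
    // Sort for a deterministic result regardless of traversal order.
    found.sort();
    Ok(found)
}
```

Calling `find_dictionary_dirs(Path::new("target"))` from the project root should list the same folder as the shell command. The hash in the build directory name varies between builds, so it is safer to search for it than to hard-code it.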

Segmentation

Dictionaries can be used in conjunction with text segmentation to hyphenate words within a text run. This short example uses the unicode-segmentation crate for untailored Unicode segmentation.

use unicode_segmentation::UnicodeSegmentation;

let hyphenate_text = |text : &str| -> String {
    // Split the text on word boundaries—
    text.split_word_bounds()
        // —and hyphenate each word individually.
        .flat_map(|word| en_us.hyphenate(word).into_iter())
        .collect()
};

let excerpt = "I know noble accents / And lucid, inescapable rhythms; […]";
assert_eq!("I know no-ble ac-cents / And lu-cid, in-escapable rhythms; […]"
          , hyphenate_text(excerpt));
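Where pulling in a segmentation crate is not an option, the same idea can be roughed out with the standard library. The sketch below is an assumption-laden stand-in, not the crate's API: `hyphenate_runs` is a hypothetical helper that treats each run of alphabetic characters as a word, which is much cruder than Unicode word segmentation, and `breaks_for` stands in for whatever returns the byte indices of valid breaks, such as a loaded dictionary.

```rust
/// Insert `mark` at the breaks reported by `breaks_for` in every alphabetic
/// run of `text`; all other characters are copied through unchanged.
/// Hypothetical helper: `breaks_for` must return ascending byte indices that
/// fall on `char` boundaries of the word it is given.
fn hyphenate_runs<F>(text: &str, mark: &str, breaks_for: F) -> String
where
    F: Fn(&str) -> Vec<usize>,
{
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while !rest.is_empty() {
        // Measure the leading run of characters that share the same class
        // (alphabetic or not) as the first character.
        let first_is_alpha = rest.chars().next().map_or(false, char::is_alphabetic);
        let run_len = rest
            .find(|c: char| c.is_alphabetic() != first_is_alpha)
            .unwrap_or(rest.len());
        let (run, tail) = rest.split_at(run_len);
        if first_is_alpha {
            let mut start = 0;
            for end in breaks_for(run) {
                out.push_str(&run[start..end]);
                out.push_str(mark);
                start = end;
            }
            out.push_str(&run[start..]);
        } else {
            out.push_str(run);
        }
        rest = tail;
    }
    out
}
```

Taking the break finder as a closure keeps the helper independent of any particular dictionary type and makes it easy to test with a stub.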

Normalization

Hyphenation patterns for languages affected by normalization occasionally cover multiple forms, at the discretion of their authors, but most often they don’t. If you require hyphenation to operate strictly on strings in a known normalization form, as described by the Unicode Standard Annex #15 and provided by the unicode-normalization crate, you may specify it in your Cargo manifest, like so:

[dependencies.hyphenation]
version = "0.8.3"
features = ["nfc"]

The features field may contain exactly one of the following normalization options:

  • "nfc", for canonical composition;
  • "nfd", for canonical decomposition;
  • "nfkc", for compatibility composition;
  • "nfkd", for compatibility decomposition.

You may prefer to build hyphenation in release mode if normalization is enabled, since the bundled hyphenation patterns will need to be reprocessed into dictionaries.
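To see why the normalization form matters, consider that canonically equivalent strings differ at the byte level, so patterns written for one form will not match the other, and byte indices into one are not valid for the other. The snippet below illustrates this with the standard library only; the constants and `char_offsets` are illustrative names, not part of the crate.

```rust
/// "café" with a precomposed "é" (U+00E9), as NFC would produce it.
const COMPOSED: &str = "caf\u{e9}";
/// "café" with "e" followed by a combining acute accent (U+0301), as NFD would produce it.
const DECOMPOSED: &str = "cafe\u{301}";

/// Byte offset at which each `char` of `s` starts.
fn char_offsets(s: &str) -> Vec<usize> {
    s.char_indices().map(|(i, _)| i).collect()
}
```

The two constants render identically, yet `COMPOSED` is 5 bytes long and `DECOMPOSED` is 6, they do not compare equal, and `char_offsets` returns four offsets for the first and five for the second.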

License

hyphenation © 2016 tapeinosyne, dual-licensed under the terms of either:

  • the Apache License, Version 2.0
  • the MIT license

hyph-utf8 hyphenation patterns © their respective owners; see their master files for licensing information.

patterns/hyph-hu.ext.txt (extended Hungarian hyphenation patterns) is licensed under:

  • MPL 1.1 (refer to patterns/hyph-hu.ext.lic.txt)

patterns/hyph-ca.ext.txt (extended Catalan hyphenation patterns) is licensed under:

  • LGPL v.3.0 or higher (refer to patterns/hyph-ca.ext.lic.txt)