Aloso / Lexical Sort

Sort Unicode strings lexicographically

lexical-sort

This is a library to compare and sort strings (or file paths) lexicographically. This means that non-ASCII characters such as á or ß are treated like their closest ASCII character: á is treated as a, ß is treated as ss, etc.

Lexical comparisons are case-insensitive. Alphanumeric characters are sorted after all other characters (punctuation, whitespace, special characters, emojis, ...).
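The two rules above (fold to lowercase ASCII, then rank non-alphanumeric characters first) can be illustrated with a minimal sketch that uses only the standard library. This is not the crate's implementation: `fold_sketch`, `lexical_cmp_sketch` and the three-entry mapping are hypothetical stand-ins, and unlike the crate this version allocates a `String` per comparison.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: lowercases `s` and maps a few non-ASCII characters
/// to ASCII. The mapping is a tiny illustrative subset, not a real table.
fn fold_sketch(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars().flat_map(char::to_lowercase) {
        match c {
            '\u{e1}' => out.push('a'),      // á is treated as a
            '\u{e9}' => out.push('e'),      // é is treated as e
            '\u{df}' => out.push_str("ss"), // ß is treated as ss
            other => out.push(other),
        }
    }
    out
}

/// Compares the folded strings character by character. The key is a
/// `(bool, char)` pair and `false < true`, so non-alphanumeric characters
/// sort before letters and digits.
fn lexical_cmp_sketch(a: &str, b: &str) -> Ordering {
    let (fa, fb) = (fold_sketch(a), fold_sketch(b));
    let key = |c: char| (c.is_alphanumeric(), c);
    fa.chars().map(key).cmp(fb.chars().map(key))
}
```

With this sketch, `lexical_cmp_sketch("~", "a")` is `Less`, whereas a plain code point comparison places `~` after `a`.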

It is possible to enable natural sorting, which also handles ASCII numbers. For example, 50 is less than 100 with natural sorting turned on. It's also possible to skip characters that aren't alphanumeric, so e.g. f-5 is next to f5.
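To make the idea of natural sorting concrete, here is a small standard-library sketch that compares runs of ASCII digits by numeric value and everything else by code point. The function names are hypothetical and the code is not taken from the crate. Note two simplifications: leading zeros are ignored (so `007` and `7` compare as equal), and the variant that skips non-alphanumeric characters allocates a filtered copy of each string.

```rust
use std::cmp::Ordering;

/// Splits off the leading run of ASCII digits, returning (digits, rest).
fn split_digits(s: &str) -> (&str, &str) {
    let end = s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len());
    s.split_at(end)
}

/// Hypothetical natural comparison: digit runs are compared numerically,
/// all other characters by code point.
fn natural_cmp_sketch(mut a: &str, mut b: &str) -> Ordering {
    loop {
        let (da, ra) = split_digits(a);
        let (db, rb) = split_digits(b);
        if !da.is_empty() && !db.is_empty() {
            // Without leading zeros, the longer digit run is the larger
            // number; runs of equal length compare like plain strings.
            let (ta, tb) = (da.trim_start_matches('0'), db.trim_start_matches('0'));
            let ord = ta.len().cmp(&tb.len()).then_with(|| ta.cmp(tb));
            if ord != Ordering::Equal {
                return ord;
            }
            a = ra;
            b = rb;
            continue;
        }
        // At least one side does not start with a digit: compare one char.
        let (mut ca, mut cb) = (a.chars(), b.chars());
        match (ca.next(), cb.next()) {
            (None, None) => return Ordering::Equal,
            (None, Some(_)) => return Ordering::Less,
            (Some(_), None) => return Ordering::Greater,
            (Some(x), Some(y)) if x != y => return x.cmp(&y),
            _ => {
                a = ca.as_str();
                b = cb.as_str();
            }
        }
    }
}

/// Hypothetical variant that drops non-alphanumeric characters first.
fn natural_only_alnum_cmp_sketch(a: &str, b: &str) -> Ordering {
    let strip = |s: &str| s.chars().filter(|c| c.is_alphanumeric()).collect::<String>();
    natural_cmp_sketch(&strip(a), &strip(b))
}
```

Here `natural_cmp_sketch("50", "100")` is `Less`, while the standard `"50".cmp("100")` is `Greater` because it compares `5` with `1`.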

If different strings have the same ASCII representation (e.g. "Foo" and "fóò"), the comparison falls back to the default string comparison from the standard library, so sorting is deterministic.
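One way to picture this fallback is `Ordering::then_with`: compare the folded forms first, and only if they tie, compare the original strings. The sketch below assumes a hypothetical `ascii_key_sketch` helper that handles just two accented letters; it is an illustration, not the crate's code.

```rust
use std::cmp::Ordering;

/// Hypothetical folding step: lowercases and strips two accents.
fn ascii_key_sketch(s: &str) -> String {
    s.chars()
        .flat_map(char::to_lowercase)
        .map(|c| match c {
            '\u{f3}' | '\u{f2}' => 'o', // ó and ò are treated as o
            other => other,
        })
        .collect()
}

/// Compares the folded strings, then breaks ties with the standard
/// code point comparison of the original strings.
fn cmp_with_fallback(a: &str, b: &str) -> Ordering {
    ascii_key_sketch(a)
        .cmp(&ascii_key_sketch(b))
        .then_with(|| a.cmp(b))
}
```

Because the tie-break only returns `Equal` for identical strings, "Foo" and "fóò" always end up in the same relative order regardless of their input order.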

NOTE: This crate doesn't attempt to be correct for every locale, but it should work reasonably well for a wide range of locales, while providing excellent performance.

Usage

To sort strings or paths, you can use the StringSort or PathSort trait:

use lexical_sort::{StringSort, natural_lexical_cmp};

let mut strings = vec!["ß", "é", "100", "hello", "world", "50", ".", "B!"];
strings.string_sort_unstable(natural_lexical_cmp);

assert_eq!(&strings, &[".", "50", "100", "B!", "é", "hello", "ß", "world"]);

There are eight comparison functions:

| Function | lexicographical | natural | skips non-alphanumeric chars |
| --- | --- | --- | --- |
| cmp | | | |
| only_alnum_cmp | | | yes |
| lexical_cmp | yes | | |
| lexical_only_alnum_cmp | yes | | yes |
| natural_cmp | | yes | |
| natural_only_alnum_cmp | | yes | yes |
| natural_lexical_cmp | yes | yes | |
| natural_lexical_only_alnum_cmp | yes | yes | yes |

Note that only the functions that sort lexicographically are case-insensitive.

Characteristics

All comparison functions constitute a total order. Two strings are only considered equal if they consist of exactly the same Unicode code points.

Performance

The algorithm uses iterators and never allocates memory on the heap. It is optimized for strings that consist mostly of ASCII characters; for ASCII-only strings, the lexicographical comparison functions are only 2 to 3 times as slow as the default method from std, which just compares Unicode code points.

Note that comparisons are slower for strings where many characters at the start are the same (after transliterating them to lowercase ASCII).
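The following sketch shows, under simplifying assumptions, how a comparison can be built from lazy iterators so that no intermediate `String` is created. It only lowercases (no transliteration), and the names `lower`, `caseless_cmp_sketch` and `common_prefix_len` are hypothetical. The second function counts how many folded characters two strings share at the start, which is the part a comparison has to walk through before it can decide.

```rust
use std::cmp::Ordering;

/// Lazily yields the lowercase characters of `s` without allocating.
fn lower(s: &str) -> impl Iterator<Item = char> + '_ {
    s.chars().flat_map(char::to_lowercase)
}

/// Case-insensitive comparison driven entirely by iterators.
/// `Iterator::cmp` stops at the first differing character.
fn caseless_cmp_sketch(a: &str, b: &str) -> Ordering {
    lower(a).cmp(lower(b))
}

/// Number of leading characters that are equal after lowercasing.
/// The longer this prefix, the more work a single comparison does.
fn common_prefix_len(a: &str, b: &str) -> usize {
    lower(a)
        .zip(lower(b))
        .take_while(|(x, y)| x == y)
        .count()
}
```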

Benchmarks

These benchmarks were executed on an AMD A8-7600 Radeon R7 CPU with four cores at 3.1 GHz.

  • The first benchmark compares 100 randomly generated strings with 5 to 20 characters, containing both ASCII and non-ASCII characters.
  • The second benchmark also compares 100 randomly generated strings with 5 to 20 characters, but they're ASCII-only.
  • The last benchmark compares 100 randomly generated strings, each consisting of "T-" followed by 1 to 8 decimal digits. This is a stress test for natural sorting.

The first, grey bar is from the rust_icu crate, which provides bindings to the ICU C library. This performs "proper" collation for English. It is faster in many cases, because it generates sort keys up front to reduce the total amount of work.
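The standard library offers a rough analogue of computing keys up front: `slice::sort_by_cached_key` evaluates the key function once per element and then sorts by the stored keys. The sketch below uses a lowercased `String` as a stand-in key. That is an assumption made for illustration only; it is not an ICU collation key, and the approach allocates, in contrast to the comparison functions in this crate.

```rust
/// Sorts case-insensitively by computing each key exactly once.
/// `sort_by_cached_key` stores the keys in a temporary buffer, trading
/// memory for fewer key computations during the sort.
fn sort_with_cached_keys(strings: &mut [&str]) {
    strings.sort_by_cached_key(|s| s.to_lowercase());
}
```

This trade-off tends to pay off when the key is expensive to compute and the slice is large, since each element is transformed once instead of once per comparison.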

The last, dark blue bar is the string comparison function in the standard library.

Diagrams

no_std support

This crate supports no_std environments. Note that you have to disable default features to compile without the standard library.

This crate currently doesn't require an allocator, although this is likely going to change in the future.

Contributing

Contributions, bug reports and feature requests are welcome!

If support for certain characters is missing, you can contribute them to the any_ascii crate.

License

This project is dual-licensed under the MIT and Apache 2.0 licenses. Use whichever you prefer.
