
fcevado / unidecode

Licence: other
Elixir package to transliterate Unicode to ASCII

Programming Languages

elixir

Projects that are alternatives of or similar to unidecode

Slug Generator
Slug Generator Library for PHP, based on Unicode’s CLDR data
Stars: ✭ 740 (+4011.11%)
Mutual labels:  unicode, ascii, transliteration
Transliteration
UTF-8 to ASCII transliteration / slugify module for node.js, browser, Web Worker, React Native, Electron and CLI.
Stars: ✭ 444 (+2366.67%)
Mutual labels:  unicode, ascii, transliteration
Urlify
A fast PHP slug generator and transliteration library that converts non-ascii characters for use in URLs.
Stars: ✭ 633 (+3416.67%)
Mutual labels:  unicode, ascii, transliteration
Weird Json
A collection of strange encoded JSONs. For connoisseurs.
Stars: ✭ 53 (+194.44%)
Mutual labels:  unicode, ascii
Lehar
Visualize data using relative ordering
Stars: ✭ 81 (+350%)
Mutual labels:  unicode, ascii
Diagon
Interactive ASCII art diagram generators. 🌟
Stars: ✭ 189 (+950%)
Mutual labels:  unicode, ascii
Lexical Sort
Sort Unicode strings lexicographically
Stars: ✭ 23 (+27.78%)
Mutual labels:  unicode, transliteration
Transliterate
Convert Unicode characters to Latin characters using transliteration
Stars: ✭ 152 (+744.44%)
Mutual labels:  unicode, transliteration
Unibits
Visualize different Unicode encodings in the terminal
Stars: ✭ 125 (+594.44%)
Mutual labels:  unicode, ascii
homoglyphs
Homoglyphs: get similar letters, convert to ASCII, detect possible languages and UTF-8 group.
Stars: ✭ 70 (+288.89%)
Mutual labels:  unicode, ascii
table2ascii
Python library for converting lists to fancy ASCII tables for displaying in the terminal and on Discord
Stars: ✭ 31 (+72.22%)
Mutual labels:  unicode, ascii
characteristics
Character info under different encodings
Stars: ✭ 25 (+38.89%)
Mutual labels:  unicode, ascii
attic
A collection of personal tiny tools - mirror of https://gitlab.com/hydrargyrum/attic
Stars: ✭ 17 (-5.56%)
Mutual labels:  unicode, ascii
Crx Jtrans
jTransliter - the roman to unicode transliter as Google chrome extension
Stars: ✭ 13 (-27.78%)
Mutual labels:  unicode, transliteration
Portable Utf8
🉑 Portable UTF-8 library - performance optimized (unicode) string functions for php.
Stars: ✭ 405 (+2150%)
Mutual labels:  unicode, ascii
Cowsay Files
A collection of additional/alternative cowsay files.
Stars: ✭ 216 (+1100%)
Mutual labels:  unicode, ascii
unihandecode
unihandecode is a transliteration library that converts all characters/words in Unicode into the ASCII alphabet and is aware of language preference priorities
Stars: ✭ 71 (+294.44%)
Mutual labels:  unicode, transliteration
durdraw
Animated Unicode, ANSI and ASCII Art Editor for Linux/Unix/macOS
Stars: ✭ 55 (+205.56%)
Mutual labels:  unicode, ascii
memcached-php
Memcached client library in plain vanilla PHP.
Stars: ✭ 28 (+55.56%)
Mutual labels:  ascii
transliterasijawa
Javanese Transliteration (Nulisa Aksara Jawa)
Stars: ✭ 55 (+205.56%)
Mutual labels:  transliteration

Unidecode

An Elixir implementation of Text::Unidecode, a Perl module that transliterates Unicode characters to US-ASCII.

It doesn't change the encoding: like every string in Elixir, the results are still UTF-8/Unicode characters. But they are easy to convert to ASCII. Say you have the word código, which is the Portuguese word for code, and you try to convert it to a charlist.

iex> to_charlist("código")
[99, 243, 100, 105, 103, 111]

Unidecode is meant to make this kind of operation give you better results.

iex> "código" |> Unidecode.decode |> to_charlist
'codigo'

These aren't the exact characters, but the result is readable and intelligible to anyone who speaks Portuguese.
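To make the point about encoding concrete, here is a minimal sketch using only the Elixir standard library. It does not call Unidecode: the string "codigo" stands in for the decoded result shown above, and the module name AsciiCheck is hypothetical.

```elixir
defmodule AsciiCheck do
  # Hypothetical helper: true when every byte of the binary is 7-bit ASCII.
  def ascii?(string) when is_binary(string) do
    string
    |> :binary.bin_to_list()
    |> Enum.all?(&(&1 < 128))
  end
end

original = "código"
decoded = "codigo" # stand-in for the decoded result shown above

# Both are ordinary UTF-8 binaries; only their contents differ.
IO.inspect(String.valid?(original))
IO.inspect(String.valid?(decoded))

# "ó" takes two bytes in UTF-8, so the original is 7 bytes and the decoded form is 6.
IO.inspect(byte_size(original))
IO.inspect(byte_size(decoded))

# Only the decoded form consists entirely of ASCII bytes.
IO.inspect(AsciiCheck.ascii?(original))
IO.inspect(AsciiCheck.ascii?(decoded))
```

Because the decoded string contains only ASCII bytes, its charlist and its byte list are the same sequence of integers.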

Design Philosophy (taken from the original Unidecode Perl library)

Unidecode's ability to transliterate from a given language is limited by two factors:

  • The amount and quality of data in the written form of the original language. So if you have Hebrew data that has no vowel points in it, then Unidecode cannot guess what vowels should appear in a pronunciation. S f y hv n vwls n th npt, y wn't gt ny vwls n th tpt. (This is a specific application of the general principle of "Garbage In, Garbage Out".)

  • Basic limitations in the Unidecode design. Writing a real and clever transliteration algorithm for any single language usually requires a lot of time, and at least a passable knowledge of the language involved. But Unicode text can convey more languages than I could possibly learn (much less create a transliterator for) in the entire rest of my lifetime. So I put a cap on how intelligent Unidecode could be, by insisting that it support only context-insensitive transliteration. That means missing the finer details of any given writing system, while still hopefully being useful.
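As an illustration of what "context-insensitive" means, here is a minimal sketch in which every codepoint is looked up on its own, without regard to its neighbours. This is not the library's code or data: the module name TableSketch, the six-entry table, and the choice of "?" for unmapped characters are all assumptions made for the example.

```elixir
defmodule TableSketch do
  # Hypothetical, deliberately tiny table. A real table covers far more characters.
  @table %{
    "ó" => "o",
    "é" => "e",
    "ü" => "u",
    "ß" => "ss",
    "д" => "d",
    "а" => "a"
  }

  # Context-insensitive: each codepoint is translated independently,
  # so the same input character always yields the same output.
  def decode(string) when is_binary(string) do
    string
    |> String.normalize(:nfc)
    |> String.codepoints()
    |> Enum.map_join(&translate/1)
  end

  # ASCII characters pass through unchanged.
  defp translate(<<c>> = char) when c < 128, do: char
  # Anything else comes from the table; "?" is an arbitrary fallback for this sketch.
  defp translate(char), do: Map.get(@table, char, "?")
end

IO.puts(TableSketch.decode("código"))
IO.puts(TableSketch.decode("да"))
```

A lookup like this cannot pick a different output for a character depending on the word or language it appears in, which is the trade-off the design accepts.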

Unidecode, in other words, is quick and dirty. Sometimes the output is not so dirty at all: Russian and Greek seem to work passably; and while Thaana (Divehi, AKA Maldivian) is a definitely non-Western writing system, setting up a mapping from it to Roman letters seems to work pretty well. But sometimes the output is very dirty: Unidecode does quite badly on Japanese and Thai.

If you want a smarter transliteration for a particular language than Unidecode provides, then you should look for (or write) a transliteration algorithm specific to that language, and apply it instead of (or at least before) applying Unidecode.

In other words, Unidecode's approach is broad (knowing about dozens of writing systems), but shallow (not being meticulous about any of them).
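The advice to apply a language-specific step before the generic one can be sketched as a small pipeline. The module name GermanFirst and its rule table are hypothetical, and the generic step is passed in as a function so the sketch runs without any dependency. In a project that depends on unidecode, passing `&Unidecode.decode/1` would be the natural choice, assuming the one-argument decode function used in the example above.

```elixir
defmodule GermanFirst do
  # Hypothetical language-specific rules: the common German convention
  # of writing umlauts as a vowel followed by "e".
  @german %{
    "ä" => "ae", "ö" => "oe", "ü" => "ue",
    "Ä" => "Ae", "Ö" => "Oe", "Ü" => "Ue",
    "ß" => "ss"
  }

  # Language-specific pass: rewrite only the characters we have rules for.
  def pre_pass(string) when is_binary(string) do
    Enum.reduce(@german, string, fn {from, to}, acc ->
      String.replace(acc, from, to)
    end)
  end

  # Run the specific rules first, then hand the rest to a generic
  # transliterator supplied by the caller (any string -> string function).
  def transliterate(string, generic_decode) when is_function(generic_decode, 1) do
    string
    |> pre_pass()
    |> generic_decode.()
  end
end

# The identity function stands in for the generic step here.
IO.puts(GermanFirst.transliterate("Müller", fn s -> s end))
```

Running the specific pass first matters: once a generic pass has mapped "ü" to a plain ASCII letter, the information needed to produce "ue" is gone.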

Installation

Add unidecode to your dependencies:

def deps do
  [{:unidecode, "~> 1.0.0"}]
end

Changelog

Code of Conduct

License

Unidecode is released under the Apache License 2.0.
