krlmlr / enc

Licence: other
A simple class for storing UTF-8 strings

Projects that are alternatives of or similar to enc

Awesome Unicode
😂 👌 A curated list of delightful Unicode tidbits, packages and resources.
Stars: ✭ 693 (+5675%)
Mutual labels:  utf-8
Stringz
💯 Super fast unicode-aware string manipulation Javascript library
Stars: ✭ 181 (+1408.33%)
Mutual labels:  utf-8
ocreval
Update of the ISRI Analytic Tools for OCR Evaluation with UTF-8 support
Stars: ✭ 48 (+300%)
Mutual labels:  utf-8
Unicopy
Unicode command-line codepoint dumper
Stars: ✭ 16 (+33.33%)
Mutual labels:  utf-8
Voca rs
Voca_rs is the ultimate Rust string library inspired by Voca.js, string.py and Inflector, implemented as independent functions and on Foreign Types (String and str).
Stars: ✭ 167 (+1291.67%)
Mutual labels:  utf-8
Str
A fast, solid and strong typed string manipulation library with multibyte support
Stars: ✭ 199 (+1558.33%)
Mutual labels:  utf-8
Kibi
A text editor in ≤1024 lines of code, written in Rust
Stars: ✭ 522 (+4250%)
Mutual labels:  utf-8
S51 UTF 8 FontLibrary
UTF-8 font dot matrix data is saved through external FLASH
Stars: ✭ 30 (+150%)
Mutual labels:  utf-8
Encoding
Encoding Standard
Stars: ✭ 176 (+1366.67%)
Mutual labels:  utf-8
jurl
Fast and simple URL parsing for Java, with UTF-8 and path resolving support
Stars: ✭ 84 (+600%)
Mutual labels:  utf-8
Utf 8 Validate
Check if a buffer contains valid UTF-8
Stars: ✭ 78 (+550%)
Mutual labels:  utf-8
Unibits
Visualize different Unicode encodings in the terminal
Stars: ✭ 125 (+941.67%)
Mutual labels:  utf-8
Stringy
A PHP string manipulation library with multibyte support
Stars: ✭ 2,461 (+20408.33%)
Mutual labels:  utf-8
Imguicolortextedit
Colorizing text editor for ImGui
Stars: ✭ 772 (+6333.33%)
Mutual labels:  utf-8
gonvert
Golang character encoding converter with an automatic code-estimation.
Stars: ✭ 24 (+100%)
Mutual labels:  character-encoding
Tvision
A modern port of Turbo Vision 2.0, the classical framework for text-based user interfaces. Now cross-platform and with Unicode support.
Stars: ✭ 612 (+5000%)
Mutual labels:  utf-8
Netlink
Socket and Networking Library using msgpack.org[C++11]
Stars: ✭ 197 (+1541.67%)
Mutual labels:  utf-8
utf8-validator
UTF-8 Validator
Stars: ✭ 18 (+50%)
Mutual labels:  utf-8
utf utils
My work on high-speed conversion of UTF-8 to UTF-32/UTF-16
Stars: ✭ 45 (+275%)
Mutual labels:  utf-8
ShellAnything
ShellAnything is a C++ open-source software which allow one to easily customize and add new options to *Windows Explorer* context menu. Define specific actions when a user right-click on a file or a directory.
Stars: ✭ 103 (+758.33%)
Mutual labels:  utf-8

enc

Lifecycle: experimental

Portable tools for UTF-8 character data

R and character encoding

The character encoding of a text determines how its letters, digits, and other codepoints (the atomic components of the text) are translated into a sequence of bytes. A byte sequence may translate into valid text in one character encoding, but give nonsense in others.
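As a minimal sketch of this point, using base R only (none of this is part of enc, and the variable names are illustrative): the same character produces different bytes in UTF-8 and latin1, and UTF-8 bytes that are declared to be latin1 turn into unrelated characters.

```r
# Start from the letter "ä" (U+00E4); enc2utf8() makes sure it is stored as UTF-8.
x <- enc2utf8("\u00e4")

# The same character maps to different byte sequences in different encodings.
utf8_bytes   <- charToRaw(x)                                        # c3 a4
latin1_bytes <- charToRaw(iconv(x, from = "UTF-8", to = "latin1"))  # e4

# Declaring the UTF-8 bytes to be latin1 yields two unrelated characters.
misread <- rawToChar(utf8_bytes)
Encoding(misread) <- "latin1"
garbled <- enc2utf8(misread)  # U+00C3 U+00A4, displayed as "Ã¤"
```

The bytes never change in the last step; only the declared encoding does, which is why the damage is easy to introduce and hard to spot.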

For historic reasons, R can store strings in different ways:

  1. in the “native” encoding, the default encoding of the operating system
  2. in UTF-8, the most prevalent and versatile encoding nowadays
  3. in “latin1”, a popular encoding in Western Europe
  4. as “bytes”, leaving the interpretation to the user

On OS X and Linux, the “native” encoding is often UTF-8, but on Windows it is not. To add to the confusion, the encoding is a property of individual strings in a character vector, and not of the entire vector.
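The per-string nature of the encoding can be seen with base R's Encoding(). The following is a sketch with illustrative variable names, not part of enc: it builds one character vector whose elements carry four different encoding marks.

```r
ascii  <- "abc"                                      # pure ASCII is never marked
utf8   <- enc2utf8("\u00e4")                         # marked as UTF-8
latin1 <- iconv(utf8, from = "UTF-8", to = "latin1") # marked as latin1
bytes  <- utf8
Encoding(bytes) <- "bytes"                           # interpretation left to the user

# Combining them keeps each element's own mark.
x <- c(ascii, utf8, latin1, bytes)
Encoding(x)  # "unknown" "UTF-8" "latin1" "bytes"
```

Note that Encoding() reports "unknown" both for ASCII strings and for non-ASCII strings in the native encoding, so "unknown" alone does not say which bytes are stored.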

Why UTF-8?

When working with text, it is advisable to use UTF-8, because it can encode virtually any text, including text in languages whose symbols cannot be represented in your system’s native encoding. The UTF-8 encoding has several nice technical properties and is by far the predominant encoding on the Web. Standardizing on a “universal” encoding facilitates data exchange.

Because of R’s special handling of strings, some care must be taken to make sure that you’re actually using the UTF-8 encoding. Many functions in R will hide encoding issues from you, and transparently convert to UTF-8 as necessary. However, some functions (such as reading and writing files) will stubbornly prefer the native encoding.
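For file I/O, one way to sidestep the native encoding with base R alone is sketched below. It uses only base functions and does not show enc's own helpers; the file path and variable names are illustrative. The idea is to write the UTF-8 bytes unchanged over a binary connection and to declare the encoding explicitly when reading back.

```r
path <- tempfile(fileext = ".txt")
text <- enc2utf8(c("a", "\u00e4"))

# Binary mode keeps "\n" as LF on every platform; useBytes = TRUE writes the
# stored UTF-8 bytes as they are instead of re-encoding to the native encoding.
con <- file(path, open = "wb")
writeLines(text, con, sep = "\n", useBytes = TRUE)
close(con)

# Tell readLines() that the file is UTF-8 so non-ASCII lines are marked as such.
back <- readLines(path, encoding = "UTF-8")
Encoding(back)  # "unknown" "UTF-8"

# Inspect the raw file content: 61 0a c3 a4 0a
file_bytes <- readBin(path, what = "raw", n = 16)
```

Without useBytes = TRUE and the explicit encoding argument, the result depends on the platform's native encoding, which is the pitfall described above.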

The enc package provides helpers for converting all textual components of an object to UTF-8, and for reading and writing files in UTF-8 (with a LF end-of-line terminator by default). It also defines an S3 class for tagging all-UTF-8 character vectors and ensuring that updates maintain the UTF-8 encoding. Examples for other packages that use UTF-8 by default are:

Example

library(enc)
utf8(c("a", "ä"))
#> [1] "a" "ä"
as_utf8(1)
#> [1] "1"

a <- utf8("ä")
a[2] <- "ö"
class(a)
#> [1] "utf8"

data.frame(abc = letters[1:3], utf8 = utf8(letters[1:3]))
#>   abc utf8
#> 1   a    a
#> 2   b    b
#> 3   c    c

Install the package from GitHub:

# install.packages("devtools")
devtools::install_github("krlmlr/enc")