
rick-de-water / Lingo

Licence: MIT license
Text encoding for modern C++

Lingo

Lingo is an encoding aware string library for C++11 and up. It aims to be a drop in replacement for the standard library strings by defining new string classes that mirror the standard library as much as possible, while also extending them with new functionality made possible by its encoding and code page aware design.

Features

  • Encoding and code page aware lingo::string and lingo::string_view classes, almost fully compatible with std::string and std::string_view.
  • Conversion constructors between lingo::strings of different encodings and code pages.
  • lingo::encoding::* for low level encoding and decoding of code points.
  • lingo::page::* for additional code point information and conversion between different code pages.
  • lingo::error::* for different error handling behaviours.
  • lingo::encoding::point_iterator and lingo::page::point_mapper helpers to manually iterate or convert points individually.
  • lingo::string_converter to manually convert entire strings.
  • Null terminator aware lingo::string_view.
  • lingo::make_null_terminated helper function for APIs that only support C strings, which ensures that a string is null terminated with minimal copying.
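
To illustrate the idea behind the last two items, here is a minimal sketch of a view that remembers whether its data is null terminated, so that a C string can be produced without copying whenever possible. The names view_sketch and c_str_of are hypothetical and are not Lingo's API; the real lingo::string_view and lingo::make_null_terminated may work differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical view type: a pointer, a length, and a flag that records
// whether data[size] is known to be '\0'.
struct view_sketch
{
    const char* data;
    std::size_t size;
    bool null_terminated;
};

// Returns a C string for the view. The data is copied into `storage`
// only when the view is not already null terminated.
const char* c_str_of(const view_sketch& view, std::string& storage)
{
    if (view.null_terminated)
        return view.data;

    storage.assign(view.data, view.size);
    return storage.c_str();
}

// Usage: a view of the whole literal needs no copy, a prefix does.
inline void example_null_termination()
{
    const char* text = "hello world";
    std::string storage;

    const view_sketch whole = { text, 11, true };
    assert(c_str_of(whole, storage) == text);

    const view_sketch prefix = { text, 5, false };
    assert(std::strcmp(c_str_of(prefix, storage), "hello") == 0);
}
```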

How it works

The string class in the C++ standard library is defined like this:

namespace std
{
    template <class CharT, class Traits, class Allocator>
    class basic_string;
}

CharT is the code point type, and Traits contains all operations to work with the code units. This setup works fine for simple ASCII strings, but runs into problems when working with more complicated encodings.

  • It assumes that every CharT is a code point, while in reality most strings use some kind of multibyte encoding. In variable-length encodings such as UTF-8 and UTF-16, a single code point can span several CharT values, which makes them difficult to work with.
  • It has no information about the code page used. A char could hold ASCII, UTF-8, ISO 8859-1, or anything else. And while the standard is adding char8_t, char16_t and char32_t for Unicode, the library only knows that the data is some form of Unicode; it has no idea how to actually encode, decode or transform it.
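
The first limitation is easy to reproduce with plain std::string. The sketch below is not part of Lingo: count_code_points is a hypothetical helper that assumes well-formed UTF-8 and simply skips continuation bytes, without any validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts UTF-8 code points by counting every byte that is not a
// continuation byte (10xxxxxx). Assumes the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char unit : utf8)
    {
        if ((unit & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// Usage: "café" encoded as UTF-8 is 5 code units but only 4 code points,
// and std::string::size() reports the code units.
inline void example_code_units_vs_points()
{
    const std::string cafe = "caf\xC3\xA9";
    assert(cafe.size() == 5);
    assert(count_code_points(cafe) == 4);
}
```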

To solve this problem, Lingo defines a new string type:

namespace lingo
{
    template <typename Encoding, typename Page, typename Allocator>
    class basic_string;
}

Lingo splits the responsibility of managing the code points of a string between an Encoding type and a Page type. The Encoding type defines how a code point can be encoded to and decoded from one or more code units. The Page type defines what every decoded code point actually means, and knows how to convert it to other Pages.
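
As a rough illustration of this split, the sketch below converts ISO/IEC 8859-1 text to UTF-8 in three steps: read a code point from the source (with no encoding, every code unit is a code point), map it to Unicode through the page, and encode it with the target encoding. The types utf8_encoding_sketch and latin1_page_sketch and the function latin1_to_utf8 are hypothetical stand-ins written for this example; they are not Lingo's actual interfaces.

```cpp
#include <cassert>
#include <string>

// "Encoding": how a code point becomes one or more code units.
// This sketch does not reject surrogates or values above U+10FFFF.
struct utf8_encoding_sketch
{
    static void encode(char32_t point, std::string& out)
    {
        if (point < 0x80)
        {
            out += static_cast<char>(point);
        }
        else if (point < 0x800)
        {
            out += static_cast<char>(0xC0 | (point >> 6));
            out += static_cast<char>(0x80 | (point & 0x3F));
        }
        else if (point < 0x10000)
        {
            out += static_cast<char>(0xE0 | (point >> 12));
            out += static_cast<char>(0x80 | ((point >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (point & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (point >> 18));
            out += static_cast<char>(0x80 | ((point >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((point >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (point & 0x3F));
        }
    }
};

// "Page": what a decoded code point means, and how to map it elsewhere.
// ISO/IEC 8859-1 code points coincide with U+0000..U+00FF.
struct latin1_page_sketch
{
    static char32_t to_unicode(unsigned char point) { return point; }
};

// Converts ISO/IEC 8859-1 bytes to UTF-8 by combining page and encoding.
std::string latin1_to_utf8(const std::string& latin1)
{
    std::string out;
    for (unsigned char unit : latin1)
    {
        const char32_t unicode = latin1_page_sketch::to_unicode(unit);
        utf8_encoding_sketch::encode(unicode, out);
    }
    return out;
}

// Usage: 0xE9 ("é" in ISO/IEC 8859-1) becomes the two bytes 0xC3 0xA9.
inline void example_page_and_encoding()
{
    assert(latin1_to_utf8("caf\xE9") == "caf\xC3\xA9");
}
```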

Here are some examples of what that actually looks like:

using ascii_string = lingo::basic_string<
    lingo::encoding::none<char, char>,
    lingo::page::ascii>;

using utf8_string = lingo::basic_string<
    lingo::encoding::utf8<char8_t, char32_t>,
    lingo::page::unicode>;

using utf16_string = lingo::basic_string<
    lingo::encoding::utf16<char16_t, char32_t>,
    lingo::page::unicode>;

using utf32_string = lingo::basic_string<
    lingo::encoding::utf32<char32_t, char32_t>,
    lingo::page::unicode>;

using iso_8859_1_string = lingo::basic_string<
    lingo::encoding::none<unsigned char, unsigned char>,
    lingo::page::iso_8859_1>;

You may wonder why there is a lingo::encoding::utf32 encoding, since there is no difference between UTF-32 and decoded Unicode. It is indeed possible to use lingo::encoding::none instead, and still have a fully functional UTF-32 string. However, lingo::encoding::utf32 does add some extra validation, such as detecting surrogate code units, making it better at dealing with invalid inputs.
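
The kind of validation meant here can be sketched as follows. This is only an assumption about what such a check looks like, based on the Unicode rules that surrogates (U+D800 to U+DFFF) and values above U+10FFFF are not valid code points; is_valid_code_point and is_valid_utf32 are hypothetical names, not Lingo functions.

```cpp
#include <cassert>
#include <string>

// A UTF-32 code unit is valid if it is at most U+10FFFF and is not a
// surrogate, since surrogates only have meaning inside UTF-16.
constexpr bool is_valid_code_point(char32_t unit)
{
    return unit <= 0x10FFFF && !(unit >= 0xD800 && unit <= 0xDFFF);
}

// Returns true if every code unit in the string is a valid code point.
bool is_valid_utf32(const std::u32string& units)
{
    for (char32_t unit : units)
    {
        if (!is_valid_code_point(unit))
            return false;
    }
    return true;
}

// Usage: ordinary text passes, a lone surrogate does not.
inline void example_utf32_validation()
{
    assert(is_valid_utf32(U"Lingo"));
    assert(!is_valid_utf32(std::u32string(1, char32_t(0xD800))));
}
```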

Currently implemented

Encodings

  • lingo::encoding::none
  • lingo::encoding::utf8
  • lingo::encoding::utf16
  • lingo::encoding::utf32
  • lingo::encoding::base64

Meta encodings

  • lingo::encoding::swap_endian: Swaps the endianness of the code units.
  • lingo::encoding::join: Chains multiple encodings together (e.g. join<swap_endian, utf16> to create utf16_be).
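
For readers unfamiliar with the operation, swapping the endianness of 16-bit code units amounts to exchanging the two bytes of each unit. The sketch below shows that step on its own; swap_endian16 and swap_endian are hypothetical free functions and say nothing about how lingo::encoding::swap_endian is implemented.

```cpp
#include <cassert>
#include <string>

// Exchanges the high and low byte of a single 16-bit code unit.
constexpr char16_t swap_endian16(char16_t unit)
{
    return static_cast<char16_t>(((unit & 0x00FF) << 8) | ((unit & 0xFF00) >> 8));
}

// Applies the swap to every code unit of a UTF-16 string.
std::u16string swap_endian(std::u16string units)
{
    for (char16_t& unit : units)
        unit = swap_endian16(unit);
    return units;
}

// Usage: swapping twice restores the original code units.
inline void example_swap_endian()
{
    assert(swap_endian16(0x0041) == 0x4100);
    assert(swap_endian(swap_endian(u"Lingo")) == u"Lingo");
}
```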

Code pages

  • lingo::page::ascii
  • lingo::page::unicode
  • lingo::page::iso_8859_n with n = [1, 16] except 12.

Error handlers

  • lingo::error::strict: Throws an exception on error.

Algorithms

Will be added in a future version.

How to build

Lingo is a header only library, but some of the header files do have to be generated first. You can check the latest releases for a package that has all headers generated for you.

If you want to build the library yourself, you will need to build the CMake project. All you need is CMake 3.12 or higher, Python 3 (for the code generation) and a C++11 compatible compiler. The tests are written using Catch and can be run with ctest.

How to include in your project

Since Lingo is a header only library, all you need to do is copy the header files and add their location as an include directory.

There is one thing that you do need to look out for, which is the execution character set. This library assumes by default that char is UTF-8, and that wchar_t is UTF-16 or UTF-32, depending on the size of wchar_t.

This matches the default settings of GCC and Clang, but not of Visual Studio. If your compiler's execution character set does not match the defaults, you have two options:

Configure your compiler

Configure the library

The following macros can be defined to override the default encodings for char and wchar_t:

  • LINGO_CHAR_ENCODING
  • LINGO_WCHAR_ENCODING
  • LINGO_CHAR_PAGE
  • LINGO_WCHAR_PAGE

So for example, if you want to use ISO/IEC 8859-1 for chars, you will have to define the following macros:

  • -DLINGO_CHAR_ENCODING=none
  • -DLINGO_CHAR_PAGE=iso_8859_1

This method is not recommended. Compiler flags are a much more reliable way to set the correct execution encoding.
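
Whichever option you choose, it can be useful to verify what your toolchain actually does. The following sketch is not part of Lingo; the function names are hypothetical. It checks whether narrow string literals are stored as UTF-8 and which of the two wchar_t sizes mentioned above applies, using only standard C++11.

```cpp
#include <cassert>

// True if narrow string literals are stored as UTF-8: U+00E9 must occupy
// the two bytes 0xC3 0xA9, followed by the null terminator.
constexpr bool narrow_literals_are_utf8()
{
    return sizeof("\u00E9") == 3
        && static_cast<unsigned char>("\u00E9"[0]) == 0xC3
        && static_cast<unsigned char>("\u00E9"[1]) == 0xA9;
}

// wchar_t is treated as UTF-16 when it is 2 bytes wide and as UTF-32
// when it is 4 bytes wide (assuming 8-bit bytes).
constexpr bool wchar_is_utf16_sized() { return sizeof(wchar_t) == 2; }
constexpr bool wchar_is_utf32_sized() { return sizeof(wchar_t) == 4; }

// Aborts in debug builds if the default assumptions do not hold.
inline void check_default_assumptions()
{
    assert(narrow_literals_are_utf8());
    assert(wchar_is_utf16_sized() || wchar_is_utf32_sized());
}
```

If check_default_assumptions fails on your platform, the execution character set differs from the defaults described above and one of the two options applies.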

Other documentation
