DuffsDevice / Tiny Utf8

Licence: other

Unicode (UTF-8) capable std::string

Programming Languages

cplusplus

227 projects

Labels

unicode decoder conversion string encoder string-manipulation utf-8

Projects that are alternatives of or similar to Tiny Utf8

Portable Utf8

🉑 Portable UTF-8 library - performance optimized (unicode) string functions for php.

Stars: ✭ 405 (+25.78%)

Mutual labels: unicode, string, string-manipulation, utf-8

Voca rs

Voca_rs is the ultimate Rust string library inspired by Voca.js, string.py and Inflector, implemented as independent functions and on Foreign Types (String and str).

Stars: ✭ 167 (-48.14%)

Mutual labels: unicode, string, string-manipulation, utf-8

Util

A collection of useful utility functions

Stars: ✭ 201 (-37.58%)

Mutual labels: conversion, string, string-manipulation

Str

A fast, solid and strong typed string manipulation library with multibyte support

Stars: ✭ 199 (-38.2%)

Mutual labels: string, string-manipulation, utf-8

Stringz

💯 Super fast unicode-aware string manipulation Javascript library

Stars: ✭ 181 (-43.79%)

Mutual labels: unicode, string-manipulation, utf-8

Unicopy

Unicode command-line codepoint dumper

Stars: ✭ 16 (-95.03%)

Mutual labels: conversion, unicode, utf-8

sms

A Go library for encoding and decoding SMSs

Stars: ✭ 37 (-88.51%)

Mutual labels: encoder, decoder, conversion

UniObfuscator

Java obfuscator that hides code in comment tags and Unicode garbage by making use of Java's Unicode escapes.

Stars: ✭ 40 (-87.58%)

Mutual labels: unicode, utf-8

safe-string-interpolation

A type driven approach to string interpolation, aiming at consistent, secure, and only-human-readable logs and console outputs !

Stars: ✭ 14 (-95.65%)

Mutual labels: string, string-manipulation

unicode-c

A C library for handling Unicode, UTF-8, surrogate pairs, etc.

Stars: ✭ 32 (-90.06%)

Mutual labels: unicode, utf-8

morse-pro

Library for manipulating Morse code text and sound. Understands prosigns and Farnsworth speed. Can create WAV files and analyse input from the microphone or audio files.

Stars: ✭ 85 (-73.6%)

Mutual labels: encoder, decoder

StringPool

A performant and memory efficient storage for immutable strings with C++17. Supports all standard char types: char, wchar_t, char16_t, char32_t and C++20's char8_t.

Stars: ✭ 19 (-94.1%)

Mutual labels: string, utf-8

aiff

Battle tested aiff decoder/encoder

Stars: ✭ 20 (-93.79%)

Mutual labels: encoder, decoder

android-opus-codec

Implementation of Opus encoder and decoder in C++ for android with JNI

Stars: ✭ 44 (-86.34%)

Mutual labels: encoder, decoder

fadec

A fast and lightweight decoder for x86 and x86-64 and encoder for x86-64.

Stars: ✭ 44 (-86.34%)

Mutual labels: encoder, decoder

libWinTF8

The library handling things related to UTF-8 and Unicode when you want to port your program to Windows

Stars: ✭ 18 (-94.41%)

Mutual labels: unicode, utf-8

pytextcodifier

📦 Turn your text files into codified images or your codified images into text files.

Stars: ✭ 14 (-95.65%)

Mutual labels: encoder, decoder

widestring-rs

A wide string Rust library for converting to and from wide Unicode strings.

Stars: ✭ 48 (-85.09%)

Mutual labels: unicode, string

schifra

C++ Reed Solomon Error Correcting Library https://www.schifra.com

Stars: ✭ 28 (-91.3%)

Mutual labels: encoder, decoder

nim-ustring

utf-8 string for Nim

Stars: ✭ 12 (-96.27%)

Mutual labels: string, utf-8

TINY 4.3

DESCRIPTION

Tiny-utf8 is a library for extremely easy integration of Unicode into an arbitrary C++11 project. The library consists solely of the class utf8_string, which acts as a drop-in replacement for std::string. Its implementation strikes a balance between a small memory footprint and fast access. Every std::string operation is replaced by its codepoint-based (UTF-32) counterpart, and each access is translated to UTF-8 under the hood.

CHANGES BETWEEN Version 4.3 and 4.2

Class tiny_utf8::basic_utf8_string has been renamed to basic_string, which better reflects its role as a drop-in replacement for std::string.

CHANGES BETWEEN Version 4.1 and 4.0

tinyutf8.h has been moved into the folder include/tinyutf8/ in order to mimic the structuring of many other C++-based open source projects.

CHANGES BETWEEN Version 4.0 and 3.2.4

Class utf8_string is now defined inside namespace tiny_utf8. If you want the old declaration in the global namespace, #define TINY_UTF8_GLOBAL_NAMESPACE
Support for C++20: Use class tiny_utf8::u8string, which uses char8_t as underlying data type (instead of char)

FEATURES

Drop-in replacement for std::string
Lightweight and self-contained (~3K SLOC)
Very fast, i.e. highly optimized decoder, encoder and traversal routines
Advanced Memory Layout, i.e. Random Access is
- O(1) for ASCII-only strings (!),
- O(k) for the average case, where k is the number of codepoints > 127, and
- O(n) for strings with a high amount of non-ASCII code points
Small String Optimization (SSO) for strings up to a UTF8-encoded length of sizeof(utf8_string), including the trailing \0
Growth in Constant Time (Amortized)
On-the-fly Conversion between UTF32 and UTF8
Small Stack Size, i.e. sizeof(utf8_string) = 16 Bytes (32Bit) / 32 Bytes (64Bit)
Codepoint Range of 0x0 - 0xFFFFFFFF, i.e. 1-7 Code Units/Bytes per Codepoint (Note: This is more than specified by UTF8, but until now otherwise considered out of scope)
Complete support for embedded zeros (Note: all methods taking const char*/const char32_t* also have an overload for const char (&)[N]/const char32_t (&)[N], allowing correct interpretation of string literals with embedded zeros)
Single Header File
Straightforward C++11 Design
Possibility to prepend the UTF8 BOM (Byte Order Mark) to any string when converting it to an std::string
Supports raw (Byte-based) access for occasions where Speed is needed
Supports shrink_to_fit()
Malformed UTF8 sequences will lead to defined behaviour
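
To make the "1-7 Code Units/Bytes per Codepoint" and "On-the-fly Conversion between UTF32 and UTF8" items more concrete, here is a minimal standalone sketch of a code point encoder. It does not use tiny-utf8 itself. The names encoded_length and encode_codepoint are hypothetical, and the thresholds beyond 4 bytes are an assumption based on the natural extension of the UTF-8 bit pattern, not taken from the library source.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes needed per code point when the UTF-8 bit pattern is extended past
// 4 bytes. The thresholds above 0x1FFFFF are an assumption; standard UTF-8
// stops at U+10FFFF (4 bytes).
std::size_t encoded_length(char32_t cp) {
    if (cp <= 0x7F)       return 1;
    if (cp <= 0x7FF)      return 2;
    if (cp <= 0xFFFF)     return 3;
    if (cp <= 0x1FFFFF)   return 4;
    if (cp <= 0x3FFFFFF)  return 5;
    if (cp <= 0x7FFFFFFF) return 6;
    return 7;
}

// Encodes one code point into 1-7 bytes (hypothetical helper).
std::string encode_codepoint(char32_t cp) {
    const std::size_t n = encoded_length(cp);
    assert(n >= 1 && n <= 7);
    if (n == 1) return std::string(1, static_cast<char>(cp));

    std::string out(n, '\0');
    // Fill the continuation bytes (10xxxxxx) from the back, 6 payload bits each.
    for (std::size_t i = n - 1; i > 0; --i) {
        out[i] = static_cast<char>(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    // Lead byte: n one-bits, a zero bit, then the remaining payload bits.
    const unsigned char lead_marker = static_cast<unsigned char>(0xFF << (8 - n));
    out[0] = static_cast<char>(lead_marker | cp);
    return out;
}
```

For code points up to U+10FFFF this yields ordinary UTF-8; only the 5- to 7-byte forms go beyond what the standard defines.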

THE PURPOSE OF TINY-UTF8

Back when I decided to write a UTF8 solution for C++, I knew I wanted a drop-in replacement for std::string, mostly because I found it neat to have one and felt that C++ had always lacked accessible support for UTF8. Several years have passed since then and the situation has not improved much. That said, things currently look like they are about to improve - but that doesn't say much, does it?

The opinion shared by many "experienced Unicode programmers" (e.g. published on UTF-8 Everywhere) is that "non-experienced" programmers both under- and overestimate the need for Unicode- and encoding-specific treatment. This need is...

overestimated, because many times we really should care less about codepoint/grapheme borders within string data;
underestimated, because if we really want to "support" Unicode, we need to think about normalizations, visual character comparisons, reserved codepoint values, illegal code unit sequences and so on.

Unicode is not rocket science but nonetheless hard to get right. Tiny-utf8 does not intend to be an enterprise solution like ICU for C++. The goal of tiny-utf8 is to

bridge as many gaps to "supporting Unicode" as possible by 'just' replacing std::string with a custom class which means to
provide you with a Codepoint Abstraction Layer that takes care of the variable-length encoding, without you noticing.

Tiny-utf8 aims to be the simple-and-dependable groundwork which you build Unicode infrastructure upon. And, if 1) C++2a should happen to make your Unicode life easier than tiny-utf8 or 2) you decide to go enterprise, you have not wasted much time replacing std::string with utf8_string either. This is what makes tiny-utf8 so agreeable.
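
The gap that a codepoint abstraction layer closes can be illustrated without the library. The sketch below uses only std::string; count_codepoints is a hypothetical helper and assumes well-formed UTF-8 input. It shows that std::string::size() reports bytes, while the number of code points has to be derived by skipping continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by counting every byte that is not a
// continuation byte (10xxxxxx). Assumes well-formed input.
std::size_t count_codepoints(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    // A code point occupies at least one byte, so the count cannot exceed size().
    assert(n <= utf8.size());
    return n;
}
```

For "Hello 🌍!" stored as UTF-8, size() is 11 while count_codepoints returns 8, because the globe occupies four bytes.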

WHAT TINY-UTF8 IS NOT AIMED AT

Conversion between ISO encodings and UTF8
Interfacing with UTF16
Visible character comparison ('ch' vs. 'c'+'h')
Codepoint Normalization
Correction of invalid Code Unit sequences
Detection of Grapheme Clusters
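
Why normalization and visible character comparison are a separate problem can be seen in a few lines of standard C++. This sketch does not use tiny-utf8, and the names precomposed, decomposed and same_codepoints are chosen for illustration only. Both strings render as "é", yet they are different code point sequences, so a comparison at the code point level reports them as unequal.

```cpp
#include <string>

// U+00E9 (precomposed "é") versus U+0065 U+0301 ("e" + combining acute accent).
const std::u32string precomposed = U"\u00E9";
const std::u32string decomposed  = U"e\u0301";

// Plain code point comparison: no normalization is applied.
bool same_codepoints(const std::u32string& a, const std::u32string& b) {
    return a == b;
}
```

Treating the two as equal requires Unicode normalization, which is the domain of libraries such as ICU.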

Note: ANSI support was dropped in Version 2.0 in favor of execution speed.

EXAMPLE USAGE

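#include <iostream>
#include <algorithm>
#include <cstdint>
#include <tinyutf8/tinyutf8.h>
using namespace std;

int main()
{
    // UTF-8 encoded literal; the iteration below works on code points, not bytes
    tiny_utf8::string str = u8"!🌍 olleH";
    // Walk the code points back to front and print each one's numeric value
    // (streaming a char32_t prints a number, not a glyph)
    for_each( str.rbegin() , str.rend() , []( char32_t codepoint ){
      cout << static_cast<uint32_t>( codepoint ) << ' ';
    } );
    return 0;
}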
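
For comparison, this is roughly what code-point-aware reverse traversal involves when done by hand on a plain std::string. It is a standalone sketch, not the library's implementation; reverse_codepoints is a hypothetical helper and assumes well-formed UTF-8 input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Reverses a UTF-8 string code point by code point (assumes well-formed input).
std::string reverse_codepoints(const std::string& utf8) {
    std::string out;
    out.reserve(utf8.size());
    std::size_t end = utf8.size();
    while (end > 0) {
        // Walk back over continuation bytes (10xxxxxx) until the lead byte.
        std::size_t start = end - 1;
        while (start > 0 &&
               (static_cast<unsigned char>(utf8[start]) & 0xC0) == 0x80) {
            --start;
        }
        // Copy the whole code point in its original byte order.
        out.append(utf8, start, end - start);
        end = start;
    }
    // Reversal by code points never changes the byte count.
    assert(out.size() == utf8.size());
    return out;
}
```

Applied to "!🌍 olleH" this gives "Hello 🌍!", whereas reversing the raw bytes would scramble the four bytes of the globe into an invalid sequence.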

EXCEPTIONS

Tiny-utf8 should automatically detect whether your build system allows the use of exceptions. This is done by checking for the feature test macro __cpp_exceptions.
If you would like tiny-utf8 to be noexcept anyway, #define the macro TINY_UTF8_NOEXCEPT.
If you would like tiny-utf8 to use a different exception strategy, #define the macro TINY_UTF8_THROW( location , failing_predicate ). For using assertions, you would write #define TINY_UTF8_THROW( _ , pred ) assert( pred ).
Hint: If exceptions are disabled, TINY_UTF8_THROW( ... ) is automatically defined as void(). This works well, because all uses of TINY_UTF8_THROW are immediately followed by a ; as well as a proper return statement with a fallback value. That also means, TINY_UTF8_THROW can safely be a NO-OP.
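
The pattern described above (a throw macro followed by a return with a fallback value) can be sketched independently of the library. Everything named SKETCH_* and the function checked_at are hypothetical, and the exception type std::out_of_range is an assumption; this is not tiny-utf8's actual definition of TINY_UTF8_THROW.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical macro modeled on the description above.
#if defined(SKETCH_USE_ASSERT)
  // Custom strategy: turn failures into assertions.
  #define SKETCH_THROW(location, pred) assert(pred)
#elif defined(__cpp_exceptions) && !defined(SKETCH_NOEXCEPT)
  // Exceptions are available: throw (exception type is an assumption).
  #define SKETCH_THROW(location, pred) throw std::out_of_range(location)
#else
  // Exceptions are disabled: the macro becomes a no-op expression.
  #define SKETCH_THROW(location, pred) void()
#endif

// Bounds-checked access. Every use of the macro is followed by a `;` and a
// return with a fallback value, so the no-op variant stays well-defined.
char checked_at(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) {
        SKETCH_THROW("checked_at", pos < s.size());
        return '\0'; // reached only if SKETCH_THROW does not leave the function
    }
    return s[pos];
}
```

With exceptions enabled, an out-of-range call throws; with the no-op variant, the same call returns the fallback value instead.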

BACKWARDS-COMPATIBLE BUILD

If you would like to stay compatible with 3.2.* and have utf8_string defined in the global namespace, #define the macro TINY_UTF8_GLOBAL_NAMESPACE.

BUGS

If you encounter any bugs, please file a bug report through the "Issues" tab. I'll try to answer it soon!

THANK YOU

@iainchesworth
@vadim-berman
@MattHarrington
@evanmoran
@bakerstu
@revel8n
@githubuser0xFFFF
@marekfoltyn
@Megaxela
@vfiksdal
@maddouri

for taking your time to improve tiny-utf8.

Cheers, Jakob

Stars: ✭ 322