All Projects → simdutf → simdutf

simdutf / simdutf

Licence: other
Unicode routines (UTF8, UTF16): billions of characters per second.

Programming Languages

C++
36643 projects - #6 most used programming language
c
50402 projects - #5 most used programming language
CWeb
17 projects
python
139335 projects - #7 most used programming language
CMake
9771 projects
Makefile
30231 projects

Projects that are alternatives of or similar to simdutf

simdutf8
SIMD-accelerated UTF-8 validation for Rust.
Stars: ✭ 426 (+294.44%)
Mutual labels:  unicode, neon, simd, avx2
Unisimd Assembler
SIMD macro assembler unified for ARM, MIPS, PPC and x86
Stars: ✭ 63 (-41.67%)
Mutual labels:  neon, simd, avx2
Simde
Implementations of SIMD instruction sets for systems which don't natively support them.
Stars: ✭ 1,012 (+837.04%)
Mutual labels:  neon, simd, avx2
Simd
C++ image processing and machine learning library with using of SIMD: SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, AVX-512, VMX(Altivec) and VSX(Power7), NEON for ARM.
Stars: ✭ 1,263 (+1069.44%)
Mutual labels:  neon, simd, avx2
Quadray Engine
Realtime raytracer using SIMD on ARM, MIPS, PPC and x86
Stars: ✭ 13 (-87.96%)
Mutual labels:  neon, simd, avx2
Libsimdpp
Portable header-only C++ low level SIMD library
Stars: ✭ 914 (+746.3%)
Mutual labels:  neon, simd, avx2
vastringify
Type-safe Printf in C
Stars: ✭ 60 (-44.44%)
Mutual labels:  unicode, utf8, utf16
Umesimd
UME::SIMD A library for explicit simd vectorization.
Stars: ✭ 66 (-38.89%)
Mutual labels:  neon, simd, avx2
utf8
Fast UTF-8 validation with range algorithm (NEON+SSE4+AVX2)
Stars: ✭ 60 (-44.44%)
Mutual labels:  neon, simd, avx2
Nsimd
Agenium Scale vectorization library for CPUs and GPUs
Stars: ✭ 138 (+27.78%)
Mutual labels:  neon, simd, avx2
Directxmath
DirectXMath is an all inline SIMD C++ linear algebra library for use in games and graphics apps
Stars: ✭ 859 (+695.37%)
Mutual labels:  neon, simd, avx2
Boost.simd
Boost SIMD
Stars: ✭ 238 (+120.37%)
Mutual labels:  neon, simd, avx2
Fastnoisesimd
C++ SIMD Noise Library
Stars: ✭ 542 (+401.85%)
Mutual labels:  neon, simd, avx2
Vc
SIMD Vector Classes for C++
Stars: ✭ 985 (+812.04%)
Mutual labels:  neon, simd, avx2
Highway
Performance-portable, length-agnostic SIMD with runtime dispatch
Stars: ✭ 301 (+178.7%)
Mutual labels:  neon, simd, avx2
Base64simd
Base64 coding and decoding with SIMD instructions (SSE/AVX2/AVX512F/AVX512BW/AVX512VBMI/ARM Neon)
Stars: ✭ 115 (+6.48%)
Mutual labels:  neon, simd, avx2
Simdjson
Parsing gigabytes of JSON per second
Stars: ✭ 15,115 (+13895.37%)
Mutual labels:  neon, simd, avx2
Turbo-Histogram
Fastest Histogram Construction
Stars: ✭ 44 (-59.26%)
Mutual labels:  simd, avx2, sse2
unicode display width
Displayed width of UTF-8 strings in Modern C++
Stars: ✭ 30 (-72.22%)
Mutual labels:  unicode, utf8
sliceslice-rs
A fast implementation of single-pattern substring search using SIMD acceleration.
Stars: ✭ 66 (-38.89%)
Mutual labels:  simd, avx2

Alpine Linux MSYS2-CI MSYS2-CLANG-CI Ubuntu 20.04 CI (GCC 9) VS16-ARM-CI VS16-CI

simdutf: Unicode validation and transcoding at billions of characters per second

Most modern software relies on the Unicode standard. In memory, Unicode strings are represented using either UTF-8 or UTF-16. The UTF-8 format is the de facto standard on the web (JSON, HTML, etc.) and it has been adopted as the default in many popular programming languages (Go, Rust, Swift, etc.). The UTF-16 format is standard in Java, C# and in many Windows technologies.

Not all sequences of bytes are valid Unicode strings. It is unsafe to use Unicode strings in UTF-8 and UTF-16LE without first validating them. Furthermore, we often need to convert strings from one encoding to another, by a process called transcoding. For security purposes, such transcoding should be validating: it should refuse to transcode incorrect strings.

This library provide fast Unicode functions such as

  • UTF-8 and UTF-16LE validation,
  • UTF-8 to UTF-16LE transcoding, with or without validation,
  • UTF-16LE to UTF-8 transcoding, with or without validation,
  • From an UTF-8 string, compute the size of the UTF-16 equivalent string,
  • From an UTF-16 string, compute the size of the UTF-8 equivalent string,
  • UTF-8 and UTF-16LE character counting.

The functions are accelerated using SIMD instructions (e.g., ARM NEON, SSE, AVX, etc.). When your strings contain hundreds of characters, we can often transcode them at speeds exceeding a billion caracters per second. You should expect high speeds not only with English strings (ASCII) but also Chinese, Japanese, Arabic, and so forth. We handle the full character range (including, for example, emojis).

The library compiles down to tens of kilobytes. Our functions are exception-free and non allocating. We have extensive tests.

How fast is it?

Over a wide range of realistic data sources, we transcode a billion characters per second or more. Our approach can be 3 to 10 times faster than the popular ICU library on difficult (non-ASCII) strings. We can be 20x faster than ICU when processing easy strings (ASCII). Our good results apply to both recent x64 and ARM processors.

To illustrate, we present a benchmark result with values are in billions of characters processed by second. Consider the following figures.

Datasets: https://github.com/lemire/unicode_lipsum

Please refer to our benchmarking tool for a proper interpretation of the numbers. Our results are reproducible.

Requirements

  • C++11 compatible compiler. We support LLVM clang, GCC, Visual Studio. (Our optional benchmark tool requires C++17.)
  • For high speed, you should have a recent 64-bit system (e.g., ARM or x64).
  • If you rely on CMake, you should use a recent CMake (at least 3.15) ; otherwise you may use the single header version. The library is also available from Microsoft's vcpkg.

Usage (Usage)

We made a video to help you get started with the library.

the simdutf library

Usage (CMake)

cmake -B build
cmake --build build
cd build
ctest .

Visual Studio users must specify whether they want to build the Release or Debug version.

To run benchmarks, execute the benchmark command. You can get help on its usage by first building it and then calling it with the --help flag. E.g., under Linux you may do the following:

cmake -B build
cmake --build build
./build/benchmarks/benchmark --help

Instructions are similar for Visual Studio users.

Since ICU is so common and popular, we assume that you may have it already on your system. When it is not found, it is simply omitted from the benchmarks. Thus, to benchmark against ICU, make sure you have ICU installed on your machine and that cmake can find it. For macOS, you may install it with brew using brew install icu4c. If you have ICU on your system but cmake cannot find it, you may need to provide cmake with a path to ICU, such as ICU_ROOT=/usr/local/opt/icu4c cmake -B build.

Single-header version

You can create a single-header version of the library where all of the code is put into two files (simdutf.h and simdutf.cpp). We publish a zip archive containing these files, e.g., see https://github.com/simdutf/simdutf/releases/download/v1.0.0/singleheader.zip

You may generate it on your own using a Python script.

python3 ./singleheader/amalgamate.py

We require Python 3 or better.

Under Linux and macOS, you may test it as follows:

cd singleheader
c++ -o amalgamation_demo amalgamation_demo.cpp -std=c++17
./amalgamation_demo

Example

Using the single-header version, you could compile the following program.

#include <iostream>
#include <memory>

#include "simdutf.cpp"
#include "simdutf.h"

int main(int argc, char *argv[]) {
  const char *source = "1234";
  // 4 == strlen(source)
  bool validutf8 = simdutf::validate_utf8(source, 4);
  if (validutf8) {
    std::cout << "valid UTF-8" << std::endl;
  } else {
    std::cerr << "invalid UTF-8" << std::endl;
    return EXIT_FAILURE;
  }
  // We need a buffer of size where to write the UTF-16LE words.
  size_t expected_utf16words = simdutf::utf16_length_from_utf8(source, 4);
  std::unique_ptr<char16_t[]> utf16_output{new char16_t[expected_utf16words]};
  // convert to UTF-16LE
  size_t utf16words =
      simdutf::convert_utf8_to_utf16(source, 4, utf16_output.get());
  std::cout << "wrote " << utf16words << " UTF-16LE words." << std::endl;
  // It wrote utf16words * sizeof(char16_t) bytes.
  bool validutf16 = simdutf::validate_utf16(utf16_output.get(), utf16words);
  if (validutf16) {
    std::cout << "valid UTF-16LE" << std::endl;
  } else {
    std::cerr << "invalid UTF-16LE" << std::endl;
    return EXIT_FAILURE;
  }
  // convert it back:
  // We need a buffer of size where to write the UTF-8 words.
  size_t expected_utf8words =
      simdutf::utf8_length_from_utf16(utf16_output.get(), utf16words);
  std::unique_ptr<char[]> utf8_output{new char[expected_utf8words]};
  // convert to UTF-8
  size_t utf8words = simdutf::convert_utf16_to_utf8(
      utf16_output.get(), utf16words, utf8_output.get());
  std::cout << "wrote " << utf8words << " UTF-8 words." << std::endl;
  std::string final_string(utf8_output.get(), utf8words);
  std::cout << final_string << std::endl;
  if (final_string != source) {
    std::cerr << "bad conversion" << std::endl;
    return EXIT_FAILURE;
  } else {
    std::cerr << "perfect round trip" << std::endl;
  }
  return EXIT_SUCCESS;
}

API

Our API is made of a few non-allocating function. They typically take a pointer and a length as a parameter, and they sometimes take a pointer to an output buffer. Users are responsible for memory allocation.

namespace simdutf {


/**
 * Validate the UTF-8 string.
 *
 * Overridden by each implementation.
 *
 * @param buf the UTF-8 string to validate.
 * @param len the length of the string in bytes.
 * @return true if and only if the string is valid UTF-8.
 */
simdutf_warn_unused bool validate_utf8(const char *buf, size_t len) noexcept;

/**
 * Validate the UTF-16LE string.
 *
 * Overridden by each implementation.
 *
 * This function is not BOM-aware.
 *
 * @param buf the UTF-16LE string to validate.
 * @param len the length of the string in number of 2-byte words (char16_t).
 * @return true if and only if the string is valid UTF-16LE.
 */
simdutf_warn_unused bool validate_utf16(const char16_t *buf, size_t len) noexcept;

/**
 * Convert possibly broken UTF-8 string into UTF-16LE string.
 *
 * During the conversion also validation of the input string is done.
 * This function is suitable to work with inputs from untrusted sources.
 *
 * @param input         the UTF-8 string to convert
 * @param length        the length of the string in bytes
 * @param utf16_buffer  the pointer to buffer that can hold conversion result
 * @return the number of written char16_t; 0 if the input was not valid UTF-8 string
 */
simdutf_warn_unused size_t convert_utf8_to_utf16(const char * input, size_t length, char16_t* utf8_output) noexcept;

/**
 * Convert valid UTF-8 string into UTF-16LE string.
 *
 * This function assumes that the input string is valid UTF-8.
 *
 * @param input         the UTF-8 string to convert
 * @param length        the length of the string in bytes
 * @param utf16_buffer  the pointer to buffer that can hold conversion result
 * @return the number of written char16_t
 */
simdutf_warn_unused size_t convert_valid_utf8_to_utf16(const char * input, size_t length, char16_t* utf16_buffer) noexcept;

/**
 * Compute the number of 2-byte words that this UTF-8 string would require in UTF-16LE format.
 *
 * This function does not validate the input.
 *
 * @param input         the UTF-8 string to process
 * @param length        the length of the string in bytes
 * @return the number of char16_t words required to encode the UTF-8 string as UTF-16LE
 */
simdutf_warn_unused size_t utf16_length_from_utf8(const char * input, size_t length) noexcept;

/**
 * Convert possibly broken UTF-16LE string into UTF-8 string.
 *
 * During the conversion also validation of the input string is done.
 * This function is suitable to work with inputs from untrusted sources.
 *
 * This function is not BOM-aware.
 *
 * @param input         the UTF-16LE string to convert
 * @param length        the length of the string in 2-byte words (char16_t)
 * @param utf8_buffer   the pointer to buffer that can hold conversion result
 * @return number of written words; 0 if input is not a valid UTF-16LE string
 */
simdutf_warn_unused size_t convert_utf16_to_utf8(const char16_t * input, size_t length, char* utf8_buffer) noexcept;

/**
 * Convert valid UTF-16LE string into UTF-8 string.
 *
 * This function assumes that the input string is valid UTF-16LE.
 *
 * This function is not BOM-aware.
 *
 * @param input         the UTF-16LE string to convert
 * @param length        the length of the string in 2-byte words (char16_t)
 * @param utf8_buffer   the pointer to buffer that can hold the conversion result
 * @return number of written words; 0 if conversion is not possible
 */
simdutf_warn_unused size_t convert_valid_utf16_to_utf8(const char16_t * input, size_t length, char* utf8_buffer) noexcept;

/**
 * Compute the number of bytes that this UTF-16LE string would require in UTF-8 format.
 *
 * This function does not validate the input.
 *
 * This function is not BOM-aware.
 *
 * @param input         the UTF-16LE string to convert
 * @param length        the length of the string in 2-byte words (char16_t)
 * @return the number of bytes required to encode the UTF-16LE string as UTF-8
 */
simdutf_warn_unused size_t utf8_length_from_utf16(const char16_t * input, size_t length) noexcept;

/**
 * Count the number of code points (characters) in the string assuming that
 * it is valid.
 *
 * This function assumes that the input string is valid UTF-16LE.
 *
 * This function is not BOM-aware.
 *
 * @param input         the UTF-16LE string to process
 * @param length        the length of the string in 2-byte words (char16_t)
 * @return number of code points
 */
simdutf_warn_unused size_t count_utf16(const char16_t * input, size_t length) noexcept;

/**
 * Count the number of code points (characters) in the string assuming that
 * it is valid.
 *
 * This function assumes that the input string is valid UTF-8.
 *
 * @param input         the UTF-8 string to process
 * @param length        the length of the string in bytes
 * @return number of code points
 */
simdutf_warn_unused size_t count_utf8(const char * input, size_t length) noexcept;


}

Usage

The library used by haskell/text.

References

License

This code is made available under the Apache License 2.0 as well as the MIT license.

We include a few competitive solutions under the benchmarks/competition directory. They are provided for research purposes only.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].