BurntSushi / Bstr

A string type for Rust that is not required to be valid UTF-8.

Projects that are alternatives of or similar to Bstr

Tiny Utf8
Unicode (UTF-8) capable std::string
Stars: ✭ 322 (-7.47%)
Mutual labels:  unicode, utf-8
characteristics
Character info under different encodings
Stars: ✭ 25 (-92.82%)
Mutual labels:  unicode, utf-8
ocreval
Update of the ISRI Analytic Tools for OCR Evaluation with UTF-8 support
Stars: ✭ 48 (-86.21%)
Mutual labels:  unicode, utf-8
Voca rs
Voca_rs is the ultimate Rust string library inspired by Voca.js, string.py and Inflector, implemented as independent functions and on Foreign Types (String and str).
Stars: ✭ 167 (-52.01%)
Mutual labels:  unicode, utf-8
UniObfuscator
Java obfuscator that hides code in comment tags and Unicode garbage by making use of Java's Unicode escapes.
Stars: ✭ 40 (-88.51%)
Mutual labels:  unicode, utf-8
Stringz
💯 Super fast unicode-aware string manipulation Javascript library
Stars: ✭ 181 (-47.99%)
Mutual labels:  unicode, utf-8
homoglyphs
Homoglyphs: get similar letters, convert to ASCII, detect possible languages and UTF-8 group.
Stars: ✭ 70 (-79.89%)
Mutual labels:  unicode, utf-8
Transliteration
UTF-8 to ASCII transliteration / slugify module for node.js, browser, Web Worker, React Native, Electron and CLI.
Stars: ✭ 444 (+27.59%)
Mutual labels:  unicode, utf-8
Lingo
Text encoding for modern C++
Stars: ✭ 28 (-91.95%)
Mutual labels:  unicode, utf-8
UnicodeBOMInputStream
Doing things right, in the name of Sun / Oracle
Stars: ✭ 36 (-89.66%)
Mutual labels:  unicode, utf-8
Unibits
Visualize different Unicode encodings in the terminal
Stars: ✭ 125 (-64.08%)
Mutual labels:  unicode, utf-8
libWinTF8
The library handling things related to UTF-8 and Unicode when you want to port your program to Windows
Stars: ✭ 18 (-94.83%)
Mutual labels:  unicode, utf-8
Unicopy
Unicode command-line codepoint dumper
Stars: ✭ 16 (-95.4%)
Mutual labels:  unicode, utf-8
jurl
Fast and simple URL parsing for Java, with UTF-8 and path resolving support
Stars: ✭ 84 (-75.86%)
Mutual labels:  unicode, utf-8
Awesome Unicode
😂 👌 A curated list of delightful Unicode tidbits, packages and resources.
Stars: ✭ 693 (+99.14%)
Mutual labels:  unicode, utf-8
utf8-validator
UTF-8 Validator
Stars: ✭ 18 (-94.83%)
Mutual labels:  unicode, utf-8
Tomlplusplus
Header-only TOML config file parser and serializer for C++17 (and later!).
Stars: ✭ 403 (+15.8%)
Mutual labels:  unicode, utf-8
Portable Utf8
🉑 Portable UTF-8 library - performance optimized (unicode) string functions for php.
Stars: ✭ 405 (+16.38%)
Mutual labels:  unicode, utf-8
simdutf8
SIMD-accelerated UTF-8 validation for Rust.
Stars: ✭ 426 (+22.41%)
Mutual labels:  unicode, utf-8
unicode-c
A C library for handling Unicode, UTF-8, surrogate pairs, etc.
Stars: ✭ 32 (-90.8%)
Mutual labels:  unicode, utf-8

bstr

This crate provides extension traits for &[u8] and Vec<u8> that enable their use as byte strings, where byte strings are conventionally UTF-8. This differs from the standard library's String and str types in that they are not required to be valid UTF-8, but may be fully or partially valid UTF-8.

Documentation

https://docs.rs/bstr

When should I use byte strings?

See this part of the documentation for more details: https://docs.rs/bstr/0.2.*/bstr/#when-should-i-use-byte-strings.

The short story is that byte strings are useful when it is inconvenient or incorrect to require valid UTF-8.
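
As a brief illustration, here is a minimal sketch (using the ByteSlice extension trait from this crate; the assertions are only illustrative) of string-like operations on bytes that are not valid UTF-8:

use bstr::ByteSlice;

fn main() {
    // \xFF can never occur in valid UTF-8, so this is not a valid &str.
    let bytes: &[u8] = b"foo\xFFbar";
    // Extension trait methods from bstr work on it anyway.
    assert!(bytes.contains_str("bar"));
    // Lossy conversion substitutes the Unicode replacement codepoint.
    assert_eq!(bytes.to_str_lossy(), "foo\u{FFFD}bar");
}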

Usage

Add this to your Cargo.toml:

[dependencies]
bstr = "0.2"

Examples

The following examples exhibit both the API features of byte strings and the I/O convenience functions provided for reading line-by-line quickly.

This first example simply shows how to efficiently iterate over lines in stdin, and print out lines containing a particular substring:

use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        if line.contains_str("Dimension") {
            stdout.write_all(line)?;
        }
        Ok(true)
    })?;
    Ok(())
}

This example shows how to count all of the words (Unicode-aware) in stdin, line-by-line:

use std::error::Error;
use std::io;

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut words = 0;
    stdin.lock().for_byte_line_with_terminator(|line| {
        words += line.words().count();
        Ok(true)
    })?;
    println!("{}", words);
    Ok(())
}

This example shows how to convert a stream on stdin to uppercase without performing UTF-8 validation, while amortizing allocation. On standard ASCII text, this is quite a bit faster than what you can (easily) do with standard library APIs. (N.B. Any invalid UTF-8 bytes are passed through unchanged.)

use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    let mut upper = vec![];
    stdin.lock().for_byte_line_with_terminator(|line| {
        upper.clear();
        line.to_uppercase_into(&mut upper);
        stdout.write_all(&upper)?;
        Ok(true)
    })?;
    Ok(())
}

This example shows how to extract the first 10 visual characters (as grapheme clusters) from each line, where invalid UTF-8 sequences are generally treated as a single character and are passed through correctly:

use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        // Find the byte offset just past the 10th grapheme cluster, or the
        // end of the line if it contains fewer than 10 graphemes.
        let end = line
            .grapheme_indices()
            .map(|(_, end, _)| end)
            .take(10)
            .last()
            .unwrap_or(line.len());
        stdout.write_all(line[..end].trim_end())?;
        stdout.write_all(b"\n")?;
        Ok(true)
    })?;
    Ok(())
}

Cargo features

This crate comes with a few features that control standard library, serde, and Unicode support. A sketch of a non-default feature configuration is shown after the list below.

  • std - Enabled by default. This provides APIs that require the standard library, such as Vec<u8>.
  • unicode - Enabled by default. This provides APIs that require sizable Unicode data compiled into the binary. This includes, but is not limited to, grapheme/word/sentence segmenters. When this is disabled, basic support such as UTF-8 decoding is still included.
  • serde1 - Disabled by default. Enables implementations of serde traits for the BStr and BString types.
  • serde1-nostd - Disabled by default. Enables implementations of serde traits for the BStr type only, intended for use without the standard library. Generally, you either want serde1 or serde1-nostd, not both.
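
For example, a minimal, hypothetical Cargo.toml entry for a no_std build that still wants serde support might look like this (a sketch only; adjust the feature list to your needs):

[dependencies]
bstr = { version = "0.2", default-features = false, features = ["serde1-nostd"] }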

Minimum Rust version policy

This crate's minimum supported rustc version (MSRV) is 1.28.0.

In general, this crate will be conservative with respect to the minimum supported version of Rust. MSRV may be bumped in minor version releases.

Future work

Since this is meant to be a core crate, getting a 1.0 release is a priority. My hope is to move to 1.0 within the next year and commit to its API so that bstr can be used as a public dependency.

A large part of the API surface area was taken from the standard library, so from an API design perspective, a good portion of this crate should be mature. The main differences from the standard library are in how the various substring search routines work. The standard library provides generic infrastructure for supporting different types of searches with a single method, whereas this library prefers to define new methods for each type of search and drop the generic infrastructure.
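
As a rough sketch of that difference (assuming the find, find_byte and find_char methods on the ByteSlice extension trait), each kind of needle gets its own method rather than one generic entry point:

use bstr::ByteSlice;

fn main() {
    let haystack: &[u8] = b"foo bar baz";
    // Substring, single-byte and char searches are separate methods.
    assert_eq!(haystack.find("bar"), Some(4));
    assert_eq!(haystack.find_byte(b'z'), Some(10));
    assert_eq!(haystack.find_char('b'), Some(4));
}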

Some probable future considerations for APIs include, but are not limited to:

  • A convenience layer on top of the aho-corasick crate.
  • Unicode normalization.
  • More sophisticated support for dealing with Unicode case, perhaps by combining the use cases supported by caseless and unicase.
  • Add facilities for dealing with OS strings and file paths, probably via simple conversion routines.

Here are some examples that are probably out of scope for this crate:

  • Regular expressions.
  • Unicode collation.

The exact scope isn't quite clear, but I expect we can iterate on it.

In general, as stated below, this crate is an experiment in bringing lots of related APIs together into a single crate while simultaneously attempting to keep the total number of dependencies low. Indeed, every dependency of bstr, except for memchr, is optional.

High level motivation

Strictly speaking, the bstr crate provides very little that can't already be achieved with the standard library Vec<u8>/&[u8] APIs and the ecosystem of library crates. For example:

  • The standard library's Utf8Error can be used for incremental lossy decoding of &[u8].
  • The unicode-segmentation crate can be used for iterating over graphemes (or words), but is only implemented for &str types. One could use Utf8Error above to implement grapheme iteration with the same semantics as what bstr provides (automatic Unicode replacement codepoint substitution).
  • The twoway crate can be used for fast substring searching on &[u8].

So why create bstr? Part of the point of the bstr crate is to provide a uniform API of coupled components instead of relying on users to piece together loosely coupled components from the crate ecosystem. For example, if you wanted to perform a search and replace in a Vec<u8>, then writing the code to do that with the twoway crate is not that difficult, but it's still additional glue code you have to write. This work adds up depending on what you're doing. Consider, for example, trimming and splitting, along with their different variants.
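
For instance, a search and replace over raw bytes becomes a single method call rather than hand-written glue around a substring searcher (a minimal sketch assuming the replace method on the ByteSlice extension trait):

use bstr::ByteSlice;

fn main() {
    let orig: &[u8] = b"foo bar foo";
    // Replaces every occurrence of the needle, allocating a new Vec<u8>.
    let replaced = orig.replace("foo", "quux");
    assert_eq!(&replaced[..], &b"quux bar quux"[..]);
}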

In other words, bstr is partially a way of pushing back against the micro-crate ecosystem that appears to be evolving. It's not clear to me whether this experiment will be successful or not, but it is definitely a goal of bstr to keep its dependency list lightweight. For example, serde is an optional dependency because there is no feasible alternative, but twoway is not, where we instead prefer to implement our own substring search. In service of this philosophy, currently, the only required dependency of bstr is memchr.

License

This project is licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT license (LICENSE-MIT)

at your option.

The data in src/unicode/data/ is licensed under the Unicode License Agreement (LICENSE-UNICODE), although this data is only used in tests.
