gpakosz / UnicodeBOMInputStream

Licence: other
Doing things right, in the name of Sun / Oracle

UnicodeBOMInputStream

A helper class to skip Unicode BOMs at the beginning of input streams.

I initially released this class as a Stack Overflow answer, and it has apparently been copy-pasted into several Java projects since. However, code posted in Stack Overflow answers is licensed under CC-BY-SA 3.0, which may not suit everybody.


Why?

Many years have passed since I wrote this class and today Java still doesn't properly deal with UTF-8 Unicode Byte Order Marks (BOMs) at the beginning of data. In 2001, someone opened bug JDK-4508058 with the sound expectation Java should detect and skip UTF-8 BOMs at the beginning of UTF-8 streams, the same way it does for e.g. UTF-16.
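To make the difference concrete, here is a small sketch that decodes the same letter with a leading BOM in both encodings, using only the JDK's standard charsets. The class and method names are made up for illustration and are not part of this project.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows how the JDK decoders treat a leading BOM.
class BomDecodingDemo {

    // "A" preceded by the UTF-8 BOM (EF BB BF); the decoder keeps the BOM as U+FEFF.
    static String decodeUtf8WithBom() {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A'};
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // "A" preceded by the big-endian UTF-16 BOM (FE FF); the decoder consumes the BOM.
    static String decodeUtf16WithBom() {
        byte[] utf16 = {(byte) 0xFE, (byte) 0xFF, 0x00, 'A'};
        return new String(utf16, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String utf8 = decodeUtf8WithBom();
        String utf16 = decodeUtf16WithBom();
        // The UTF-8 result has an extra leading character, U+FEFF.
        System.out.println(utf8.length() + " " + (utf8.charAt(0) == '\uFEFF')); // 2 true
        // The UTF-16 result is just the letter.
        System.out.println(utf16.length()); // 1
    }
}
```

The stray U+FEFF is invisible when printed, which is why it tends to surface later as a failed string comparison or a parser error on the first line of a file.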

Bug JDK-4508058 remained open for a while, then got fixed, and the fix was ultimately reverted because some other great programmers relied on that exact same bug:

the Java EE 5 RI and SJSAS 9.0 has been relying on detecting a BOM, setting the appropriate encoding, and discarding the BOM bytes before reading the input

See, they're complaining because shipped code breaks if/when JDK behavior changes. And instead of fixing JDK-4508058 and accepting that this would be an annoyance only for Java EE 5 RI and SJSAS 9.0 users, the people in charge at Sun decided we'd all be living in a better world if JDK-4508058 got closed as "won't fix". Because fuck you, just skip the BOM yourself.


Usage

Wrap any InputStream with UnicodeBOMInputStream and use its getBOM() and/or skipBOM() methods. See UnicodeBOMInputStreamUsage.java.
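For readers who want to see what "skipping the BOM" amounts to, here is a minimal, self-contained sketch of the wrapping idea using the JDK's PushbackInputStream. It is not this library's code: it handles only the UTF-8 BOM, and the names SkipUtf8BomSketch, skipUtf8Bom and decodeSkippingBom are hypothetical. It assumes Java 9 or later for readAllBytes().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in, not the UnicodeBOMInputStream class itself.
class SkipUtf8BomSketch {

    /** Returns a stream positioned just past a leading UTF-8 BOM (EF BB BF), if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or end of stream.
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: give the bytes back to the stream
        }
        return pushback;
    }

    /** Demo helper: decodes the bytes as UTF-8 after skipping any leading BOM. */
    static String decodeSkippingBom(byte[] data) {
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(data))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8); // Java 9+
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] withoutBom = {'h', 'i'};
        System.out.println(decodeSkippingBom(withBom));    // hi
        System.out.println(decodeSkippingBom(withoutBom)); // hi
    }
}
```

With the library itself, the same pattern should reduce to wrapping the stream in UnicodeBOMInputStream and calling skipBOM() before handing it to a reader. UnicodeBOMInputStreamUsage.java is the authoritative reference for the actual signatures.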


If you find this library useful and decide to use it in your own projects please drop me a line @gpakosz.

If you use it in a commercial project, consider using Gittip.
