Licence: BSD-3-Clause
UTF-8 Validator

Projects that are alternatives of or similar to utf8-validator

Encoding.js
Convert or detect character encoding in JavaScript
Stars: ✭ 338 (+1777.78%)
Mutual labels:  unicode, utf-8
Transliteration
UTF-8 to ASCII transliteration / slugify module for node.js, browser, Web Worker, React Native, Electron and CLI.
Stars: ✭ 444 (+2366.67%)
Mutual labels:  unicode, utf-8
Bstr
A string type for Rust that is not required to be valid UTF-8.
Stars: ✭ 348 (+1833.33%)
Mutual labels:  unicode, utf-8
unicode-c
A C library for handling Unicode, UTF-8, surrogate pairs, etc.
Stars: ✭ 32 (+77.78%)
Mutual labels:  unicode, utf-8
Voca rs
Voca_rs is the ultimate Rust string library inspired by Voca.js, string.py and Inflector, implemented as independent functions and on Foreign Types (String and str).
Stars: ✭ 167 (+827.78%)
Mutual labels:  unicode, utf-8
libWinTF8
The library handling things related to UTF-8 and Unicode when you want to port your program to Windows
Stars: ✭ 18 (+0%)
Mutual labels:  unicode, utf-8
Portable Utf8
🉑 Portable UTF-8 library - performance optimized (unicode) string functions for php.
Stars: ✭ 405 (+2150%)
Mutual labels:  unicode, utf-8
UnicodeBOMInputStream
Doing things right, in the name of Sun / Oracle
Stars: ✭ 36 (+100%)
Mutual labels:  unicode, utf-8
Unibits
Visualize different Unicode encodings in the terminal
Stars: ✭ 125 (+594.44%)
Mutual labels:  unicode, utf-8
Unicopy
Unicode command-line codepoint dumper
Stars: ✭ 16 (-11.11%)
Mutual labels:  unicode, utf-8
StringConvert
A simple C++11 based helper for converting string between a various charset
Stars: ✭ 16 (-11.11%)
Mutual labels:  unicode, charset
jurl
Fast and simple URL parsing for Java, with UTF-8 and path resolving support
Stars: ✭ 84 (+366.67%)
Mutual labels:  unicode, utf-8
UniObfuscator
Java obfuscator that hides code in comment tags and Unicode garbage by making use of Java's Unicode escapes.
Stars: ✭ 40 (+122.22%)
Mutual labels:  unicode, utf-8
Tiny Utf8
Unicode (UTF-8) capable std::string
Stars: ✭ 322 (+1688.89%)
Mutual labels:  unicode, utf-8
Lingo
Text encoding for modern C++
Stars: ✭ 28 (+55.56%)
Mutual labels:  unicode, utf-8
Tomlplusplus
Header-only TOML config file parser and serializer for C++17 (and later!).
Stars: ✭ 403 (+2138.89%)
Mutual labels:  unicode, utf-8
characteristics
Character info under different encodings
Stars: ✭ 25 (+38.89%)
Mutual labels:  unicode, utf-8
simdutf8
SIMD-accelerated UTF-8 validation for Rust.
Stars: ✭ 426 (+2266.67%)
Mutual labels:  unicode, utf-8
Awesome Unicode
😂 👌 A curated list of delightful Unicode tidbits, packages and resources.
Stars: ✭ 693 (+3750%)
Mutual labels:  unicode, utf-8
Stringz
💯 Super fast unicode-aware string manipulation Javascript library
Stars: ✭ 181 (+905.56%)
Mutual labels:  unicode, utf-8

UTF-8 Validator

A UTF-8 Validation Tool which may be used as either a command line tool or as a library embedded in your own program.

Released under the BSD 3-Clause Licence.

Use from the Command Line

You can either download the application as a ZIP file from here or build it from the source code. Extract the ZIP file to wherever you keep your applications on your computer, then run either bin/validate.sh (Linux/Mac/Unix) or bin\validate.bat (Windows).

For example, to report all validation errors:

$ cd /opt/utf8-validator-1.2
$ bin/validate.sh /tmp/my-file.txt

For example, to report the first validation error and exit:

$ cd /opt/utf8-validator-1.2
$ bin/validate.sh --fail-fast /tmp/my-file.txt

Command Line Exit Codes

  • 0 Success
  • 1 Invalid arguments provided to the application
  • 2 File was not valid UTF-8
  • 4 IO error, e.g. could not read file
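If you launch the command line tool from another Java program, the exit codes above can be turned into readable messages. The following is only a minimal sketch: the class name ExitCodeDemo and the method describe are hypothetical and not part of the UTF-8 Validator, and the main method assumes that the current directory is the extracted application directory and that the file to check is passed as the first argument.

```java
/** Hypothetical helper, not part of the UTF-8 Validator itself. */
final class ExitCodeDemo {

    /** Maps an exit code from the list above to a short description. */
    static String describe(final int exitCode) {
        switch (exitCode) {
            case 0: return "Success";
            case 1: return "Invalid arguments provided to the application";
            case 2: return "File was not valid UTF-8";
            case 4: return "IO error, e.g. could not read file";
            default: return "Unknown exit code " + exitCode;
        }
    }

    public static void main(final String[] args) throws Exception {
        // Assumption: we are running from the extracted application directory
        // on Linux/Mac/Unix, and args[0] is the path of the file to validate.
        final Process process = new ProcessBuilder("bin/validate.sh", "--fail-fast", args[0])
                .inheritIO()   // show the validator's own output on our console
                .start();
        System.out.println(describe(process.waitFor()));
    }
}
```

Any code that is not in the list falls through to the default branch, so the sketch does not guess at meanings the list does not give.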

Use as a Library

The UTF-8 Validator is written in Java and can be used from any application that runs on the JVM (Java, Scala, Clojure, etc.). We use the Maven build system, and our artifacts are published to Maven Central.

If you are using Maven, you can simply add this to the dependencies section of your pom.xml:

<dependency>
    <groupId>uk.gov.nationalarchives</groupId>
    <artifactId>utf8-validator</artifactId>
    <version>1.2</version>
</dependency>

Alternatively, if you are using sbt, you can add this to your library dependencies:

"uk.gov.nationalarchives" % "utf8-validator" % "1.2"

To use the library, you need to implement the very simple interface uk.gov.nationalarchives.utf8.validator.ValidationHandler (or you can use uk.gov.nationalarchives.utf8.validator.PrintingValidationHandler if it suits you). The interface has a single method, which is called whenever the validator finds a validation error. You can then instantiate Utf8Validator and validate from either a java.io.File or a java.io.InputStream. For example:

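import java.io.File;
import uk.gov.nationalarchives.utf8.validator.Utf8Validator;
import uk.gov.nationalarchives.utf8.validator.ValidationException;
import uk.gov.nationalarchives.utf8.validator.ValidationHandler;

// Called once for each validation error that the validator finds
ValidationHandler handler = new ValidationHandler() {
	@Override
	public void error(final String message, final long byteOffset) throws ValidationException {
		System.err.println("[Error][@" + byteOffset + "] " + message);
	}
};

File f = ... //your file here

new Utf8Validator(handler).validate(f);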
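If you would rather inspect the errors afterwards than print them as they occur, a handler can collect them in a list. The following is only a sketch under stated assumptions: CollectingHandlerDemo and CollectingHandler are hypothetical names, and the nested ValidationHandler and ValidationException types are stand-ins that mirror the method signature shown above so that the sketch compiles on its own. In a real project you would delete the stand-ins and import the library's types instead.

```java
import java.util.ArrayList;
import java.util.List;

final class CollectingHandlerDemo {

    // Stand-ins (assumption): they mirror the signature shown above so this
    // sketch compiles without the library. Replace them with the real imports.
    static class ValidationException extends Exception {
        ValidationException(final String message) { super(message); }
    }

    interface ValidationHandler {
        void error(String message, long byteOffset) throws ValidationException;
    }

    /** Hypothetical handler that records each error instead of printing it. */
    static final class CollectingHandler implements ValidationHandler {
        private final List<String> errors = new ArrayList<>();

        @Override
        public void error(final String message, final long byteOffset) {
            errors.add("@" + byteOffset + ": " + message);
        }

        List<String> errors() { return errors; }

        boolean isValid() { return errors.isEmpty(); }
    }

    public static void main(final String[] args) throws Exception {
        final CollectingHandler handler = new CollectingHandler();
        // With the real library you would call:
        //   new Utf8Validator(handler).validate(file);
        // Here one callback is simulated with a made-up message and offset.
        handler.error("example error message", 12L);
        System.out.println(handler.isValid() + " " + handler.errors());
    }
}
```

Because the handler never throws, validation carries on after each error and the list ends up holding every problem that was reported.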

Building from Source Code
