CharsetDetector / Utf Unknown

Charset detector built in C# - .NET Core 2-3, .NET Standard 1-2 & .NET 4+

UTF Unknown

Detect character set for files, streams and other bytes.

Detection of character sets with a simple and redesigned interface.

This package is based on Ude and since version 2 also on uchardet, which are ports of the Mozilla Universal Charset Detector.

The interface and other classes have been redesigned so that the library is easier to use and follows better object-oriented design (OOD). Unit tests and CI have been added.

Features:

  • New API
  • Moved to .NET Standard
  • Added more unit tests
  • Builds on CI (AppVeyor)
  • Strong named
  • Documentation added
  • Multiple bugs from Ude fixed

Supported Platforms

  • .NET Framework 4+
  • .NET Standard 1.0
  • .NET Standard 1.3 and 2.0 (depends on System.Text.Encoding.CodePages)
  • .NET Core 3.0 (depends on System.Text.Encoding.CodePages, which is part of the shared framework as of this version)

Remarks: You can still register your own EncodingProvider, so that Encoding.GetEncoding(...) consults it first when looking up an encoding.
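
As an illustration of registering a provider, the sketch below registers the stock CodePagesEncodingProvider from System.Text.Encoding.CodePages. The class and method names (EncodingSetup, RegisterCodePages) are hypothetical, and whether your application needs this call at all depends on the target platform; a custom EncodingProvider subclass would be registered the same way.

```csharp
using System.Text;

static class EncodingSetup
{
    // Call once at application start-up, before any encoding lookups.
    public static void RegisterCodePages()
    {
        // Makes legacy code pages (windows-125x, ibm8xx, ...) resolvable
        // through Encoding.GetEncoding on platforms that lack them by default.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }
}
```

After the call, a lookup such as Encoding.GetEncoding("windows-1251") is expected to succeed instead of throwing.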

Usage

Use the static DetectFrom* methods of CharsetDetector (DetectFromFile, DetectFromStream, DetectFromBytes).

using System.Collections.Generic; // IList<T>
using System.Text;                // Encoding
using UtfUnknown;                 // CharsetDetector, DetectionResult, DetectionDetail

// Detect from a file (.NET Standard 1.3+ or .NET 4+)
DetectionResult result = CharsetDetector.DetectFromFile("path/to/file.txt"); // or pass a FileInfo

// Detect from a stream (.NET Standard 1.3+ or .NET 4+)
result = CharsetDetector.DetectFromStream(stream);

// Detect from bytes
result = CharsetDetector.DetectFromBytes(byteArray);

// Get the best detection
DetectionDetail resultDetected = result.Detected;

// Get the alias of the found encoding
string encodingName = resultDetected.EncodingName;

// Get the System.Text.Encoding of the found encoding (can be null if not available)
Encoding encoding = resultDetected.Encoding;

// Get the confidence of the found encoding (between 0 and 1)
float confidence = resultDetected.Confidence;

// Get all the details of the result
IList<DetectionDetail> allDetails = result.Details;
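
Because the Encoding of a detection can be null and the confidence can be low, callers usually want a fallback. The helper below is a minimal sketch of one way to choose; EncodingChoice, Pick and the 0.5 threshold are hypothetical and not part of the library.

```csharp
using System.Text;

static class EncodingChoice
{
    // Returns the detected encoding only when it exists and the detector
    // was confident enough; otherwise returns the caller's fallback.
    // The default threshold of 0.5 is an arbitrary illustration.
    public static Encoding Pick(Encoding detected, float confidence,
                                Encoding fallback, float minConfidence = 0.5f)
    {
        if (detected == null || confidence < minConfidence)
            return fallback;
        return detected;
    }
}
```

With the variables from the usage snippet, a call could look like EncodingChoice.Pick(resultDetected.Encoding, resultDetected.Confidence, Encoding.UTF8). If the input may be empty or undetectable, it is worth also checking that the best detection itself is not null before reading its properties.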

Docs

The article "A composite approach to language/encoding detection" describes the charset detection algorithms implemented by the library.

The following charsets can be detected:

Encodings with BOM: utf-7, utf-8, utf-16be/utf-16le, utf-32be/utf-32le, X-ISO-10646-UCS-4-3412/X-ISO-10646-UCS-4-2143, gb18030.
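
To make BOM-based detection concrete, here is a minimal sketch that matches the leading bytes of a buffer against the well-known byte order marks. It is not the library's implementation; BomSniffer and DetectBom are hypothetical names. The sketch simplifies two cases: it checks only the first three bytes of the utf-7 mark, and it always reads FF FE 00 00 as utf-32le even though that sequence could also be a utf-16le BOM followed by a NUL character.

```csharp
static class BomSniffer
{
    // Longer marks come first so that a 4-byte mark wins over a 2-byte
    // mark that is its prefix (FF FE 00 00 before FF FE, and so on).
    private static readonly (byte[] Bom, string Name)[] Boms =
    {
        (new byte[] { 0x00, 0x00, 0xFE, 0xFF }, "utf-32be"),
        (new byte[] { 0xFF, 0xFE, 0x00, 0x00 }, "utf-32le"),
        (new byte[] { 0xFE, 0xFF, 0x00, 0x00 }, "X-ISO-10646-UCS-4-3412"),
        (new byte[] { 0x00, 0x00, 0xFF, 0xFE }, "X-ISO-10646-UCS-4-2143"),
        (new byte[] { 0x84, 0x31, 0x95, 0x33 }, "gb18030"),
        (new byte[] { 0xEF, 0xBB, 0xBF }, "utf-8"),
        (new byte[] { 0x2B, 0x2F, 0x76 }, "utf-7"),
        (new byte[] { 0xFE, 0xFF }, "utf-16be"),
        (new byte[] { 0xFF, 0xFE }, "utf-16le"),
    };

    // Returns the charset name for a recognized BOM, or null if none matches.
    public static string DetectBom(byte[] data)
    {
        foreach (var (bom, name) in Boms)
        {
            if (StartsWith(data, bom))
                return name;
        }
        return null;
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length)
            return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i])
                return false;
        }
        return true;
    }
}
```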

Encodings without a BOM are listed in the following table, grouped by language:

| Language | Encodings |
| --- | --- |
| International (Unicode) | utf-8 |
| Arabic | iso-8859-6, windows-1256 |
| Bulgarian | iso-8859-5, windows-1251 |
| Chinese | iso-2022-cn, big5, euc-tw, gb18030, hz-gb-2312 |
| Croatian | iso-8859-2, iso-8859-13, iso-8859-16, windows-1250, ibm852, x-mac-ce |
| Czech | windows-1250, iso-8859-2, ibm852, x-mac-ce |
| Danish | iso-8859-1, iso-8859-15, windows-1252 |
| English | ascii |
| Esperanto | iso-8859-3 |
| Estonian | iso-8859-4, iso-8859-13, windows-1252, windows-1257 |
| Finnish | iso-8859-1, iso-8859-4, iso-8859-9, iso-8859-13, iso-8859-15, windows-1252 |
| French | iso-8859-1, iso-8859-15, windows-1252 |
| German | iso-8859-1, windows-1252 |
| Greek | iso-8859-7, windows-1253 |
| Hebrew | iso-8859-8, windows-1255 |
| Hungarian | iso-8859-2, windows-1250 |
| Irish Gaelic | iso-8859-1, iso-8859-9, iso-8859-15, windows-1252 |
| Italian | iso-8859-1, iso-8859-3, iso-8859-9, iso-8859-15, windows-1252 |
| Japanese | iso-2022-jp, shift-jis, euc-jp |
| Korean | iso-2022-kr, euc-kr/uhc, cp949 |
| Lithuanian | iso-8859-4, iso-8859-10, iso-8859-13 |
| Latvian | iso-8859-4, iso-8859-10, iso-8859-13 |
| Maltese | iso-8859-3 |
| Polish | iso-8859-2, iso-8859-13, iso-8859-16, windows-1250, ibm852, x-mac-ce |
| Portuguese | iso-8859-1, iso-8859-9, iso-8859-15, windows-1252 |
| Romanian | iso-8859-2, iso-8859-16, windows-1250, ibm852 |
| Russian | iso-8859-5, koi8-r, windows-1251, x-mac-cyrillic, ibm855, ibm866 |
| Slovak | windows-1250, iso-8859-2, ibm852, x-mac-ce |
| Slovene | iso-8859-2, iso-8859-16, windows-1250, ibm852, x-mac-ce |
| Spanish | iso-8859-1, iso-8859-15, windows-1252 |
| Swedish | iso-8859-1, iso-8859-4, iso-8859-9, iso-8859-15, windows-1252 |
| Thai | tis-620, iso-8859-11 |
| Turkish | iso-8859-3, iso-8859-9 |
| Vietnamese | viscii, windows-1258 |
| Others | windows-1252 |

Remarks: .NET does not provide an Encoding for some of these names: cp949, iso-2022-cn, euc-tw, iso-8859-10, iso-8859-16, viscii, X-ISO-10646-UCS-4-3412/X-ISO-10646-UCS-4-2143. For the following names, DetectionDetail.Encoding returns a suitable replacement instead:

  • cp949 to ks_c_5601-1987
  • iso-2022-cn to x-cp50227
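
The replacement idea can be sketched as a small lookup that maps a detected name to a .NET encoding name before calling Encoding.GetEncoding, and returns null when nothing is available. This is an illustration only, not the library's code; EncodingAliases and TryGetEncoding are hypothetical names, and whether the two replacement encodings resolve depends on the platform and on which encoding providers are registered.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class EncodingAliases
{
    // Detected names that .NET does not know, mapped to the replacement
    // names listed above.
    private static readonly Dictionary<string, string> Replacements =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "cp949", "ks_c_5601-1987" },
            { "iso-2022-cn", "x-cp50227" },
        };

    // Returns the matching Encoding, or null when .NET has none.
    public static Encoding TryGetEncoding(string detectedName)
    {
        string name = Replacements.TryGetValue(detectedName, out string mapped)
            ? mapped
            : detectedName;
        try
        {
            return Encoding.GetEncoding(name);
        }
        catch (ArgumentException)
        {
            // Unknown name, or a code page this platform does not support.
            return null;
        }
    }
}
```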

License

The library is subject to the Mozilla Public License Version 1.1 (the "License"). Alternatively, it may be used under the terms of either the GNU General Public License Version 2 or later (the "GPL"), or the GNU Lesser General Public License Version 2.1 or later (the "LGPL").

Test data has been extracted from Wikipedia and Project Gutenberg books and is subject to their licenses.
