kdex / unzalgo

Licence: GPL-3.0 license
Transforms ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋ into this without breaking internationalization.

Programming Languages

javascript
184084 projects - #8 most used programming language

Projects that are alternatives of or similar to unzalgo

character
tool for character manipulations
Stars: ✭ 26 (-31.58%)
Mutual labels:  unicode
attic
A collection of personal tiny tools - mirror of https://gitlab.com/hydrargyrum/attic
Stars: ✭ 17 (-55.26%)
Mutual labels:  unicode
CSV2RDF
Streaming, transforming, SPARQL-based CSV to RDF converter. Apache license.
Stars: ✭ 48 (+26.32%)
Mutual labels:  transformation
BLogger
An easy to use modern C++14/17 async cross-platform logger which supports custom formatting/patterns, colored output, Unicode, file logging, log rotation & more!
Stars: ✭ 23 (-39.47%)
Mutual labels:  unicode
cq
Clojure Command-line Data Processor for JSON, YAML, EDN, XML and more
Stars: ✭ 111 (+192.11%)
Mutual labels:  transformation
couplet
Unicode code points support for Clojure
Stars: ✭ 21 (-44.74%)
Mutual labels:  unicode
thesis template
A comprehensive LaTeX template with examples for theses, books and more, employing the 'latest and greatest' (UTF8, glossaries, fonts, ...). The PDF artifact is built using CI/CD.
Stars: ✭ 121 (+218.42%)
Mutual labels:  unicode
stringx
Drop-in replacements for base R string functions powered by stringi
Stars: ✭ 14 (-63.16%)
Mutual labels:  unicode
Tehreer-Cocoa
Standalone text engine for iOS
Stars: ✭ 31 (-18.42%)
Mutual labels:  unicode
sugartex
SugarTeX is a more readable LaTeX language extension and transcompiler to LaTeX. Fast Unicode autocomplete in Atom editor via https://github.com/kiwi0fruit/atom-sugartex-completions
Stars: ✭ 74 (+94.74%)
Mutual labels:  unicode
opentype-shaping-documents
Documentation of OpenType shaping behavior
Stars: ✭ 121 (+218.42%)
Mutual labels:  unicode
urdu-characters
📄 Complete collection of Urdu language characters & unicode code points.
Stars: ✭ 24 (-36.84%)
Mutual labels:  unicode
Image Processing
Image Processing techniques using OpenCV and Python.
Stars: ✭ 112 (+194.74%)
Mutual labels:  transformation
rouziclib
This is my personal library of code that is common to my different projects (Photosounder, SplineEQ, Spiral and others)
Stars: ✭ 38 (+0%)
Mutual labels:  unicode
Lingo
Text encoding for modern C++
Stars: ✭ 28 (-26.32%)
Mutual labels:  unicode
TypeGame
👾 Sokoban Game in Pure TypeScript Type System
Stars: ✭ 222 (+484.21%)
Mutual labels:  unicode
durdraw
Animated Unicode, ANSI and ASCII Art Editor for Linux/Unix/macOS
Stars: ✭ 55 (+44.74%)
Mutual labels:  unicode
2048-rs
Rust implementation of 2048 game
Stars: ✭ 15 (-60.53%)
Mutual labels:  unicode
flag-emoji-replacements
'🇩🇰🇲🇬'.replace('🇰🇲', '🇪🇨'); // → '🇩🇪🇨🇬'
Stars: ✭ 37 (-2.63%)
Mutual labels:  unicode
Stringy
🉑 Stringy - A PHP string manipulation library with multibyte support, performance optimized
Stars: ✭ 135 (+255.26%)
Mutual labels:  unicode

unzalgo

Transforms ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋ into this without breaking internationalization.

Installation

$ npm install unzalgo

About

You can use unzalgo to both detect Zalgo text and transform it back into normal text without breaking internationalization. For example, you could transform:

T͘H͈̩̬̺̩̭͇I͏̼̪͚̪͚S͇̬̺ ́E̬̬͈̮̻̕V҉̙I̧͖̜̹̩̞̱L͇͍̝ ̺̮̟̙̘͎U͝S̞̫̞͝E͚̘͝R IṊ͍̬͞P̫Ù̹̳̝͓̙̙T̜͕̺̺̳̘͝

into

THIS EVIL USER INPUT

while also keeping

thiŝ te̅xt unchanged, since some lângûaĝes aĉtuallŷ uŝe thêse sŷmbo̅ls,

and, at the same time, keep all diacritics in

Z nich ovšem pouze předposlední sdílí s výše uvedenou větou příliš žluťoučký kůň úpěl […]

which remains unchanged after a transformation.

Is there a demo?

Yes! You can check it out here. Edit the text at the top; the lower part shows the result of clean using the default detection threshold.

How does it work?

In Unicode, every character is assigned a general category. Zalgo text is built from characters that belong to the categories Mn (Mark, Nonspacing) or Me (Mark, Enclosing), stacked onto ordinary letters.
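As an illustration of the two categories, the sketch below tests code points against Mn and Me using JavaScript's Unicode property escapes. The helper name countCombiningMarks is made up for this example and is not part of unzalgo; whether the library normalizes its input to NFD is also an assumption.

```javascript
// Matches a single code point whose general category is Mn or Me.
const COMBINING_MARK = /[\p{Mn}\p{Me}]/u;

// Hypothetical helper (not part of unzalgo): counts the Mn/Me code points in a string.
function countCombiningMarks(text) {
	let count = 0;
	// Iterating a string with for...of yields whole code points, not UTF-16 units.
	for (const codePoint of text) {
		if (COMBINING_MARK.test(codePoint)) {
			count++;
		}
	}
	return count;
}

// "e" followed by U+0301 (combining acute accent, category Mn)
console.log(countCombiningMarks("e\u0301")); // 1
// Precomposed U+00E9 is a letter (category Ll), so nothing is counted...
console.log(countCombiningMarks("\u00E9")); // 0
// ...until it is decomposed with NFD normalization.
console.log(countCombiningMarks("\u00E9".normalize("NFD"))); // 1
```

Precomposed letters such as é only expose their combining mark after decomposition, which is why this sketch normalizes before counting.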

First, the text is divided into words. Each word is then assigned a score that reflects how heavily it uses the categories above, combined with some light statistics. If a word's score exceeds a threshold, the word is classified as Zalgo text, and all characters from the above categories can be stripped from it.
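The word-by-word procedure can be sketched roughly as follows. This is a simplified illustration, not the library's implementation: it scores each word by the plain ratio of Mn/Me code points, leaves out the statistics mentioned above, and assumes NFD normalization and whitespace-based word splitting. The names scoreWord and cleanSketch are hypothetical.

```javascript
const IS_MARK = /[\p{Mn}\p{Me}]/u;
const ALL_MARKS = /[\p{Mn}\p{Me}]/gu;

// Hypothetical score: share of Mn/Me code points among all code points of a word.
function scoreWord(word) {
	const codePoints = [...word.normalize("NFD")];
	if (codePoints.length === 0) {
		return 0;
	}
	const markCount = codePoints.filter(codePoint => IS_MARK.test(codePoint)).length;
	return markCount / codePoints.length;
}

// Hypothetical cleaner: strips marks only from words whose score exceeds the threshold.
function cleanSketch(text, detectionThreshold = 0.55) {
	return text
		// The capturing group keeps the whitespace separators so the text can be rejoined as-is.
		.split(/(\s+)/)
		.map(word => scoreWord(word) > detectionThreshold
			? word.normalize("NFD").replace(ALL_MARKS, "").normalize("NFC")
			: word)
		.join("");
}

// Two base letters carrying seven marks score 7/9, so the word is stripped;
// "français" scores 1/9 and is left untouched.
console.log(cleanSketch("h\u0300\u0301\u0302\u0303i\u0304\u0305\u0306 fran\u00E7ais"));
```

Scoring per word rather than per string means a single heavily decorated word cannot hide inside a long run of clean text.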

Getting started

Regular cleaning

import assert from "node:assert"; /* assumption: the examples use Node's built-in assert */
import { clean } from "unzalgo";
assert("this" === clean("ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋"));

Partial cleaning

import assert from "node:assert";
import { clean } from "unzalgo";
assert("Somê̋̂͠ Zalgó̈́̅͠ text.̕͘̕͘͠" === clean("S̷̡̡̡̱̦̣̹̭̻͔̣͖̤̜̮̓̒̋̋͌̄̊̄̎̓o̷̡̢̨̗͔̤̫͙̖̙͉̲̘͙͔͖̤͎̙͓̳̣̳̣͋̀̇́̈̏̾̇̇̀̎̔̓̇̆͝ḿ̸̨̮̟̣̙̮̲̭͎͓͖̘̒̀͌͆͊̿̾̄̽̀̔̈́̍̒͒̔̕͝e̷̛͖̤͍̬͖̔̈̆̐͌̃̓͌̽͑̾̐̇̑̇̈̂̋̂͠ ̵̢̨̞͕̥̯̼͈̺̖̞̥̳̤̓̇̓̓̈͆Z̷̡̬̱̺̘̹̭̙̭͚̼̝̤̳̦̲̬̜͌͌̊̆ͅͅa̷̢̡̺͕͈̰̮̲͔̙̱̼͉̲̼̝̝̻̹̱̹̝̗̿̋͐̊͑͑͐̽̆̉̓͋̽̅̈́̚͜͝l̶͇͍͈̞̠̜͕͒̑͆̇̊̚͝g̸̨̛͎͚͚̗̘͙͔͓̠̝͔̬̳̗̯̮͍̻̥̃͊̏̐̌͒̀̓͛͠ȭ̷̢̡̧̤͇̮͕̘̱̬̖̪͈̘̟̉͑͌̑̿̇̊̿͛͊̎͌̀̽͛͋̃̑́̈́̈́̅͠ ̷̛͙͙̜̫̼̙̯́̉̊̿̈́́̽͛̓̓̊̓̋̏̀͌͠ͅt̷̠̞̯̤̃̇̒̾͒̑͋̒̈́͋͗̉̉͐̍̾͑̈́̈́͌͆̀̂̋͌͜ȩ̶̧̢̡̛̛̣͕̥͕͇̖͈̗͍̖̠͚̮͙̅̂̌́̐͛͗̽͋́̿͂̅̒͌̐͆̏̕͜͝͝͝͠x̶̡̧̛͚̗̖̙͚͍̻̙̥͓͖͕͍̮͖̙̙̜͓͈̩̯̐͛̏̍́͌̏̂̀̐͛͂̈́̆́̀̒̉̾̈́͌͘͜͠t̸̨̨̲̟͎̩̱̹̬͙̩̠͇̪͒̃̒͛̍̎̂͒̀́́̍.̵̮͐̋̐͐̅̿̿́̊͑́̂͗͂̊̽̚̕̕͘̕͘͠", {
	targetDensity: 0.5
}));

Configuring detection

import assert from "node:assert";
import { clean } from "unzalgo";
/* Clean only if there are no "normal" characters in the word (t, h, i and s are "normal") */
assert("ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋" === clean("ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋", {
	detectionThreshold: 1
}));
/* Clean only if there is at least one combining character */
assert("francais" === clean("français", {
	detectionThreshold: 0
}));
/* `français` remains intact by default */
assert("français" === clean("français"));

Internationalization

import assert from "node:assert";
import { isZalgo } from "unzalgo";
/* "français" is not a Zalgo text, of course */
assert(isZalgo("français") === false);
/* Unless you define the Zalgo property as containing combining characters */
assert(isZalgo("français", 0) === true);
/* You can also define the Zalgo property as consisting of nothing but combining characters */
assert(isZalgo("français", 1) === false);

Detection threshold

Some of this library's functions accept a detectionThreshold option that lets you configure how sensitively unzalgo behaves. It is a number from 0 to 1 and defaults to 0.55.

A detection threshold of 0 indicates that a string should be classified as Zalgo text if at least 0 % of its codepoints have the Unicode category Mn or Me.

A detection threshold of 1 indicates that a string should be classified as Zalgo text if at least 100 % of its codepoints have the Unicode category Mn or Me.

Exports

clean(string[, options]): string [default export]

Removes all combining characters from every word in a string that is classified as Zalgo text. If targetDensity is specified, not all the Zalgo characters will be removed. Instead, they will be thinned out uniformly.

Returns a cleaned, more readable string.

Arguments:

  • string: string The string from which combining characters are removed in every word whose Zalgo property is met.
  • options: object An object of options.
  • options.detectionThreshold: number = 0.55 A threshold ∈ [0, 1]. The higher the threshold, the more combining characters are needed for it to be detected as Zalgo text.
  • options.targetDensity: number = 0 A density ∈ [0, 1], defined as the ratio of combining characters to all characters. The higher the density, the more combining characters remain in the result; the result is guaranteed to have a density that is less than or equal to the one provided. A target density of 0 removes all combining characters, and a target density of 1 keeps all of them. Any number in between those limits will attempt to uniformly reduce the occurrence of combining characters. This reduction step assumes that the combining characters are uniformly distributed in the input string.
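
One way to picture the targetDensity reduction is the sketch below. It is an assumption-laden illustration rather than the library's algorithm: the function thinMarks is hypothetical, it works on a single word in NFD form, and it keeps evenly spaced marks so that the resulting density does not exceed the target.

```javascript
const IS_MARK = /[\p{Mn}\p{Me}]/u;

// Hypothetical helper (not part of unzalgo): keeps an evenly spaced subset of combining marks.
function thinMarks(word, targetDensity) {
	const codePoints = [...word.normalize("NFD")];
	const markCount = codePoints.filter(codePoint => IS_MARK.test(codePoint)).length;
	const baseCount = codePoints.length - markCount;
	// Largest number of marks k such that k / (baseCount + k) <= targetDensity.
	let keep = targetDensity >= 1
		? markCount
		: Math.min(markCount, Math.floor(targetDensity * baseCount / (1 - targetDensity)));
	// Guard against floating-point rounding pushing the density above the target.
	while (keep > 0 && keep / (baseCount + keep) > targetDensity) {
		keep--;
	}
	let seen = 0;
	return codePoints.filter(codePoint => {
		if (!IS_MARK.test(codePoint)) {
			return true;
		}
		// Exactly `keep` of the `markCount` marks satisfy this condition, spread evenly.
		const kept = Math.floor((seen + 1) * keep / markCount) > Math.floor(seen * keep / markCount);
		seen++;
		return kept;
	}).join("");
}

// Two base letters with eight marks: a target density of 0.5 leaves room for two marks.
const thinned = thinMarks("a\u0300\u0301\u0302\u0303b\u0304\u0305\u0306\u0307", 0.5);
console.log([...thinned].length); // 4
```

Solving k / (baseCount + k) ≤ targetDensity for k gives k ≤ targetDensity · baseCount / (1 − targetDensity), which is where the bound in the sketch comes from.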

computeScores(string): number[]

Computes a score ∈ [0, 1] for every word in the input string. Each score represents the ratio of Zalgo characters to total characters in a word.

Returns an array of scores where each score describes the Zalgo ratio of a word.

Arguments:

  • string: string The input string for which to compute scores.

isZalgo(string[, detectionThreshold = 0.55]): boolean

Determines if the string consists of Zalgo text. Note that the occurrence of a combining character is not enough to trigger the detection. Instead, it computes a ratio for the input string and checks if it exceeds a given threshold. Thus, internationalized strings aren't automatically classified as Zalgo text.

Returns whether the string is a Zalgo text string.

Arguments:

  • string: string A string for which a Zalgo text check is run.
  • detectionThreshold: number = 0.55 A threshold ∈ [0, 1]. The higher the threshold, the more combining characters are needed for it to be detected as Zalgo text.