unzalgo

Transforms ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋ into this without breaking internationalization.

Installation

$ npm install unzalgo

About

You can use unzalgo to both detect Zalgo text and transform it back into normal text without breaking internationalization. For example, you could transform:

T͘H͈̩̬̺̩̭͇I͏̼̪͚̪͚S͇̬̺ ́E̬̬͈̮̻̕V҉̙I̧͖̜̹̩̞̱L͇͍̝ ̺̮̟̙̘͎U͝S̞̫̞͝E͚̘͝R IṊ͍̬͞P̫Ù̹̳̝͓̙̙T̜͕̺̺̳̘͝

into

THIS EVIL USER INPUT

while also keeping

thiŝ te̅xt unchanged, since some lângûaĝes aĉtuallŷ uŝe thêse sŷmbo̅ls,

and, at the same time, keep all diacritics in

Z nich ovšem pouze předposlední sdílí s výše uvedenou větou příliš žluťoučký kůň úpěl […]

which remains unchanged after a transformation.

Is there a demo?

Yes! You can check it out here. You can edit the text at the top; the lower part shows the text after clean using the default threshold.

How does it work?

In Unicode, every character is assigned to a character category. Zalgo text uses characters that belong to the categories Mn (Mark, Nonspacing) or Me (Mark, Enclosing).

First, the text is divided into words; each word is then assigned to a score that corresponds to the usage of the categories above, combined with small use of statistics. If the score exceeds a threshold, we're able to detect Zalgo text (which allows us to strip away all characters from the above categories).

Getting started

Regular cleaning

import { clean } from "unzalgo";
assert("this" === clean("ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋"));

Partial cleaning

import { clean } from "unzalgo";
assert("Somê̋̂͠ Zalgó̈́̅͠ text.̕͘̕͘͠" === clean("S̷̡̡̡̱̦̣̹̭̻͔̣͖̤̜̮̓̒̋̋͌̄̊̄̎̓o̷̡̢̨̗͔̤̫͙̖̙͉̲̘͙͔͖̤͎̙͓̳̣̳̣͋̀̇́̈̏̾̇̇̀̎̔̓̇̆͝ḿ̸̨̮̟̣̙̮̲̭͎͓͖̘̒̀͌͆͊̿̾̄̽̀̔̈́̍̒͒̔̕͝e̷̛͖̤͍̬͖̔̈̆̐͌̃̓͌̽͑̾̐̇̑̇̈̂̋̂͠ ̵̢̨̞͕̥̯̼͈̺̖̞̥̳̤̓̇̓̓̈͆Z̷̡̬̱̺̘̹̭̙̭͚̼̝̤̳̦̲̬̜͌͌̊̆ͅͅa̷̢̡̺͕͈̰̮̲͔̙̱̼͉̲̼̝̝̻̹̱̹̝̗̿̋͐̊͑͑͐̽̆̉̓͋̽̅̈́̚͜͝l̶͇͍͈̞̠̜͕͒̑͆̇̊̚͝g̸̨̛͎͚͚̗̘͙͔͓̠̝͔̬̳̗̯̮͍̻̥̃͊̏̐̌͒̀̓͛͠ȭ̷̢̡̧̤͇̮͕̘̱̬̖̪͈̘̟̉͑͌̑̿̇̊̿͛͊̎͌̀̽͛͋̃̑́̈́̈́̅͠ ̷̛͙͙̜̫̼̙̯́̉̊̿̈́́̽͛̓̓̊̓̋̏̀͌͠ͅt̷̠̞̯̤̃̇̒̾͒̑͋̒̈́͋͗̉̉͐̍̾͑̈́̈́͌͆̀̂̋͌͜ȩ̶̧̢̡̛̛̣͕̥͕͇̖͈̗͍̖̠͚̮͙̅̂̌́̐͛͗̽͋́̿͂̅̒͌̐͆̏̕͜͝͝͝͠x̶̡̧̛͚̗̖̙͚͍̻̙̥͓͖͕͍̮͖̙̙̜͓͈̩̯̐͛̏̍́͌̏̂̀̐͛͂̈́̆́̀̒̉̾̈́͌͘͜͠t̸̨̨̲̟͎̩̱̹̬͙̩̠͇̪͒̃̒͛̍̎̂͒̀́́̍.̵̮͐̋̐͐̅̿̿́̊͑́̂͗͂̊̽̚̕̕͘̕͘͠", {
	targetDensity: 0.5
}));

Configuring detection

import { clean } from "unzalgo";
/* Clean only if there are no "normal" characters in the word (t, h, i and s are "normal") */
assert("ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋" === clean("ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋", {
	detectionThreshold: 1
}));

/* Clean only if there is at least one combining character */
import { clean } from "unzalgo";
assert("francais" === clean("français", {
	detectionThreshold: 0
}));

import { clean } from "unzalgo";
/* `français` remains intact by default */
assert("français" === clean("français"));

Internationalization

import { isZalgo } from "unzalgo";
/* "français" is not a Zalgo text, of course */
assert(isZalgo("français") === false);

import { isZalgo } from "unzalgo";
/* Unless you define the Zalgo property as containing combining characters */
assert(isZalgo("français", 0) === true);

import { isZalgo } from "unzalgo";
/* You can also define the Zalgo property as consisting of nothing but combining characters */
assert(isZalgo("français", 1) === false);

Detection threshold

Some of this library's functions accept a detectionThreshold option that lets you configure how sensitively unzalgo behaves. It is a number from 0 to 1 and defaults to 0.55.

A detection threshold of 0 indicates that a string should be classified as Zalgo text if at least 0 % of its codepoints have the Unicode category Mn or Me.

A detection threshold of 1 indicates that a string should be classified as Zalgo text if at least 100 % of its codepoints have the Unicode category Mn or Me.

Exports

clean(string[, options]): string [default export]

Removes all combining characters for every word in a string if the word is classified as Zalgo text. If targetDensity is specified, not all the Zalgo characters will be removed. Instead, they will be thinned out uniformly.

Returns a cleaned, more readable string.

Arguments:

string: string A string for which combining characters are removed for every word whose Zalgo property is met.
options: object An object of options.
options.detectionThreshold: number = 0.55 A threshold ∈ [0, 1]. The higher the threshold, the more combining characters are needed for it to be detected as Zalgo text.
options.targetDensity: number = 0 A threshold ∈ [0, 1]. The density is defined as the ratio of combining characters to all characters. The higher the density, the more combining characters will be part of the result. The result is guaranteed to have a density that is less than or equal to the one provided. A target density of 0 indicates that none of the combining characters are allowed to be part of the result. A target density of 1 indicates that all combining characters should be part of the result. Any number in between those limits will attempt to uniformly reduce the occurrence of combining characters. This reduction step assumes that the combining characters are uniformly distributed in the input string.

computeScores(string): number[]

Computes a score ∈ [0, 1] for every word in the input string. Each score represents the ratio of Zalgo characters to total characters in a word.

Returns An array of scores where each score describes the Zalgo ratio of a word.

Arguments:

string: string The input string for which to compute scores.

isZalgo(string[, detectionThreshold = 0.55]): boolean

Determines if the string consists of Zalgo text. Note that the occurrence of a combining character is not enough to trigger the detection. Instead, it computes a ratio for the input string and checks if it exceeds a given threshold. Thus, internationalized strings aren't automatically classified as Zalgo text.

Returns whether the string is a Zalgo text string.

Arguments:

string: string A string for which a Zalgo text check is run.
detectionThreshold: number = 0.55 A threshold ∈ [0, 1]. The higher the threshold, the more combining characters are needed for it to be detected as Zalgo text.

Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

kdex / unzalgo

Programming Languages

Labels

Projects that are alternatives of or similar to unzalgo

unzalgo

Installation

About

Is there a demo?

How does it work?

Getting started

Regular cleaning

Partial cleaning

Configuring detection

Internationalization

Detection threshold

Exports

clean(string[, options]): string [default export]

computeScores(string): number[]

isZalgo(string[, detectionThreshold = 0.55]): boolean