Proposal to add Unicode property escapes `\p{…}` and `\P{…}` to regular expressions in ECMAScript.

ECMAScript proposal: Unicode property escapes in regular expressions

Status

This proposal is at stage 4 of the TC39 process and is scheduled to be included in ES2018.

Motivation

The Unicode Standard assigns various properties and property values to every symbol. For example, to get the set of symbols that are used exclusively in the Greek script, search the Unicode database for symbols whose Script property is set to Greek.

There currently is no way to access these Unicode character properties natively in ECMAScript regular expressions. This makes it painful for developers to support full Unicode in their regular expressions. They currently have two options, neither of which is ideal:

  1. Use a library such as XRegExp to create the regular expressions at run-time:

    const regexGreekSymbol = XRegExp('\\p{Greek}', 'A');
    regexGreekSymbol.test('π');
    // → true
    

    The downside of this approach is that the XRegExp library is a run-time dependency which may not be ideal for performance-sensitive applications. For usage on the web, there is an additional load-time performance penalty: xregexp-all-min.js.gz takes up over 35 KB of space after minifying and applying gzip compression. Whenever the Unicode Standard is updated, a new version of XRegExp must be published and end users need to update their XRegExp copy in order to use the latest available data.

  2. Use a library such as Regenerate to generate the regular expression at build time:

    const regenerate = require('regenerate');
    const codePoints = require('unicode-9.0.0/Script/Greek/code-points.js');
    const set = regenerate(codePoints);
    set.toString();
    // → '[\u0370-\u0373\u0375-\u0377\u037A-\u037D\u037F\u0384\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03E1\u03F0-\u03FF\u1D26-\u1D2A\u1D5D-\u1D61\u1D66-\u1D6A\u1DBF\u1F00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FC4\u1FC6-\u1FD3\u1FD6-\u1FDB\u1FDD-\u1FEF\u1FF2-\u1FF4\u1FF6-\u1FFE\u2126\uAB65]|\uD800[\uDD40-\uDD8E\uDDA0]|\uD834[\uDE00-\uDE45]'
    // Imagine there’s more code here to save this pattern to a file.
    

    This approach results in optimal run-time performance, although the generated regular expressions tend to be fairly large in size (which could lead to load-time performance problems on the web). The biggest downside is that it requires a build script, which gets painful as the developer needs more Unicode-aware regular expressions. Whenever the Unicode Standard is updated, the build script must be updated and its results must be deployed in order to use the latest available data.

Proposed solution

We propose the addition of Unicode property escapes of the form \p{…} and \P{…}. Unicode property escapes are a new type of escape sequence available in regular expressions that have the u flag set. With this feature, the above regular expression could be written as:

const regexGreekSymbol = /\p{Script=Greek}/u;
regexGreekSymbol.test('π');
// → true

This proposal solves all the above-mentioned problems:

  • It is no longer painful to create Unicode-aware regular expressions.
  • There is no dependency on run-time libraries.
  • The regular expression patterns are compact and readable, with no more file size bloat.
  • Creating a script that generates the regular expression at build time is no longer necessary.
  • Code that uses Unicode property escapes stays up-to-date “automatically” from the developer’s point of view: whenever the Unicode Standard gets an update, the ECMAScript engine updates its data.

High-level API

Unicode property escapes for non-binary Unicode properties look like this:

\p{UnicodePropertyName=UnicodePropertyValue}

The aliases defined in PropertyAliases.txt and PropertyValueAliases.txt may be used instead of the canonical property and value names. The use of an unknown property name or value triggers an early SyntaxError.
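As a rough illustration of the alias rule, and assuming an engine that implements this proposal (for example a recent Node.js release), the canonical and aliased spellings below should select the same symbols, while an unknown name is rejected when the pattern is compiled. The helper name `compiles` is hypothetical and only exists for this sketch.

```javascript
// Canonical names and their aliases select the same set of symbols.
const canonical = /^\p{Script=Greek}$/u;
const aliased = /^\p{sc=Grek}$/u;

const pi = '\u03C0'; // U+03C0 GREEK SMALL LETTER PI
console.log(canonical.test(pi), aliased.test(pi));   // true true
console.log(canonical.test('p'), aliased.test('p')); // false false

// Hypothetical helper: reports whether a pattern compiles with the `u` flag.
// `new RegExp` is used so the SyntaxError can be caught at run time; in a
// regular expression literal the same error is reported while parsing.
function compiles(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

console.log(compiles('\\p{Script=Greek}'));         // true
console.log(compiles('\\p{Script=NoSuchScript}'));  // false
console.log(compiles('\\p{NoSuchProperty=Greek}')); // false
```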

For binary properties, the following syntax is available:

\p{LoneUnicodePropertyNameOrValue}

This syntax may also be used as a shorthand for General_Category values, e.g. \p{Letter} instead of \p{General_Category=Letter}.

\P{…} is the negated form of \p{…}.
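The three forms above can be sketched side by side as follows. This is only an illustration under the assumption that the engine supports the proposal; the function name `describe` is made up for the example.

```javascript
// Binary property: a lone property name.
const regexUppercase = /^\p{Uppercase}$/u;

// General_Category: the long form, the lone value, and the lone alias.
const regexLetterLong = /^\p{General_Category=Letter}$/u;
const regexLetterLone = /^\p{Letter}$/u;
const regexLetterAlias = /^\p{L}$/u;

// Negated form: any symbol that is not a letter.
const regexNotLetter = /^\P{Letter}$/u;

// Hypothetical helper that summarizes how one symbol is classified.
function describe(symbol) {
  const letter = regexLetterLone.test(symbol);
  return {
    symbol,
    uppercase: regexUppercase.test(symbol),
    letter,
    // All three spellings of the Letter category should agree.
    spellingsAgree:
      regexLetterLong.test(symbol) === letter &&
      regexLetterAlias.test(symbol) === letter,
    notLetter: regexNotLetter.test(symbol),
  };
}

for (const symbol of ['A', 'a', '\u03C0', '7', '!']) {
  console.log(describe(symbol));
}
```

For every symbol in the loop, `letter` and `notLetter` are expected to be opposites, since `\P{…}` matches exactly what `\p{…}` does not.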

Implementations must support the list of Unicode properties and their property aliases mentioned in the spec proposal. This includes General_Category, Script, Script_Extensions, and some binary properties (including but not limited to Alphabetic, Uppercase, Lowercase, White_Space, Noncharacter_Code_Point, Default_Ignorable_Code_Point, Any, ASCII, Assigned, ID_Start, ID_Continue, Join_Control, Emoji_Presentation, Emoji_Modifier, and Emoji_Modifier_Base). This is a superset of what UTS18 RL1.2 requires. To ensure interoperability, implementations must not extend Unicode property support to the remaining properties.
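One way to probe which properties an engine accepts is to try compiling a pattern and catch the early error. The sketch below assumes a conforming engine; the helper name `isSupported` is hypothetical, and Line_Break is used as an example of a property outside the required list.

```javascript
// Hypothetical helper: true if \p{<expression>} compiles with the `u` flag.
function isSupported(propertyExpression) {
  try {
    new RegExp(`\\p{${propertyExpression}}`, 'u');
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

console.log(isSupported('General_Category=Letter')); // true
console.log(isSupported('Script_Extensions=Greek')); // true
console.log(isSupported('White_Space'));             // true
console.log(isSupported('Any'));                     // true

// Line_Break is a real Unicode property, but it is not in the supported
// list, so a conforming engine rejects it.
console.log(isSupported('Line_Break=Break_After'));  // false

// The supported binary properties can be used like any other escape.
const regexAsciiOnly = /^\p{ASCII}+$/u;
console.log(regexAsciiOnly.test('plain text'));  // true
console.log(regexAsciiOnly.test('na\u00EFve'));  // false (U+00EF is not ASCII)
```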

FAQ

What about backwards compatibility?

In regular expressions without the u flag, the pattern \p is an (unnecessary) escape sequence for p. Patterns of the form \p{Letter} might already be present in existing regular expressions without the u flag, and therefore we cannot assign new meaning to such patterns without breaking backwards compatibility.

For this reason, ECMAScript 2015 made unnecessary escape sequences like \p and \P throw an exception when the u flag is set. This enables us to change the meaning of \p{…} and \P{…} in regular expressions with the u flag without breaking backwards compatibility.
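The difference between the two modes can be observed directly. The following sketch assumes a web-compatible engine (one that implements the Annex B regular expression grammar, as browsers and Node.js do) with support for this proposal.

```javascript
// Without the `u` flag, \p is an identity escape for "p" and the braces are
// literal characters, so the pattern matches the text "p{Letter}".
const legacy = /^\p{Letter}$/;

// With the `u` flag, the same source text is a Unicode property escape.
const unicodeAware = /^\p{Letter}$/u;

const eAcute = '\u00E9'; // U+00E9 LATIN SMALL LETTER E WITH ACUTE

console.log(legacy.test('p{Letter}'));       // true
console.log(legacy.test(eAcute));            // false
console.log(unicodeAware.test('p{Letter}')); // false
console.log(unicodeAware.test(eAcute));      // true
```

Existing patterns without the `u` flag keep their old meaning, which is why the new syntax is only available when the flag is set.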

Why not support loose matching?

UAX44-LM3 specifies the loose matching rules for comparing Unicode property and value aliases.

Ignore case, whitespace, underscores, hyphens, […]

Loose matching makes \p{lB=Ba} equivalent to \p{Line_Break=Break_After}, and /\p{___lower C-A-S-E___}/u equivalent to /\p{Lowercase}/u. We assert that this feature does not add any value, and in fact harms code readability and maintainability.

Should the need arise, then support for loose matching can always be added later, as part of a separate ECMAScript proposal. If we add it now, however, there is no going back.
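To make the strictness concrete, the sketch below checks a handful of spellings that loose matching would treat as equivalent. It assumes a conforming engine; `isAccepted` is a hypothetical helper name.

```javascript
// Hypothetical helper: true if the pattern source compiles with the `u` flag.
function isAccepted(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

const spellings = [
  '\\p{Lowercase}',     // exact property name: accepted
  '\\p{lowercase}',     // wrong case
  '\\p{LOWERCASE}',     // wrong case
  '\\p{Lower_case}',    // extra underscore
  '\\p{ Lowercase }',   // whitespace
  '\\p{script=Greek}',  // wrong case in the property name
  '\\p{Script=greek}',  // wrong case in the property value
];

for (const source of spellings) {
  console.log(source, isAccepted(source));
}
// Only the first spelling is expected to print `true`.
```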

Why not support the is prefix?

UAX44-LM3 specifies the loose matching rules for comparing Unicode property and value aliases, one of which is:

Ignore […] any initial prefix string is.

This rule makes Script=IsGreek and IsScript=Greek equivalent to Script=Greek. We assert that this feature does not add any value, and in fact harms code readability. It introduces ambiguity and increases implementation complexity, since some property values or aliases already start with is, e.g. Decomposition_Type=Isolated and Line_Break=IS which is an alias for Line_Break=Infix_Numeric.

Compatibility with Unicode property escapes in other languages is not an argument either, since no existing regular expression engine seems to implement the is prefix exactly as described in UAX44-LM3, and those that partially implement it wildly differ in behavior.

Strictness is preferred over ambiguity.

Should the need arise, then support for the is prefix can always be added later, as part of a separate ECMAScript proposal. If we add it now, however, there is no going back.

Why not support e.g. \pL as a shorthand for \p{L}?

This shorthand doesn’t add any value and as such the added implementation complexity (small as it may be) isn’t worth it. \p{L} works; there’s no reason to introduce another syntax for it other than compatibility with other languages, which is a utopian goal anyhow.

Should the need arise, then support for this shorthand can always be added later, as part of a separate ECMAScript proposal. If we add it now, however, there is no going back.
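A short sketch of the resulting behavior, assuming a conforming, web-compatible engine: the braced form works, the brace-less shorthand is an early error with the `u` flag, and without the flag it keeps its legacy meaning.

```javascript
// The braced form is the only supported spelling.
const regexLetters = /^\p{L}+$/u;
console.log(regexLetters.test('Привет')); // true

// The brace-less shorthand is rejected when the `u` flag is set.
let shorthandError = null;
try {
  new RegExp('\\pL', 'u');
} catch (error) {
  shorthandError = error;
}
console.log(shorthandError instanceof SyntaxError); // true

// Without the `u` flag, \pL is just the identity escape \p followed by "L".
console.log(/^\pL$/.test('pL')); // true
```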

Why use = (and not something else) as a separator?

The = in \p{…=…} aligns with the = in (?=…) for positive lookaheads and (?<=…) for positive lookbehinds. Also, = is what most regular expression engines use as a separator. See issue #8 for more information.

Why not support : as a separator in addition to =?

Supporting multiple separators doesn’t add any value and as such the added implementation complexity (small as it may be) isn’t worth it. \p{Script_Extensions=Greek} works; there’s no reason to introduce another syntax for it other than compatibility with other languages, which is a utopian goal anyhow.

Should the need arise, then support for the : separator can always be added later, as part of a separate ECMAScript proposal. If we add it now, however, there is no going back.

Why not support e.g. \p{ScriptName} as a shorthand for \p{Script=ScriptName}?

In the majority of use cases, Script_Extensions should be used over Script. UTS24 explains this nicely with practical examples. As such, it would make more sense to add a shorthand for Script_Extensions than for Script. Doing either would cause confusion, however, since the sets of values for these two properties are identical. For example, it wouldn’t be clear whether \p{Old_Persian} refers to the Script value or the Script_Extensions value with that name.
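The difference between the two properties shows up with symbols that are shared between scripts. The sketch below uses U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK, whose Script value is Common but whose Script_Extensions include Hiragana and Katakana. The sample word is only an illustration, and a conforming engine is assumed.

```javascript
const prolongedSoundMark = '\u30FC'; // U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

// Script assigns exactly one value per symbol; here it is Common.
console.log(/\p{Script=Common}/u.test(prolongedSoundMark));   // true
console.log(/\p{Script=Hiragana}/u.test(prolongedSoundMark)); // false

// Script_Extensions lists every script the symbol is used with.
console.log(/\p{Script_Extensions=Hiragana}/u.test(prolongedSoundMark)); // true
console.log(/\p{Script_Extensions=Katakana}/u.test(prolongedSoundMark)); // true

// A Hiragana word containing the mark only matches as a whole with
// Script_Extensions.
const word = '\u3089\u30FC\u3081\u3093'; // "らーめん"
console.log(/^\p{Script=Hiragana}+$/u.test(word));            // false
console.log(/^\p{Script_Extensions=Hiragana}+$/u.test(word)); // true

// The lone script name is not a valid shorthand for either property.
let loneScriptError = null;
try {
  new RegExp('\\p{Hiragana}', 'u');
} catch (error) {
  loneScriptError = error;
}
console.log(loneScriptError instanceof SyntaxError); // true
```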

Why not overload \u{…} instead of adding \p{…} and \P{…}?

The main argument in favor of overloading \u{…} is that it hints that it is Unicode. We assert that this hint is unnecessary, as the required u flag on the regular expression already indicates Unicode.

The p in \p{…} stands for “property”. Combined with the u flag, this indicates nicely that the expression within the braces relates to a Unicode property.

Overloading \u{…} introduces an ambiguity. Imagine a new binary property or general category named Beef is added to the Unicode Standard. Since Beef consists of hexadecimal digits only ([A-Fa-f0-9]), it’s unclear whether \u{Beef} is a code point escape sequence for U+BEEF HANGUL SYLLABLE BBEGS or whether it’s a property escape sequence referring to the property/category named Beef.

Existing other languages with support for Unicode property escapes use \p{…} and \P{…}. Although compatibility with these other implementations is a non-goal (since they’re not compatible amongst themselves to begin with), it makes sense to follow the tradition here and re-use the base syntax that developers are already familiar with.

Why not support the Name property (\p{Name=…})?

Developers already have a way to refer to a specific symbol without having to use that symbol in their source code: Unicode code point escapes of the form \u{1D306}. As such, the need to support \p{Name=TETRAGRAM FOR CENTRE} is not strong enough to warrant inclusion in this proposal.

Support for the Name property can always be added later, as part of a separate ECMAScript proposal. If we add it now, however, there is no going back.

Illustrative examples

Unicode-aware version of \d

To match any decimal number in Unicode rather than just ASCII [0-9], use \p{Decimal_Number} instead of \d as per UTS18.

const regex = /^\p{Decimal_Number}+$/u;
regex.test('𝟏𝟐𝟑𝟜𝟝𝟞𝟩𝟪𝟫𝟬𝟭𝟮𝟯𝟺𝟻𝟼');
// → true

Unicode-aware version of \D

To match any Unicode symbol that is not a decimal number rather than just [^0-9], use \P{Decimal_Number} instead of \D.

const regex = /^\P{Decimal_Number}+$/u;
regex.test('Իմ օդաթիռը լի է օձաձկերով');
// → true

Unicode-aware version of \w

To match any word symbol in Unicode rather than just ASCII [a-zA-Z0-9_], use [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}] as per UTS18.

const regex = /([\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+)/gu;
const text = `
Amharic: የኔ ማንዣበቢያ መኪና በዓሣዎች ተሞልቷል
Bengali: আমার হভারক্রাফ্ট কুঁচে মাছ-এ ভরা হয়ে গেছে
Georgian: ჩემი ხომალდი საჰაერო ბალიშზე სავსეა გველთევზებით
Macedonian: Моето летачко возило е полно со јагули
Vietnamese: Tàu cánh ngầm của tôi đầy lươn
`;

let match;
while (match = regex.exec(text)) {
  const word = match[1];
  console.log(`Matched word with length ${ word.length }: ${ word }`);
}

Console output:

Matched word with length 7: Amharic
Matched word with length 2: የኔ
Matched word with length 6: ማንዣበቢያ
Matched word with length 3: መኪና
Matched word with length 5: በዓሣዎች
Matched word with length 5: ተሞልቷል
Matched word with length 7: Bengali
Matched word with length 4: আমার
Matched word with length 11: হভারক্রাফ্ট
Matched word with length 5: কুঁচে
Matched word with length 3: মাছ
Matched word with length 1: এ
Matched word with length 3: ভরা
Matched word with length 3: হয়ে
Matched word with length 4: গেছে
Matched word with length 8: Georgian
Matched word with length 4: ჩემი
Matched word with length 7: ხომალდი
Matched word with length 7: საჰაერო
Matched word with length 7: ბალიშზე
Matched word with length 6: სავსეა
Matched word with length 12: გველთევზებით
Matched word with length 10: Macedonian
Matched word with length 5: Моето
Matched word with length 7: летачко
Matched word with length 6: возило
Matched word with length 1: е
Matched word with length 5: полно
Matched word with length 2: со
Matched word with length 6: јагули
Matched word with length 10: Vietnamese
Matched word with length 3: Tàu
Matched word with length 4: cánh
Matched word with length 4: ngầm
Matched word with length 3: của
Matched word with length 3: tôi
Matched word with length 3: đầy
Matched word with length 4: lươn

Unicode-aware version of \W

To match any non-word symbol in Unicode rather than just [^a-zA-Z0-9_], use [^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}].
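As a minimal sketch of how this character class might be used, the snippet below splits a sentence on runs of non-word symbols and compares the result with the ASCII-only \W. The variable names are illustrative, and a conforming engine is assumed.

```javascript
// Runs of symbols that are not word symbols according to UTS18.
const regexNonWord = /[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+/u;

const sentence = 'Моето летачко возило, полно со јагули!';

// Splitting on non-word runs yields the words; filter(Boolean) drops the
// empty string produced by the trailing "!".
const unicodeWords = sentence.split(regexNonWord).filter(Boolean);
console.log(unicodeWords);
// → [ 'Моето', 'летачко', 'возило', 'полно', 'со', 'јагули' ]

// With \W, every Cyrillic letter counts as a non-word symbol, so the whole
// sentence is one separator and no words remain.
const asciiWords = sentence.split(/\W+/).filter(Boolean);
console.log(asciiWords);
// → []
```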

Matching emoji

To match emoji symbols, the binary properties from UTR51 come in handy.

const regex = /\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;

This regular expression matches, from left to right:

  1. emoji with optional modifiers (\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?);
  2. any remaining symbols that render as emoji rather than text by default (\p{Emoji_Presentation});
  3. symbols that render as text by default, but are forced to render as emoji using U+FE0F VARIATION SELECTOR-16 (\p{Emoji}\uFE0F).

const regex = /\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;
const text = `
\u{231A}: ⌚ default emoji presentation character (Emoji_Presentation)
\u{2194}\u{FE0F}: ↔️ default text presentation character rendered as emoji
\u{1F469}: 👩 emoji modifier base (Emoji_Modifier_Base)
\u{1F469}\u{1F3FF}: 👩🏿 emoji modifier base followed by a modifier
`;

let match;
while (match = regex.exec(text)) {
  const emoji = match[0];
  console.log(`Matched sequence ${ emoji } — code points: ${ [...emoji].length }`);
}

Console output:

Matched sequence ⌚ — code points: 1
Matched sequence ⌚ — code points: 1
Matched sequence ↔️ — code points: 2
Matched sequence ↔️ — code points: 2
Matched sequence 👩 — code points: 1
Matched sequence 👩 — code points: 1
Matched sequence 👩🏿 — code points: 2
Matched sequence 👩🏿 — code points: 2

Other examples

Match any numeric symbol in Unicode, including non-decimal symbols such as Roman numerals:

const regex = /^\p{Number}+$/u;
regex.test('²³¹¼½¾𝟏𝟐𝟑𝟜𝟝𝟞𝟩𝟪𝟫𝟬𝟭𝟮𝟯𝟺𝟻𝟼㉛㉜㉝ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿ');
// → true

Match ECMAScript IdentifierStart or IdentifierPart symbols without the need for complex regular expressions generated by build scripts:

const regexIdentifierStart = /[$_\p{ID_Start}]/u;
const regexIdentifierPart = /[$_\u200C\u200D\p{ID_Continue}]/u;
const regexIdentifierName = /^(?:[$_\p{ID_Start}])(?:[$_\u200C\u200D\p{ID_Continue}])*$/u;
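A brief usage sketch for the identifier pattern follows. The list of candidates is made up for illustration, and the pattern is repeated so the snippet runs on its own. Note that this checks only the character-level grammar: reserved words such as `class` also pass, and Unicode escape sequences inside identifiers are not handled.

```javascript
const regexIdentifierName = /^(?:[$_\p{ID_Start}])(?:[$_\u200C\u200D\p{ID_Continue}])*$/u;

// Hypothetical sample inputs.
const candidates = [
  'foo',        // plain ASCII identifier
  '$_bar9',     // $ and _ are allowed anywhere, digits after the start
  '\u03C0',     // "π", a non-ASCII ID_Start symbol
  'caf\u00E9',  // "café"
  '9lives',     // a digit is not a valid start symbol
  'a-b',        // "-" is not ID_Continue
  '',           // at least one start symbol is required
];

for (const candidate of candidates) {
  console.log(JSON.stringify(candidate), regexIdentifierName.test(candidate));
}
// The first four candidates are expected to print `true`, the rest `false`.
```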

Specification

Implementations
