
fancy-regex / Fancy Regex

License: MIT
Rust library for regular expressions using "fancy" features like look-around and backreferences

Programming Languages

rust
11053 projects

Projects that are alternatives to or similar to Fancy Regex

Regex
The Hoa\Regex library.
Stars: ✭ 308 (+54.77%)
Mutual labels:  regex, regular-expressions
Inferregex
Infer the regular expression (regex) of a string 🔤 🔢 🔍
Stars: ✭ 41 (-79.4%)
Mutual labels:  regex, regular-expressions
Social Media Profiles Regexs
📇 Extract social media profiles and more with regular expressions
Stars: ✭ 324 (+62.81%)
Mutual labels:  regex, regular-expressions
crystular
Crystal regex tester http://www.crystular.org/
Stars: ✭ 31 (-84.42%)
Mutual labels:  regex, regular-expressions
Proposal Regexp Unicode Property Escapes
Proposal to add Unicode property escapes `\p{…}` and `\P{…}` to regular expressions in ECMAScript.
Stars: ✭ 112 (-43.72%)
Mutual labels:  regex, regular-expressions
Pawn.Regex
🔎 Plugin that adds support for regular expressions in Pawn
Stars: ✭ 34 (-82.91%)
Mutual labels:  regex, regular-expressions
Verbalex
A library for creating complex, composable regular expressions with the reader & writer in mind. 🔍
Stars: ✭ 26 (-86.93%)
Mutual labels:  regex, regular-expressions
python-hyperscan
A CPython extension for the Hyperscan regular expression matching library.
Stars: ✭ 112 (-43.72%)
Mutual labels:  regex, regular-expressions
Npeg
PEGs for Nim, another take
Stars: ✭ 163 (-18.09%)
Mutual labels:  regex, regular-expressions
Youtube Regex
Best YouTube Video ID regex. Online: https://regex101.com/r/rN1qR5/2 and http://regexr.com/3anm9
Stars: ✭ 87 (-56.28%)
Mutual labels:  regex, regular-expressions
RgxGen
Regex: generate matching and non-matching strings based on a regex pattern.
Stars: ✭ 45 (-77.39%)
Mutual labels:  regex, regular-expressions
Regex
An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
Stars: ✭ 2,125 (+967.84%)
Mutual labels:  regex, regular-expressions
lc-data-intro
Library Carpentry: Introduction to Working with Data (Regular Expressions)
Stars: ✭ 16 (-91.96%)
Mutual labels:  regex, regular-expressions
Re Flex
The regex-centric, fast lexical analyzer generator for C++ with full Unicode support. Faster than Flex. Accepts Flex specifications. Generates reusable source code that is easy to understand. Introduces indent/dedent anchors, lazy quantifiers, functions for lex/syntax error reporting, and more. Seamlessly integrates with Bison and other parsers.
Stars: ✭ 274 (+37.69%)
Mutual labels:  regex, regular-expressions
unmatcher
Regular expressions reverser for Python
Stars: ✭ 26 (-86.93%)
Mutual labels:  regex, regular-expressions
Py regular expressions
Learn Python Regular Expressions step by step from beginner to advanced levels
Stars: ✭ 770 (+286.93%)
Mutual labels:  regex, regular-expressions
expressive-ts
A functional programming library designed to simplify building complex regular expressions
Stars: ✭ 78 (-60.8%)
Mutual labels:  regex, regular-expressions
tokenquery
TokenQuery (regular expressions over tokens)
Stars: ✭ 28 (-85.93%)
Mutual labels:  regex, regular-expressions
Ore
An R interface to the Onigmo regular expression library
Stars: ✭ 54 (-72.86%)
Mutual labels:  regex, regular-expressions
Nim Regex
Pure Nim regex engine. Guarantees linear time matching
Stars: ✭ 136 (-31.66%)
Mutual labels:  regex, regular-expressions

fancy-regex

A Rust library for compiling and matching regular expressions. It uses a hybrid regex implementation designed to support a relatively rich set of features. In particular, it uses backtracking to implement "fancy" features such as look-around and backreferences, which are not supported in purely NFA-based implementations (exemplified by RE2, and implemented in Rust in the regex crate).

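As a minimal usage sketch (assuming the crate's Regex::new and find API; the example pattern and input here are chosen purely for illustration), a backreference that the plain regex crate rejects compiles and matches with fancy-regex:

use fancy_regex::Regex;

fn main() {
    // \1 is a backreference: find a doubled word
    let re = Regex::new(r"\b(\w+) \1\b").unwrap();
    // matching returns a Result, since "fancy" matching can fail at run time
    let m = re.find("do not repeat repeat yourself").unwrap();
    assert_eq!(m.map(|m| m.as_str()), Some("repeat repeat"));
}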

A goal is to be as efficient as possible. For a given regex, the NFA implementation has asymptotic running time linear in the length of the input, while a backtracking implementation can blow up exponentially in the worst case. An example given in Static Analysis for Regular Expression Exponential Runtime via Substructural Logics is:

import re
re.compile('(a|b|ab)*bc').match('ab' * 28 + 'ac')

In Python (tested on both 2.7 and 3.5), this match takes 91s, and doubles for each additional repeat of 'ab'.
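
For comparison, here is a sketch of the same pattern run through fancy-regex. The pattern uses no fancy features, so it is delegated entirely to the NFA engine; the timing harness below is only illustrative:

use fancy_regex::Regex;
use std::time::Instant;

fn main() {
    let re = Regex::new("(a|b|ab)*bc").unwrap();
    let haystack = "ab".repeat(28) + "ac";
    let start = Instant::now();
    // is_match returns a Result, since fancy matching can fail at run time
    let matched = re.is_match(&haystack).unwrap();
    println!("matched = {} in {:?}", matched, start.elapsed());
}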

Thus, many advocate a purely NFA-based (nondeterministic finite automaton) approach. Even so, backreferences and look-around do add richness to regexes, and they are commonly used in applications such as syntax highlighting for text editors. In particular, TextMate's syntax definitions, based on the Oniguruma backtracking engine, are now used in a number of other popular editors, including Sublime Text and Atom. These syntax definitions routinely use backreferences and look-around. For example, the following regex captures a single-line Rust raw string:

r(#*)".*?"\1

There is no NFA that can express this simple and useful pattern. Yet, a backtracking implementation handles it efficiently.
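
A sketch of that pattern in use (the sample source line and the assertion are illustrative only):

use fancy_regex::Regex;

fn main() {
    // r, then N hashes, then a quoted body, then the same N hashes again
    let re = Regex::new(r#"r(#*)".*?"\1"#).unwrap();
    let source = r##"let s = r#"a "quoted" word"#;"##;
    let m = re.find(source).unwrap().unwrap();
    assert_eq!(m.as_str(), r##"r#"a "quoted" word"#"##);
}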

This package is one of the first to handle both cases well. The exponential blowup case above runs in 258ns. Thus, it should be a very appealing alternative for applications that require both richness and performance.

A warning about worst-case performance

NFA-based approaches give strong guarantees about worst-case performance. For regexes that contain "fancy" features such as backreferences and look-around, this module gives no corresponding guarantee. If an attacker can control the regular expressions that will be matched against, they will be able to successfully mount a denial-of-service attack. Be warned.

See PERFORMANCE.md for some examples.
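
One mitigation, assuming the crate version in use exposes a backtracking limit on its RegexBuilder (an assumption; check the crate docs for the exact option), is to cap the amount of backtracking so that a hostile pattern or input fails with an error instead of consuming unbounded time. A sketch:

use fancy_regex::RegexBuilder;

fn main() {
    // assumed API: backtrack_limit caps the number of backtracking steps
    let re = RegexBuilder::new(r#"r(#*)".*?"\1"#)
        .backtrack_limit(10_000)
        .build()
        .unwrap();
    match re.is_match(r##"r#"still fine"#"##) {
        Ok(found) => println!("found: {}", found),
        Err(err) => println!("gave up: {}", err),
    }
}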

A hybrid approach

One workable approach is to detect the presence of "fancy" features, and choose either an NFA implementation or a backtracker depending on whether they are used.

However, this module attempts to be more fine-grained. Instead, it implements a true hybrid approach. In essence, it is a backtracking VM (as well explained in Regular Expression Matching: the Virtual Machine Approach) in which one of the "instructions" in the VM delegates to an inner NFA implementation (in Rust, the regex crate, though a similar approach would certainly be possible using RE2 or the Go regexp package). Then there's an analysis which decides for each subexpression whether it is "hard", or can be delegated to the NFA matcher. At the moment, it is eager, and delegates as much as possible to the NFA engine.
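
The following enum is illustrative only (the names are invented for exposition and do not mirror fancy-regex's internals); it shows the shape of such a VM, with ordinary backtracking instructions plus a Delegate instruction that hands an entire easy subexpression to the NFA-based regex crate:

#[allow(dead_code)]
enum Insn {
    Char(char),             // match one literal character
    Split(usize, usize),    // try the first target, backtrack to the second
    Jmp(usize),             // unconditional jump
    Save(usize),            // record a capture-group boundary
    Backref(usize),         // match the text previously captured by a group
    Delegate(regex::Regex), // run an "easy" subexpression on the NFA engine
    End,                    // successful match
}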

Theory

(This section is written in a somewhat informal style; I hope to expand on it)

The fundamental idea is that it's a backtracking VM like PCRE, but as much as possible it delegates to an "inner" RE engine like RE2 (in this case, the Rust one). For the sublanguage not using fancy features, the library becomes a thin wrapper.

Otherwise, you do an analysis to figure out what you can delegate and what you have to backtrack. I was thinking it might be tricky, but it's actually quite simple. In the first phase, you just label each subexpression as "hard" (groups that get referenced in a backref, look-around, etc.), and bubble that up. You also do a little extra analysis, mostly determining whether an expression has constant match length, and the minimum length.
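
Here is a hypothetical sketch of that first phase over a toy expression tree (not the crate's real data structures): hardness, constant-size, and minimum-length information is computed at the leaves and bubbled up:

#[derive(Clone, Copy, Debug)]
struct Info {
    hard: bool,        // needs the backtracking VM
    const_size: bool,  // every match has the same length
    min_size: usize,   // minimum match length in bytes
}

#[allow(dead_code)]
enum Expr {
    Literal(String),
    Concat(Vec<Expr>),
    Alt(Vec<Expr>),
    Group(usize, Box<Expr>),
    Backref(usize),
    LookAhead(Box<Expr>),
}

fn analyze(e: &Expr, backref_targets: &[usize]) -> Info {
    match e {
        Expr::Literal(s) => Info { hard: false, const_size: true, min_size: s.len() },
        // a backreference always forces backtracking, and its length is unknown
        Expr::Backref(_) => Info { hard: true, const_size: false, min_size: 0 },
        // look-around is hard, but matches zero characters itself
        Expr::LookAhead(inner) => {
            analyze(inner, backref_targets);
            Info { hard: true, const_size: true, min_size: 0 }
        }
        // a group is hard if its body is hard or if a backref points at it
        Expr::Group(idx, inner) => {
            let i = analyze(inner, backref_targets);
            Info { hard: i.hard || backref_targets.contains(idx), ..i }
        }
        Expr::Concat(children) => children.iter().fold(
            Info { hard: false, const_size: true, min_size: 0 },
            |acc, c| {
                let i = analyze(c, backref_targets);
                Info {
                    hard: acc.hard || i.hard,
                    const_size: acc.const_size && i.const_size,
                    min_size: acc.min_size + i.min_size,
                }
            },
        ),
        Expr::Alt(children) => {
            let infos: Vec<Info> = children.iter().map(|c| analyze(c, backref_targets)).collect();
            let min_size = infos.iter().map(|i| i.min_size).min().unwrap_or(0);
            Info {
                hard: infos.iter().any(|i| i.hard),
                // constant only if every branch is constant and the same length
                const_size: infos.iter().all(|i| i.const_size && i.min_size == min_size),
                min_size,
            }
        }
    }
}

fn main() {
    // (a)\1 : the group is a backref target, so the whole expression is hard
    let e = Expr::Concat(vec![
        Expr::Group(1, Box::new(Expr::Literal("a".into()))),
        Expr::Backref(1),
    ]);
    println!("{:?}", analyze(&e, &[1]));
}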

The second phase is top down, and you carry a context, also a boolean indicating whether it's "hard" or not. Intuitively, a hard context is one in which the match length will affect future backtracking.

If the subexpression is easy and the context is easy, generate an instruction in the VM that delegates to the inner NFA implementation. Otherwise, generate VM code as in a backtracking engine. Most expression nodes are pretty straightforward; the only interesting case is concat (a sequence of subexpressions).

Even that one is not terribly complex. First, determine a prefix of easy nodes of constant match length (this won't affect backtracking, so safe to delegate to NFA). Then, if your context is easy, determine a suffix of easy nodes. Both of these delegate to NFA. For the ones in between, recursively compile. In an easy context, the last of these also gets an easy context; everything else is generated in a hard context. So, conceptually, hard context flows from right to left, and from parents to children.
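
A simplified, hypothetical sketch of that concat rule (again not the crate's real compiler; it only computes where the sequence splits rather than emitting instructions), using a stripped-down version of the Info struct from the previous sketch:

struct Info {
    hard: bool,       // needs the backtracking VM
    const_size: bool, // every match has the same length
}

// Returns (prefix_len, suffix_len): how many leading and trailing children can
// be delegated to the NFA engine. Everything in between is compiled for the
// backtracking VM, with only the last of those middle children inheriting an
// easy context when the surrounding context is easy.
fn split_concat(children: &[Info], easy_context: bool) -> (usize, usize) {
    // a prefix of easy, constant-size children has a fixed match length, so it
    // cannot affect later backtracking and is safe to delegate
    let prefix = children
        .iter()
        .take_while(|i| !i.hard && i.const_size)
        .count();
    // in an easy context, a trailing run of easy children can also be delegated
    let suffix = if easy_context {
        children[prefix..]
            .iter()
            .rev()
            .take_while(|i| !i.hard)
            .count()
    } else {
        0
    };
    (prefix, suffix)
}

fn main() {
    let easy_const = Info { hard: false, const_size: true };
    let easy_var = Info { hard: false, const_size: false };
    let hard = Info { hard: true, const_size: false };
    // e.g. a constant-size literal, then a backreference, then an easy tail
    let children = [easy_const, hard, easy_var];
    println!("{:?}", split_concat(&children, true)); // prints (1, 1)
}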

Current status

Still in development, though the basic ideas are in place. Currently, the following features are missing:

  • Replace methods such as replace, replacen, and replace_all
  • Procedure calls and recursive expressions

Acknowledgements

Many thanks to Andrew Gallant for stimulating conversations that inspired this approach, as well as for creating the excellent regex crate.

Authors

The main author is Raph Levien, with many contributions from Robin Stocker.

Contributions

We gladly accept contributions via GitHub pull requests. Please see CONTRIBUTING.md for more details.

This project started out as a Google 20% project, but none of the authors currently work at Google, so it has been forked to be community-maintained.
