
ddelange / retrie

License: MIT
Efficient Trie-based regex unions for blacklist/whitelist filtering and one-pass mapping-based string replacing


retrie


retrie offers fast methods to match and replace (sequences of) strings based on efficient Trie-based regex unions.

Trie

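A simple regex union such as "(?:abc|abs|foo)" becomes slow for large sets of words, because the regex engine may have to try the alternatives one by one at each position. A Trie structure merges shared prefixes, so a more efficient regex pattern can be compiled from it: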

from retrie.trie import Trie


trie = Trie()

for term in ["abc", "foo", "abs"]:
    trie.add(term)
assert trie.pattern() == "(?:ab[cs]|foo)"  # equivalent to but faster than "(?:abc|abs|foo)"

trie.add("absolute")
assert trie.pattern() == "(?:ab(?:c|s(?:olute)?)|foo)"

trie.add("abx")
assert trie.pattern() == "(?:ab(?:[cx]|s(?:olute)?)|foo)"

trie.add("abxy")
assert trie.pattern() == "(?:ab(?:c|s(?:olute)?|xy?)|foo)"
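
To illustrate how a trie can be turned into such a pattern, here is a minimal standard-library sketch. It is not retrie's implementation: the function name trie_pattern and the nested-dict layout are assumptions made for this example. It happens to reproduce the four patterns shown above, but nothing guarantees identical output for other inputs.

```python
import re


def trie_pattern(words):
    """Return a regex matching exactly `words`, with shared prefixes merged."""
    # Nested-dict trie; the empty-string key marks the end of a word.
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[""] = True

    def to_regex(node):
        # A node that only holds the end marker adds nothing to the pattern.
        if "" in node and len(node) == 1:
            return None
        leaves, branches = [], []
        for char in sorted(key for key in node if key):
            sub = to_regex(node[char])
            if sub is None:
                leaves.append(re.escape(char))  # last character of a word
            else:
                branches.append(re.escape(char) + sub)
        only_leaves = not branches
        # Sibling leaves collapse into one character class, e.g. c and s -> [cs].
        if len(leaves) == 1:
            branches.insert(0, leaves[0])
        elif leaves:
            branches.insert(0, "[" + "".join(leaves) + "]")
        if len(branches) == 1:
            result = branches[0]
        else:
            result = "(?:" + "|".join(branches) + ")"
        if "" in node:
            # A word ends at this node, so whatever follows is optional.
            if only_leaves or len(branches) > 1:
                result += "?"  # already a single char, class or group
            else:
                result = "(?:" + result + ")?"
        return result

    return to_regex(root)


pattern = trie_pattern(["abc", "foo", "abs", "absolute"])
assert re.fullmatch(pattern, "absolute")
assert not re.fullmatch(pattern, "ab")
```

Sorting the keys keeps the output deterministic; placing the character class first is only a choice made here so the result lines up with the patterns above.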

Installation

This pure-Python, OS-independent package is available on PyPI:

$ pip install retrie

Usage

The following objects are all subclasses of retrie.retrie.Retrie, which handles filling the Trie and compiling the corresponding regex pattern.

Blacklist

The Blacklist object can be used to filter out bad occurrences in a text or a sequence of strings:

from retrie.retrie import Blacklist

# check out docstrings and methods
help(Blacklist)

blacklist = Blacklist(["abc", "foo", "abs"], match_substrings=False)
blacklist.compiled
# re.compile(r'(?<=\b)(?:ab[cs]|foo)(?=\b)', re.IGNORECASE|re.UNICODE)
assert not blacklist.is_blacklisted("a foobar")
assert tuple(blacklist.filter(("good", "abc", "foobar"))) == ("good", "foobar")
assert blacklist.cleanse_text("good abc foobar") == "good  foobar"

blacklist = Blacklist(["abc", "foo", "abs"], match_substrings=True)
blacklist.compiled
# re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert blacklist.is_blacklisted("a foobar")
assert tuple(blacklist.filter(("good", "abc", "foobar"))) == ("good",)
assert blacklist.cleanse_text("good abc foobar") == "good  bar"
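
The effect of match_substrings can be approximated with the standard re module alone. The sketch below is an illustration under assumptions, not the library's code: it uses a plain union instead of the Trie-based pattern, and the helper names is_listed, drop_listed and strip_listed are hypothetical.

```python
import re

terms = ["abc", "foo", "abs"]
# A plain union stands in for the Trie-based pattern; both match the same strings.
union = "(?:" + "|".join(re.escape(term) for term in terms) + ")"

# Roughly match_substrings=False: a term only counts between word boundaries.
whole_words = re.compile(r"\b" + union + r"\b", re.IGNORECASE)
# Roughly match_substrings=True: a term counts anywhere in the string.
anywhere = re.compile(union, re.IGNORECASE)


def is_listed(pattern, text):
    """True if any term occurs in the text."""
    return pattern.search(text) is not None


def drop_listed(pattern, sequence):
    """Keep only the items that contain no term."""
    return tuple(item for item in sequence if not pattern.search(item))


def strip_listed(pattern, text):
    """Delete every occurrence of a term from the text."""
    return pattern.sub("", text)


assert not is_listed(whole_words, "a foobar")
assert is_listed(anywhere, "a foobar")
```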

Whitelist

Similar methods are available for the Whitelist object:

from retrie.retrie import Whitelist

# check out docstrings and methods
help(Whitelist)

whitelist = Whitelist(["abc", "foo", "abs"], match_substrings=False)
whitelist.compiled
# re.compile(r'(?<=\b)(?:ab[cs]|foo)(?=\b)', re.IGNORECASE|re.UNICODE)
assert not whitelist.is_whitelisted("a foobar")
assert tuple(whitelist.filter(("bad", "abc", "foobar"))) == ("abc",)
assert whitelist.cleanse_text("bad abc foobar") == "abc"

whitelist = Whitelist(["abc", "foo", "abs"], match_substrings=True)
whitelist.compiled
# re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert whitelist.is_whitelisted("a foobar")
assert tuple(whitelist.filter(("bad", "abc", "foobar"))) == ("abc", "foobar")
assert whitelist.cleanse_text("bad abc foobar") == "abcfoo"
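
Judging from the assertions above, the whitelist works as the inverse: filtering keeps the items that contain a term, and cleansing keeps only the matched fragments. A hedged standard-library sketch of that behaviour follows; the helper names keep_listed and keep_matches are hypothetical, and a plain union again replaces the Trie-based pattern.

```python
import re

terms = ["abc", "foo", "abs"]
union = "(?:" + "|".join(re.escape(term) for term in terms) + ")"

whole_words = re.compile(r"\b" + union + r"\b", re.IGNORECASE)  # like match_substrings=False
anywhere = re.compile(union, re.IGNORECASE)  # like match_substrings=True


def keep_listed(pattern, sequence):
    """Keep only the items in which a term occurs."""
    return tuple(item for item in sequence if pattern.search(item))


def keep_matches(pattern, text):
    """Concatenate the matched fragments and discard everything else."""
    return "".join(match.group(0) for match in pattern.finditer(text))


assert keep_matches(whole_words, "bad abc foobar") == "abc"
assert keep_matches(anywhere, "bad abc foobar") == "abcfoo"
```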

Replacer

The Replacer object does a fast single-pass search & replace: each occurrence of a key in replacement_mapping is replaced with its corresponding value.

from retrie.retrie import Replacer

# check out docstrings and methods
help(Replacer)

replacement_mapping = dict(zip(["abc", "foo", "abs"], ["new1", "new2", "new3"]))

replacer = Replacer(replacement_mapping, match_substrings=True)
replacer.compiled
# re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert replacer.replace("ABS ...foo... foobar") == "new3 ...new2... new2bar"

replacer = Replacer(replacement_mapping, match_substrings=False)
replacer.compiled
# re.compile(r'\b(?:ab[cs]|foo)\b', re.IGNORECASE|re.UNICODE)
assert replacer.replace("ABS ...foo... foobar") == "new3 ...new2... foobar"

replacer = Replacer(replacement_mapping, match_substrings=False, re_flags=None)
replacer.compiled  # on py3, re.UNICODE is always enabled
# re.compile(r'\b(?:ab[cs]|foo)\b')
assert replacer.replace("ABS ...foo... foobar") == "ABS ...new2... foobar"

replacer = Replacer(replacement_mapping, match_substrings=False, word_boundary=" ")
replacer.compiled
# re.compile(r'(?<= )(?:ab[cs]|foo)(?= )', re.IGNORECASE|re.UNICODE)
assert replacer.replace(". ABS ...foo... foobar") == ". new3 ...foo... foobar"
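
The single-pass idea can be sketched with re.sub and a callback that looks up each match in the mapping. This is an illustration under assumptions rather than retrie's implementation: make_replacer and its parameters are hypothetical names, a plain union replaces the Trie-based pattern, and case-insensitive lookup relies on str.lower(), which is only dependable for ASCII keys.

```python
import re


def make_replacer(mapping, match_substrings=True, flags=re.IGNORECASE):
    """Return a function that replaces every key of `mapping` in one pass."""
    # Longest keys first, so a longer key wins over a key that is its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    union = "(?:" + "|".join(re.escape(key) for key in keys) + ")"
    if not match_substrings:
        union = r"\b" + union + r"\b"
    pattern = re.compile(union, flags)

    ignore_case = bool(flags & re.IGNORECASE)
    if ignore_case:
        lookup = {key.lower(): value for key, value in mapping.items()}
    else:
        lookup = dict(mapping)

    def replace(text):
        def substitute(match):
            found = match.group(0)
            key = found.lower() if ignore_case else found
            # Fall back to the matched text if case folding misses the key.
            return lookup.get(key, found)

        return pattern.sub(substitute, text)

    return replace


mapping = {"abc": "new1", "foo": "new2", "abs": "new3"}
assert make_replacer(mapping)("ABS ...foo... foobar") == "new3 ...new2... new2bar"

# Replaced text is never rescanned, so two keys can swap safely.
assert make_replacer({"a": "b", "b": "a"})("ab") == "ba"
```

Chaining str.replace calls would instead turn "ab" into "bb" or "aa", depending on the order; a single pass over one combined pattern avoids that.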

Development


Run make help for options like installing for development, linting and testing.
