All Projects → legalforce-research → daachorse

legalforce-research / daachorse

Licence: other
🐎 A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure.

Programming Languages

rust
11053 projects

Projects that are alternatives of or similar to daachorse

Aho Corasick
A fast implementation of Aho-Corasick in Rust.
Stars: ✭ 424 (+465.33%)
Mutual labels:  search, finite-state-machine, text-processing
BM
The Utility to Install Songs, Install Mods, Install/Update BMBF, Install HitSounds, download automatically made Playlists, get better support, switch between the modded and unmodded Version of Beat Saber, do full Backups and way more
Stars: ✭ 33 (-56%)
Mutual labels:  search
support-tickets-classification
This case study shows how to create a model for text analysis and classification and deploy it as a web service in Azure cloud in order to automatically classify support tickets. This project is a proof of concept made by Microsoft (Commercial Software Engineering team) in collaboration with Endava http://endava.com/en
Stars: ✭ 142 (+89.33%)
Mutual labels:  text-processing
typesense-go
Go client for Typesense: https://github.com/typesense/typesense
Stars: ✭ 61 (-18.67%)
Mutual labels:  search
food-help
A clone of popular food and business review web app yelp
Stars: ✭ 24 (-68%)
Mutual labels:  search
trembita
Model complex data transformation pipelines easily
Stars: ✭ 44 (-41.33%)
Mutual labels:  finite-state-machine
text
Qiniu Text Processing Libraries for Go
Stars: ✭ 25 (-66.67%)
Mutual labels:  text-processing
actus
A monorepo for a self learning command palette driven by a final state machine implemented in XState.
Stars: ✭ 43 (-42.67%)
Mutual labels:  finite-state-machine
Text-Analysis
Explaining textual analysis tools in Python. Including Preprocessing, Skip Gram (word2vec), and Topic Modelling.
Stars: ✭ 48 (-36%)
Mutual labels:  text-processing
v-oogle
Google.com, reVued. 🔎
Stars: ✭ 40 (-46.67%)
Mutual labels:  search
SimpleStateMachineLibrary
📚 A simple library for realization state machines in C# code
Stars: ✭ 30 (-60%)
Mutual labels:  finite-state-machine
ci4-album
🔥 CodeIgniter 4 example Album module uses Domain Driven Design Architecture with Tactical Pattern
Stars: ✭ 67 (-10.67%)
Mutual labels:  search
InteractiveCodeSearch.jl
Interactively search Julia code from terminal
Stars: ✭ 74 (-1.33%)
Mutual labels:  search
rgpipe
lesspipe for ripgrep for common new filetypes using few dependencies
Stars: ✭ 21 (-72%)
Mutual labels:  search
svelte-search
Accessible, customizable Svelte search component
Stars: ✭ 17 (-77.33%)
Mutual labels:  search
gnu-linux-shell-scripting
A foundation for GNU/Linux shell scripting
Stars: ✭ 23 (-69.33%)
Mutual labels:  text-processing
NLP-tools
Useful python NLP tools (evaluation, GUI interface, tokenization)
Stars: ✭ 39 (-48%)
Mutual labels:  text-processing
subst
Search and des... argh... replace in many files at once. Use regexp and power of Python to replace what you want.
Stars: ✭ 20 (-73.33%)
Mutual labels:  search
adif
用标准c语言开发的常用数据结构和算法基础库,作为应用程序开发接口基础库,为编写高性能程序提供便利,可极大地缩短软件项目的开发周期,提升工程开发效率,并确保软件系统运行的可靠性、稳定性。
Stars: ✭ 33 (-56%)
Mutual labels:  aho-corasick
svelte-algolia
Svelte plugin for keeping Algolia indices in sync with custom data fetching functions.
Stars: ✭ 17 (-77.33%)
Mutual labels:  search

🐎 daachorse: Double-Array Aho-Corasick

A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure.

Crates.io Documentation Build Status

Overview

Daachorse is a crate for fast multiple pattern matching using the Aho-Corasick algorithm, running in linear time over the length of the input text. For time- and memory-efficiency, the pattern match automaton is implemented using the compact double-array data structure. The data structure not only supports constant-time state-to-state traversal, but also represents each state in a compact space of only 12 bytes.

For example, compared to the NFA of the aho-corasick crate that is the most poplar Aho-Corasick implementation in Rust, Daachorse can perform pattern matching 3.0~5.1 times faster while consuming 45~55% smaller memory, when using a word dictionary of 675K patterns. Other experimental results can be found in Wiki.

Installation

To use daachorse, depend on it in your Cargo manifest:

# Cargo.toml

[dependencies]
daachorse = "0.4"

Requirements

To compile this crate, Rust 1.58 or higher is required.

Example usage

Daachorse contains some search options, ranging from basic matching with the Aho-Corasick algorithm to trickier matching. All of them will run very fast based on the double-array data structure and can be easily plugged into your application as shown below.

Finding overlapped occurrences

To search for all occurrences of registered patterns that allow for positional overlap in the input text, use find_overlapping_iter(). When you use new() for constraction, unique identifiers are assigned to each pattern in the input order. The match result has the byte positions of the occurrence and its identifier.

use daachorse::DoubleArrayAhoCorasick;

let patterns = vec!["bcd", "ab", "a"];
let pma = DoubleArrayAhoCorasick::new(patterns).unwrap();

let mut it = pma.find_overlapping_iter("abcd");

let m = it.next().unwrap();
assert_eq!((0, 1, 2), (m.start(), m.end(), m.value()));

let m = it.next().unwrap();
assert_eq!((0, 2, 1), (m.start(), m.end(), m.value()));

let m = it.next().unwrap();
assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));

assert_eq!(None, it.next());

Finding non-overlapped occurrences with shortest matching

If you do not want to allow positional overlap, use find_iter() instead. It reports the first pattern found in each iteration, which is the shortest pattern starting from each search position.

use daachorse::DoubleArrayAhoCorasick;

let patterns = vec!["bcd", "ab", "a"];
let pma = DoubleArrayAhoCorasick::new(patterns).unwrap();

let mut it = pma.find_iter("abcd");

let m = it.next().unwrap();
assert_eq!((0, 1, 2), (m.start(), m.end(), m.value()));

let m = it.next().unwrap();
assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));

assert_eq!(None, it.next());

Finding non-overlapped occurrences with longest matching

If you want to search for the longest pattern without positional overlap in each iteration, use leftmost_find_iter() with specifying MatchKind::LeftmostLongest in the construction.

use daachorse::{DoubleArrayAhoCorasickBuilder, MatchKind};

let patterns = vec!["ab", "a", "abcd"];
let pma = DoubleArrayAhoCorasickBuilder::new()
          .match_kind(MatchKind::LeftmostLongest)
          .build(&patterns)
          .unwrap();

let mut it = pma.leftmost_find_iter("abcd");

let m = it.next().unwrap();
assert_eq!((0, 4, 2), (m.start(), m.end(), m.value()));

assert_eq!(None, it.next());

Finding non-overlapped occurrences with leftmost-first matching

If you want to find the the earliest registered pattern among ones starting from the search position, use leftmost_find_iter() with specifying MatchKind::LeftmostFirst.

This is so-called the leftmost first match, a bit tricky search option that is also supported in the aho-corasick crate. For example, in the following code, ab is reported because it is the earliest registered one.

use daachorse::{DoubleArrayAhoCorasickBuilder, MatchKind};

let patterns = vec!["ab", "a", "abcd"];
let pma = DoubleArrayAhoCorasickBuilder::new()
          .match_kind(MatchKind::LeftmostFirst)
          .build(&patterns)
          .unwrap();

let mut it = pma.leftmost_find_iter("abcd");

let m = it.next().unwrap();
assert_eq!((0, 2, 0), (m.start(), m.end(), m.value()));

assert_eq!(None, it.next());

Associating arbitrary values with patterns

To build the automaton from pairs of a pattern and integer value instead of assigning identifiers automatically, use with_values().

use daachorse::DoubleArrayAhoCorasick;

let patvals = vec![("bcd", 0), ("ab", 10), ("a", 20)];
let pma = DoubleArrayAhoCorasick::with_values(patvals).unwrap();

let mut it = pma.find_overlapping_iter("abcd");

let m = it.next().unwrap();
assert_eq!((0, 1, 20), (m.start(), m.end(), m.value()));

let m = it.next().unwrap();
assert_eq!((0, 2, 10), (m.start(), m.end(), m.value()));

let m = it.next().unwrap();
assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));

assert_eq!(None, it.next());

Building faster automaton on multibyte characters

To build a faster automaton on multibyte characters, use CharwiseDoubleArrayAhoCorasick instead.

The standard version DoubleArrayAhoCorasick handles strings as UTF-8 sequences and defines transition labels using byte values. On the other hand, CharwiseDoubleArrayAhoCorasick uses code point values of Unicode, resulting in reducing the number of transitions and faster matching.

use daachorse::charwise::CharwiseDoubleArrayAhoCorasick;

let patterns = vec!["全世界", "世界", "に"];
let pma = CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap();

let mut it = pma.find_iter("全世界中に");

let m = it.next().unwrap();
assert_eq!((0, 9, 0), (m.start(), m.end(), m.value()));

let m = it.next().unwrap();
assert_eq!((12, 15, 2), (m.start(), m.end(), m.value()));

assert_eq!(None, it.next());

CLI

This repository contains a command line interface named daacfind for searching patterns in text files.

% cat ./pat.txt
fn
const fn
pub fn
unsafe fn
% find . -name "*.rs" | xargs cargo run --release -p daacfind -- --color=auto -nf ./pat.txt
...
...
./src/errors.rs:67:    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
./src/errors.rs:81:    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
./src/lib.rs:115:    fn default() -> Self {
./src/lib.rs:126:    pub fn base(&self) -> Option<u32> {
./src/lib.rs:131:    pub const fn check(&self) -> u8 {
./src/lib.rs:136:    pub const fn fail(&self) -> u32 {
...
...

Disclaimer

This software is developed by LegalForce, Inc., but not an officially supported LegalForce product.

License

Licensed under either of

at your option.

For softwares under bench/data, follow the license terms of each software.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].