All Projects → mganss → AhoCorasick

mganss / AhoCorasick

Licence: MIT License
Aho-Corasick multi-string search for .NET and SQL Server.

Programming Languages

C#
18002 projects

Projects that are alternatives of or similar to AhoCorasick

multi string replace
A fast multiple string replace library for ruby. Uses a C implementation of the Aho–Corasick Algorithm based on https://github.com/morenice/ahocorasick while adding support for on the fly multiple string replacement. Faster alternative to String.gsub when dealing with non-regex (exact match) use cases
Stars: ✭ 16 (-58.97%)
Mutual labels:  string-search
dba-database
Database containing DBA helper code and open source software.
Stars: ✭ 79 (+102.56%)
Mutual labels:  sql-server
TODO-List-Tech-Module
TODO List (in C#, Java, JS and PHP) - Exam Preparation for the Tech Module @ SoftUni (August 2017)
Stars: ✭ 13 (-66.67%)
Mutual labels:  sql-server
sql server
Development repository for the sql_server cookbook
Stars: ✭ 60 (+53.85%)
Mutual labels:  sql-server
PdoOne
It simplifies the use of PDO in PHP by adding three methods: a simple wrapper between PDO, a query builder and an ORM.
Stars: ✭ 93 (+138.46%)
Mutual labels:  sql-server
CEDS-IDS
The CEDS Integrated Data Store factors the entities and attributes of the CEDS Domain Entity Schema (DES) with standard technical syntax and 3rd normal form database normalization. The IDS Logical Model provides a standard framework for integration of P-20 data systems through a well-normalized “operational data store”. In a P-20 data system, th…
Stars: ✭ 29 (-25.64%)
Mutual labels:  sql-server
StoreCleanArchitecture-NET
This is a basic project to demonstrate an introduction about the implementation of Clean Architecture on .NET
Stars: ✭ 19 (-51.28%)
Mutual labels:  sql-server
RCM
RCM is a simple CRM application designed for Auto Parts Store made with ASP.NET Core based on DDD, CQRS and SOLID Principles.
Stars: ✭ 29 (-25.64%)
Mutual labels:  sql-server
OrdersManagementSystem
Project demonstrates usage of Prism composition library, Material design library, SQL Server, Entity Framework in WPF application
Stars: ✭ 29 (-25.64%)
Mutual labels:  sql-server
DIMS
🐟 数据库系统原理课程设计,Drug Inventory Management System,基于 SSM 框架的医院药品库存管理系统。
Stars: ✭ 49 (+25.64%)
Mutual labels:  sql-server
algorithm-bootcamp-sql
Teaching materials for Algorithm Bootcamp: Database Systems.
Stars: ✭ 18 (-53.85%)
Mutual labels:  sql-server
YelpDatasetSQL
Working with the Yelp Dataset in Azure SQL and SQL Server
Stars: ✭ 16 (-58.97%)
Mutual labels:  sql-server
docker-why
Quick example of using SQL Server and .NET Core on Linux, loading data using bash
Stars: ✭ 41 (+5.13%)
Mutual labels:  sql-server
aho-corasick-node
A Node implementation of the Aho-Corasick string matching algorithm based on DoubleArray Trie.
Stars: ✭ 16 (-58.97%)
Mutual labels:  aho-corasick
SQL-nightmare
SQL SERVER Exploitation.
Stars: ✭ 27 (-30.77%)
Mutual labels:  sql-server
N-Tier-Architecture
This is a n-layer architecture based on Common web application architectures.
Stars: ✭ 105 (+169.23%)
Mutual labels:  sql-server
SQLDBA-SSMS-Solution
This respository contains TSQL/PowerShell Scripts to resolve issues of SQL Servers
Stars: ✭ 21 (-46.15%)
Mutual labels:  sql-server
php aho corasick
Aho-Corasick string search algorithm PHP extension implementation.
Stars: ✭ 45 (+15.38%)
Mutual labels:  aho-corasick
ahocorapy
Pure python Aho-Corasick library.
Stars: ✭ 163 (+317.95%)
Mutual labels:  aho-corasick
sp CRUDGen
sp_CRUDGen is a free open-source SQL Server stored procedure that generates stored procedures for you based on your tables and metadata like foreign keys and data types. The generated stored procedure code utilizes the SQL Server community best practices.
Stars: ✭ 37 (-5.13%)
Mutual labels:  sql-server

AhoCorasick

Version Build status Coverage Status netstandard2.0 net40

This is an implementation of the Aho-Corasick string matching algorithm for .NET (netstandard2.0 and net40) and SQL Server (SQL CLR). Mostly ported from xudejian/aho-corasick in CoffeeScript.

Usage

var ac = new AhoCorasick("a", "ab", "bab", "bc", "bca", "c", "caa");
var results = ac.Search("abccab").ToList();

Assert.AreEqual(0, results[0].Index); // index into the searched text
Assert.AreEqual("a", results[0].Word); // matched word
// ...

or

var results = "abccab".Contains("a", "ab", "bab", "bc", "bca", "c", "caa").ToList();

Custom char comparison

You can optionally supply an IEqualityComparer<char> to perform custom char comparisons when searching for substrings. Several implementations with comparers that mirror StringComparer are included.

var results = "AbCcab".Contains(CharComparer.OrdinalIgnoreCase, "a", "ab", "c").ToList();

SQL CLR Functions

There are also several SQL CLR user defined functions that can be used to perform fast substring matching in Microsoft SQL Server. To use this:

  1. Make sure you have enabled CLR integration
  2. Execute AhoCorasick.SqlClr_Create.sql

For one-off queries, you can use the functions that rebuild the trie on each query, e.g.

select top(100) * from Posts P
where dbo.ContainsWords((select Word from Words for xml raw, root('root')), P.Body, 'o') = 1

The words to match are always supplied as XML where the values are taken from the first attribute of all elements directly beneath the root node. Be careful to select the word column as the only or first column otherwise you'll end up matching the wrong words. The XML in the example above looks like this:

<root>
  <row Word="Aachen" />
  <row Word="Aaliyah" />
  <row Word="aardvark" />
  ...
</root>

Here's more about FOR XML.

The last parameter in the function indicates the culture to use since there is no way to use SQL Server collations in SQL CLR code. Values can be:

Value Character comparison
c Current Culture
n Invariant Culture
o or Empty Ordinal
Culture name, e.g. "de-de" Specific .NET Culture

The culture identifier can be suffixed by :i indicating case-insensitive matching.

Static objects

The function in the example above has the problem that the trie is rebuilt for each query even though the input always stays the same. To overcome this problem, there are a number of functions to manage the creation and destruction of static objects whose handles can be saved in SQL variables. Example:

declare @ac nvarchar(32);
set @ac = dbo.CreateAhoCorasick((select Word from Words for xml raw, root('root')), 'en-us:i');
select * from Posts P
where dbo.ContainsWordsByObject(P.Body, @ac) = 1;

This is a lot faster than the first example because the trie is created only once and then reused for each row in the query. The handle (@ac) is a hash value generated from the words to match and the culture. The corresponding object is saved in a static dictionary. You can list the currently active objects using dbo.ListAhoCorasick(), remove all objects using dbo.ClearAhoCorasick() or remove only one object using dbo.DeleteAhoCorasick(@ac).

Getting all matches

The examples above only checked if the words occurred in the queried texts. If you want to get the matched words and the indexes where they occur in the queried texts you can use the supplied table-valued functions. For example:

declare @ac nvarchar(32);
set @ac = dbo.CreateAhoCorasick((select Word from Words for xml raw, root('root')), 'o');
select top(100) * from Posts P
cross apply dbo.ContainsWordsTableByObject(P.Body, @ac) W

This will return a table such as this:

ID Body Index Word
1 What factors related... 5 factor
1 What factors related... 6 actor
1 What factors related... 5 factors
...

Word boundaries

There are also functions that return only matches occuring at word boundaries: dbo.ContainsWordsBoundedByObject() and dbo.ContainsWordsBoundedTableByObject(). Word boundaries here are the same as \b in regexes, i.e. matches will occur as if words were specified as \bword\b.

Forcing parallelism

Although these kinds of queries lend themselves very well to parallel execution, SQL Server tends to overestimate the cost of parallel queries and builds non-parallel plans most of the time where user defined functions are involved. You can force a parallel plan by using a trace flag (more about this here):

declare @ac nvarchar(32);
set @ac = dbo.CreateAhoCorasick((select Word from Words for xml raw, root('root')), 'en-us:i');
select * from Posts P
where dbo.ContainsWordsBoundedByObject(P.Body, @ac) = 1
OPTION (RECOMPILE, QUERYTRACEON 8649)

Parallel operators are identified by a yellow badge with two arrows in the query plan.

Performance

Here's a benchmark searching for ~5000 words (average length 7) in ~250,000 texts (average length ~900):

SQL AhoCorasick
560s 7s

The SQL query used was this:

select * from Posts P
where exists (select * from Words W where CHARINDEX(W.Word, P.Text) > 0)

But I can simply use full-text search

No. The CONTAINS predicate can only search for a single literal or variable at a time. You can't use it in a join or subquery to search for a column value of a table in the query, i.e. this won't work:

select * from Posts P
where exists (select * from Words W where CONTAINS(P.Text, W.Word))

If you know of a way to make this work using FTS (perhaps using a cursor?) let me know.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].