All Projects → ph4r05 → php_aho_corasick

ph4r05 / php_aho_corasick

Licence: other
Aho-Corasick string search algorithm PHP extension implementation.

Programming Languages

c
50402 projects - #5 most used programming language
PHP
23972 projects - #3 most used programming language
C++
36643 projects - #6 most used programming language
shell
77523 projects
Dockerfile
14818 projects
M4
1887 projects

Projects that are alternatives of or similar to php aho corasick

skeleton-php-ext
Skeleton project for PHP extension (written in C)
Stars: ✭ 24 (-46.67%)
Mutual labels:  php-extension, pecl
Toolgood.words
一款高性能敏感词(非法词/脏字)检测过滤组件,附带繁体简体互换,支持全角半角互换,汉字转拼音,模糊搜索等功能。
Stars: ✭ 2,785 (+6088.89%)
Mutual labels:  aho-corasick, string-matching
kaliningraph
🕸️ Graphs, finite fields and discrete dynamical systems in Kotlin
Stars: ✭ 62 (+37.78%)
Mutual labels:  automata
computation-py
Python implementation for Understanding Computation book.
Stars: ✭ 22 (-51.11%)
Mutual labels:  automata
beda
Beda is a golang library for detecting how similar a two string
Stars: ✭ 34 (-24.44%)
Mutual labels:  string-matching
AALpy
An Active Automata Learning Library Written in Python
Stars: ✭ 60 (+33.33%)
Mutual labels:  automata
NTUA-slp-nlp
💻Speech and Natural Language Processing (SLP & NLP) Lab Assignments for ECE NTUA
Stars: ✭ 19 (-57.78%)
Mutual labels:  automata
TeamReference
Team reference for Competitive Programming. Algorithms implementations very used in the ACM-ICPC contests. Latex template to build your own team reference.
Stars: ✭ 29 (-35.56%)
Mutual labels:  string-matching
swoole
FastD Swoole 基础组件
Stars: ✭ 76 (+68.89%)
Mutual labels:  pecl
php-handlebars
PHP bindings for handlebars.c
Stars: ✭ 23 (-48.89%)
Mutual labels:  pecl
stance
Learned string similarity for entity names using optimal transport.
Stars: ✭ 27 (-40%)
Mutual labels:  string-matching
ext-pq
PostgreSQL client library (libpq) binding
Stars: ✭ 33 (-26.67%)
Mutual labels:  pecl
Dev-Tools-Magento-2-Module
A collection of utilities meant to improve the experience of developing modules for Magento without breaking existing functionality.
Stars: ✭ 18 (-60%)
Mutual labels:  php-extension
phphll
HyperLogLog for PHP implemented as a C extension
Stars: ✭ 19 (-57.78%)
Mutual labels:  php-extension
php-rar
PECL rar extension
Stars: ✭ 35 (-22.22%)
Mutual labels:  php-extension
aho-corasick-node
A Node implementation of the Aho-Corasick string matching algorithm based on DoubleArray Trie.
Stars: ✭ 16 (-64.44%)
Mutual labels:  aho-corasick
FormaleSysteme
Unterlagen zur Vorlesung "Formale Systeme", Fakultät Informatik, TU Dresden
Stars: ✭ 31 (-31.11%)
Mutual labels:  automata
pharext
Distribute your PHP extension as self-installing phar executable
Stars: ✭ 57 (+26.67%)
Mutual labels:  pecl
stringbench
String matching algorithm benchmark
Stars: ✭ 31 (-31.11%)
Mutual labels:  string-matching
ahocorapy
Pure python Aho-Corasick library.
Stars: ✭ 163 (+262.22%)
Mutual labels:  aho-corasick

php_aho_corasick

Build Status Coverity Status

PHP extension implementing Aho-Corasick pattern matching algorithm (more on wiki).

Is especially effective if there is a large database of needles (=strings to be searched, for example virus signatures). Another advantage is that built search structure is initialized before search in separate call thus it can be called more times with different haystack, saving time.

Computing Aho-Corasick in th native code (PHP extension) rather than in a pure PHP manner gives this implementation significant performance boost.

Dependencies

This project is simple PHP wrapper of (or interface to) another project: MultiFast. Sources include MultiFast library v 2.0. No extra dependencies are required. MultiFast library is wrapped as PHP extension loadable to PHP.

Source of inspiration for this project was a great tutorial.

Compatible with PHP 5.3+ and PHP 7.0+.

PECL & Licensing

The original project MultiFast is licensed under LGPLv3 so this PHP wrapper is also licensed under LGPLv3. Thanks to the author of the MultiFast, Kamiar Kanani, who gave me a permission to license the code under PHP License 3.01 for the purpose of adding this extension to PECL repository.

https://pecl.php.net/package/ahocorasick

Pecl installation:

pecl install channel://pecl.php.net/ahocorasick-0.0.7

Note the php-dev (or php-devel, depends on your distribution) is required for pecl package to compile.

Build

phpize
./configure --enable-ahocorasick
make

Docker build

$ docker build -t="ahoc" .
$ docker run -i -t ahoc
$ php -d extension=modules/ahocorasick.so -f examples/test.php

Install debugging tools, remote debugging

$ docker build -t="ahoc" --build-arg DEVEL_TOOLS=1 .
$ docker run -i --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t ahoc
$ docker run -i --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --mount type=bind,src=`pwd`,dst=/aho -t ahoc

Recompiling, or cache busting:

$ docker build -t="ahoc" --build-arg DIR_BUSTER=`date`.

Usage

This extension is case sensitive, thus if you want case insensitive, convert every string input to this algorithm to lowercase (use mb_strtolower() for example).

For more usage examples, see provided testing examples.

examples/test.php:

$data = array(
  	array('key'=>'ab', 'value'=>'alfa'),
		array('key'=>'ac', 'value'=>'beta'),
		array('key'=>'ad', 'value'=>'gamma', 'aux'=>array(1)),
		array('key'=>'ae', 'value'=>'delta'),
		array('id'=>0, 'value'=>'zeta'),
		array('key'=>'ag', 'value'=>'omega'),
		array('value'=>'lfa')
	     );

// initialize search , returns resourceID for search structure
$c = ahocorasick_init($data);

// perform search
$d1 = ahocorasick_match("alFABETA gamma zetaomegaalfa!", $c);

// deinitialize search structure (will free memory)
ahocorasick_deinit($c);

var_dump($d1);

Call with:

php -d extension=modules/ahocorasick.so -f examples/test.php

Results with:

array(5) {
  [0]=>
  array(5) {
    ["pos"]=>
    int(14)
    ["key"]=>
    string(2) "ad"
    ["aux"]=>
    array(1) {
      [0]=>
      int(1)
    }
    ["start_postion"]=>
    int(9)
    ["value"]=>
    string(5) "gamma"
  }
  [1]=>
  array(4) {
    ["pos"]=>
    int(19)
    ["keyIdx"]=>
    int(0)
    ["start_postion"]=>
    int(15)
    ["value"]=>
    string(4) "zeta"
  }
  [2]=>
  array(4) {
    ["pos"]=>
    int(24)
    ["key"]=>
    string(2) "ag"
    ["start_postion"]=>
    int(19)
    ["value"]=>
    string(5) "omega"
  }
  [3]=>
  array(4) {
    ["pos"]=>
    int(28)
    ["key"]=>
    string(2) "ab"
    ["start_postion"]=>
    int(24)
    ["value"]=>
    string(4) "alfa"
  }
  [4]=>
  array(3) {
    ["pos"]=>
    int(28)
    ["start_postion"]=>
    int(25)
    ["value"]=>
    string(3) "lfa"
  }
}

Benchmark

In this repo you can find examples/benchmark.php file, with this you can perform your own benchmark and measure speed up.

My setup generates random haystacks and needles from alphabet="abcdef". There is performed 5 measurements of time spent by search and average is computed. Search structure construction is conted to time measurements.

Script generates:

  • 256 random haystacks of size 8192 characters
  • 2048 needles with 16 characters.

Principle:

  • Naive approach simply iterates over haystacks and needles, search is performed with strpos().
  • Aho-Corasick approach constructs search structure, then all haystacks are searched for needles.

Results:

$> php -d extension=modules/ahocorasick.so -f examples/benchmark.php
Classic search; sampleCount: 10; keySize: 2048; timeAvg: 13.060877
AhoCorasick search; sampleCount: 10; keySize: 2048; timeAvg: 0.174326 s, totalTime: 1.743264 s, memory increase: 272 B
AhoCorasick pattern matching is 74.921962 times faster than naive approach

Speedup: 74x compared to the naive approach.

API

Documentation writing is in progress.

Basic ideas of the API:

  • AhoCorasick pattern matching engine has to be initialized (ahocorasick_init()) before use and deinitialized (ahocorasick_deinit()) after use so memory is handled properly.
  • Engine has to be fed with pattern matching rules, given as array of rules, either to initialization function (ahocorasick_init()) or later (ahocorasick_add_patterns()).
  • After engine is finalized (ahocorasick_finalize()) or a first matching is performed (ahocorasick_match()) no further patterns are allowed, as underlying searching trie is finalized.
  • When matching finishes, it returns array of matched results. Each entry determines position of the found occurrence and pattern that was matched.
  • Modifications made during the php 7 migration: ** 'value' is the default key when adding patterns. ** 'start_postion' field added to the results. The original algorithm returns the end position of the matched patterns.

Rules:

  • Simplest pattern looks like:
array('lorem')
  • Pattern can be identified, so it is easier to process result from match call. Either by string
array('key'=>'ae', 'value'=>'delta')

or integer

array('id'=>0, 'value'=>'zeta')
  • Pattern can carry an arbitrary object
array('key'=>'ad', 'value'=>'gamma', 'aux'=>array(1))

Development

# Create package for distribution
pear package

OSX Mojave

Since Mojave the PHP module needs to be codesigned in order to be loaded to the process.

https://developer.apple.com/library/archive/technotes/tn2206/_index.html

Donating

This implementation is an open source. If you like the code or you do find it useful please feel free to donate to the author whatever amount you would like by clicking on the paypal button below. And if you don't feel like donating, that's OK too.

Bitcoin:

1Gh3TC55L4FjCyS2y5WKc4EGMYBYa6qvDw
1Gh3TC55L4FjCyS2y5WKc4EGMYBYa6qvDw

Monero:

89bF7TFrhdyczkz6JmUzXx57yQa3fb28tbyT8nXLpj3bVFfQEE6cpjxec1gAJVSWHBBG7ex2XBj7u6BLrgKBaEmuSzWgdcn
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].