All Projects → abusix → ahocorapy

abusix / ahocorapy

Licence: MIT License
Pure python Aho-Corasick library.

Programming Languages

python
139335 projects - #7 most used programming language
shell
77523 projects

Projects that are alternatives of or similar to ahocorapy

rdkit-pypi
⚛️ RDKit Python Wheels on PyPi. 💻 pip install rdkit-pypi
Stars: ✭ 62 (-61.96%)
Mutual labels:  pypi
text-normalizer
Normalize text string
Stars: ✭ 12 (-92.64%)
Mutual labels:  pypi
django-admin-page-lock
Page Lock for Django Admin allows developers to implement customizable locking pages.
Stars: ✭ 13 (-92.02%)
Mutual labels:  pypi
maloss
Towards Measuring Supply Chain Attacks on Package Managers for Interpreted Languages
Stars: ✭ 46 (-71.78%)
Mutual labels:  pypi
company-ansible
Ansible keywords completion for Emacs
Stars: ✭ 21 (-87.12%)
Mutual labels:  keyword
openrefine-client
The OpenRefine Python Client from Paul Makepeace provides a library for communicating with an OpenRefine server. This fork extends the command line interface (CLI) and is distributed as a convenient one-file-executable (Windows, Linux, Mac). It is also available via Docker Hub, PyPI and Binder.
Stars: ✭ 67 (-58.9%)
Mutual labels:  pypi
ssrf-agent
make http(s) request to prevent SSRF
Stars: ✭ 16 (-90.18%)
Mutual labels:  lookup
mongo
Light-weight utilities and declarative schema (mutable mapping) to augment, not replace the Python MongoDB driver.
Stars: ✭ 18 (-88.96%)
Mutual labels:  pypi
proxpi
PyPI caching mirror
Stars: ✭ 19 (-88.34%)
Mutual labels:  pypi
cookiecutter-pypackage
A cookiecutter template for Python package with heavy use of Github actions
Stars: ✭ 19 (-88.34%)
Mutual labels:  pypi
aho-corasick-node
A Node implementation of the Aho-Corasick string matching algorithm based on DoubleArray Trie.
Stars: ✭ 16 (-90.18%)
Mutual labels:  aho-corasick
querytool
Querytool is an OSINT framework based on Google Spreadsheets. With this tool you can perform complex search of terms, people, email addresses, files and many more.
Stars: ✭ 104 (-36.2%)
Mutual labels:  lookup
intrepid
Intrepyd Model Checker
Stars: ✭ 14 (-91.41%)
Mutual labels:  pypi
aceso
Python package to calculate 2SFCA and other measures of spatial accessibility
Stars: ✭ 20 (-87.73%)
Mutual labels:  pypi
starcli
✨ Browse trending GitHub projects from your command line
Stars: ✭ 436 (+167.48%)
Mutual labels:  pypi
flytekit
Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started and learn and highly extensible.
Stars: ✭ 82 (-49.69%)
Mutual labels:  pypi
pipsalabim
An assistant to guess your pip dependencies from your code, without using a requirements file.
Stars: ✭ 15 (-90.8%)
Mutual labels:  pypi
contour
Modern C++ Terminal Emulator
Stars: ✭ 761 (+366.87%)
Mutual labels:  unicode-support
imgur-scraper
Retrieve years of imgur.com's data without any authentication.
Stars: ✭ 26 (-84.05%)
Mutual labels:  pypi
craft-text-detector
Packaged, Pytorch-based, easy to use, cross-platform version of the CRAFT text detector
Stars: ✭ 151 (-7.36%)
Mutual labels:  pypi

Test Test Coverage Downloads PyPi Version PyPi License PyPi Versions PyPi Wheel

ahocorapy - Fast Many-Keyword Search in Pure Python

ahocorapy is a pure python implementation of the Aho-Corasick Algorithm. Given a list of keywords one can check if at least one of the keywords exist in a given text in linear time.

Comparison:

Why another Aho-Corasick implementation?

We started working on this in the beginning of 2016. Our requirements included unicode support combined with python2.7. That was impossible with C-extension based libraries (like pyahocorasick). Pure python libraries were very slow or unusable due to memory explosion. Since then another pure python library was released py-aho-corasick. The repository also contains some discussion about different implementations. There is also acora, but it includes the note ('current construction algorithm is not suitable for really large sets of keywords') which really was the case the last time I tested, because RAM ran out quickly.

Differences

  • Compared to pyahocorasick our library supports unicode in python 2.7 just like py-aho-corasick. We don't use any C-Extension so the library is not platform dependant.

  • On top of the standard Aho-Corasick longest suffix search, we also perform a shortcutting routine in the end, so that our lookup is fast while, the setup takes longer. During set up we go through the states and directly add transitions that are "offered" by the longest suffix or their longest suffixes. This leads to faster lookup times, because in the end we only have to follow simple transitions and don't have to perform any additional suffix lookup. It also leads to a bigger memory footprint, because the number of transitions is higher, because they are all included explicitely and not implicitely hidden by suffix pointers.

  • We added a small tool that helps you visualize the resulting graph. This may help understanding the algorithm, if you'd like. See below.

  • Fully pickleable (pythons built-in de-/serialization). ahocorapy uses a non-recursive custom implementation for de-/serialization so that even huge keyword trees can be pickled.

Performance

I compared the two libraries mentioned above with ahocorapy. We used 50,000 keywords long list and an input text of 34,199 characters. In the text only one keyword of the list is contained. The setup process was run once per library and the search process was run 100 times. The following results are in seconds (not averaged for the lookup).

You can perform this test yourself using python tests/ahocorapy_performance_test.py. (Except for the pyahocorasick_py results. These were taken by importing the pure python version of the code of pyahocorasick. It's not available through pypi as stated in the code.)

I also added measurements for the pure python libraries with run with pypy.

These are the results:

Library (Variant) Setup (1x) Search (100x)
ahocorapy* 0.32s 0.36s
ahocorapy (run with pypy)* 0.36s 0.10s
pyahocorasick* 0.03s 0.04s
pyahocorasick (run with pypy)* 0.08s 0.04s
pyahocorasick (pure python variant in github repo)** 0.50s 1.68s
py_aho_corasick* 0.77s 6.02s
py_aho_corasick (run with pypy)* 0.72s 2.11s

As expected the C-Extension shatters the pure python implementations. Even though there is probably still room for optimization in ahocorapy we are not going to get to the mark that pyahocorasick sets. ahocorapy's lookups are faster than py_aho_corasick. When run with pypy ahocorapy is almost as fast as pyahocorasick, at least when it comes to searching. The setup overhead is higher due to the suffix shortcutting mechanism used.

* Specs

Dell XPS 15 7590
CPU: Intel i9-9980HK (16) @ 5.000GHz
CPython: 3.8.2
pypy: PyPy 7.3.1 with GCC 7.3.1 20180303
Date tested: 2020-08-28

** Old measurement with different specs

Basic Usage:

Installation

pip install ahocorapy

Creation of the Search Tree

from ahocorapy.keywordtree import KeywordTree
kwtree = KeywordTree(case_insensitive=True)
kwtree.add('malaga')
kwtree.add('lacrosse')
kwtree.add('mallorca')
kwtree.add('mallorca bella')
kwtree.add('orca')
kwtree.finalize()

Searching

result = kwtree.search('My favorite islands are malaga and sylt.')
print(result)

Prints :

('malaga', 24)

The search_all method returns a generator for all keywords found, or None if there is none.

results = kwtree.search_all('malheur on mallorca bellacrosse')
for result in results:
    print(result)

Prints :

('mallorca', 11)
('orca', 15)
('mallorca bella', 11)
('lacrosse', 23)

Thread Safety

The construction of the tree is currently NOT thread safe. That means adding shouldn't be called multiple times concurrently. Behavior is undefined.

After finalize is called you can use the search functionality on the same tree from multiple threads at the same time. So that part is thread safe.

Drawing Graph

You can print the underlying graph with the Visualizer class. This feature requires a working pygraphviz library installed.

from ahocorapy_visualizer.visualizer import Visualizer
visualizer = Visualizer()
visualizer.draw('readme_example.png', kwtree)

The resulting .png of the graph looks like this:

graph for kwtree

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].