All Projects → hankcs → Ahocorasickdoublearraytrie

hankcs / Ahocorasickdoublearraytrie

An extremely fast implementation of Aho Corasick algorithm based on Double Array Trie.

Programming Languages

java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to Ahocorasickdoublearraytrie

Xseries
Library for cross-version Minecraft Bukkit support and various efficient API methods.
Stars: ✭ 109 (-84.32%)
Mutual labels:  algorithm, fast
Polysnap
A work in progress polygon operations library with integer snap-rounding
Stars: ✭ 14 (-97.99%)
Mutual labels:  algorithm, fast
Geokdbush
The fastest spatial index for geographic locations in JavaScript
Stars: ✭ 251 (-63.88%)
Mutual labels:  algorithm, fast
Delaunator
An incredibly fast JavaScript library for Delaunay triangulation of 2D points
Stars: ✭ 1,641 (+136.12%)
Mutual labels:  algorithm, fast
Kdbush
A fast static index for 2D points
Stars: ✭ 412 (-40.72%)
Mutual labels:  algorithm, fast
Fuzzysearch
🐷 Tiny and fast fuzzy search in Go
Stars: ✭ 602 (-13.38%)
Mutual labels:  algorithm
Learn Algorithms
算法学习笔记
Stars: ✭ 6,165 (+787.05%)
Mutual labels:  algorithm
Fast Android Networking
🚀 A Complete Fast Android Networking Library that also supports HTTP/2 🚀
Stars: ✭ 5,346 (+669.21%)
Mutual labels:  fast
Try
Dead simple CLI tool to try Python packages - It's never been easier! 📦
Stars: ✭ 588 (-15.4%)
Mutual labels:  fast
Agoo
A High Performance HTTP Server for Ruby
Stars: ✭ 679 (-2.3%)
Mutual labels:  fast
Atreugo
High performance and extensible micro web framework. Zero memory allocations in hot paths.
Stars: ✭ 661 (-4.89%)
Mutual labels:  fast
Algorithms
Data Structure Libraries and Algorithms implementation
Stars: ✭ 624 (-10.22%)
Mutual labels:  algorithm
Book on python algorithms and data structure
🪐 Book on Python, Algorithms, and Data Structures. 🪐
Stars: ✭ 604 (-13.09%)
Mutual labels:  algorithm
Goraph
Package goraph implements graph data structure and algorithms.
Stars: ✭ 635 (-8.63%)
Mutual labels:  algorithm
Diffabledatasources
💾 A library for backporting UITableView/UICollectionViewDiffableDataSource.
Stars: ✭ 601 (-13.53%)
Mutual labels:  algorithm
Dsa.js Data Structures Algorithms Javascript
🥞Data Structures and Algorithms explained and implemented in JavaScript + eBook
Stars: ✭ 6,251 (+799.42%)
Mutual labels:  algorithm
Data Structure And Algorithms With Es6
Data Structures and Algorithms using ES6
Stars: ✭ 594 (-14.53%)
Mutual labels:  algorithm
Nanomorph
🚅 - Hyper fast diffing algorithm for real DOM nodes
Stars: ✭ 621 (-10.65%)
Mutual labels:  algorithm
Depth clustering
🚕 Fast and robust clustering of point clouds generated with a Velodyne sensor.
Stars: ✭ 657 (-5.47%)
Mutual labels:  fast
Pbf
A low-level, lightweight protocol buffers implementation in JavaScript.
Stars: ✭ 618 (-11.08%)
Mutual labels:  fast

AhoCorasickDoubleArrayTrie

Maven Central GitHub release License

An extremely fast implementation of Aho Corasick algorithm based on Double Array Trie structure. Its speed is 5 to 9 times of naive implementations, perhaps it's the fastest implementation so far ;-)

Introduction

You may heard that Aho-Corasick algorithm is fast for parsing text with a huge dictionary, for example:

  • looking for certain words in texts in order to URL link or emphasize them
  • adding semantics to plain text
  • checking against a dictionary to see if syntactic errors were made

But most implementation use a TreeMap<Character, State> to store the goto structure, which costs O(lg(t)) time, t is the largest amount of a word's common prefixes. The final complexity is O(n * lg(t)), absolutely t > 2, so n * lg(t) > n. The others used a HashMap, which wasted too much memory, and still remained slowly.

I improved it by replacing the XXXMap to a Double Array Trie, whose time complexity is just O(1), thus we get a total complexity of exactly O(n), and take a perfect balance of time and memory. Yes, its speed is not related to the length or language or common prefix of the words of a dictionary.

This implementation has been widely used in my HanLP: Han Language Processing package. I hope it can serve as a common data structure library in projects handling text or NLP task.

Dependency

Include this dependency in your POM. Be sure to check for the latest version in Maven Central.

<dependency>
  <groupId>com.hankcs</groupId>
  <artifactId>aho-corasick-double-array-trie</artifactId>
  <version>1.2.2</version>
</dependency>

or include this dependency in your build.gradle.kts

implementation("com.hankcs:aho-corasick-double-array-trie:1.2.2")

Usage

Setting up the AhoCorasickDoubleArrayTrie is a piece of cake:

        // Collect test data set
        TreeMap<String, String> map = new TreeMap<String, String>();
        String[] keyArray = new String[]
                {
                        "hers",
                        "his",
                        "she",
                        "he"
                };
        for (String key : keyArray)
        {
            map.put(key, key);
        }
        // Build an AhoCorasickDoubleArrayTrie
        AhoCorasickDoubleArrayTrie<String> acdat = new AhoCorasickDoubleArrayTrie<String>();
        acdat.build(map);
        // Test it
        final String text = "uhers";
        List<AhoCorasickDoubleArrayTrie.Hit<String>> wordList = acdat.parseText(text);

Of course, there remains many useful methods to be discovered, feel free to try:

  • Use a Map<String, SomeObject> to assign a SomeObject as value to a keyword.
  • Store the AhoCorasickDoubleArrayTrie to disk by calling save method.
  • Restore the AhoCorasickDoubleArrayTrie from disk by calling load method.
  • Use it in concurrent code. AhoCorasickDoubleArrayTrie is thread safe after build method

In other situations you probably do not need a huge wordList, then please try this:

        acdat.parseText(text, new AhoCorasickDoubleArrayTrie.IHit<String>()
        {
            @Override
            public void hit(int begin, int end, String value)
            {
                System.out.printf("[%d:%d]=%s\n", begin, end, value);
            }
        });

or a lambda function

        acdat.parseText(text, (begin, end, value) -> {
            System.out.printf("[%d:%d]=%s\n", begin, end, value);
        });

Comparison

I compared my AhoCorasickDoubleArrayTrie with robert-bor's aho-corasick, ACDAT represents for AhoCorasickDoubleArrayTrie and Naive represents for aho-corasick, the result is :

Parsing English document which contains 3409283 characters, with a dictionary of 127142 words.
               	Naive          	ACDAT
time           	607            	102
char/s         	5616611.20     	33424343.14
rate           	1.00           	5.95
===========================================================================
Parsing Chinese document which contains 1290573 characters, with a dictionary of 146047 words.
               	Naive          	ACDAT
time           	319            	35
char/s         	2609156.74     	23780600.00
rate           	1.00           	9.11
===========================================================================

In English test, AhoCorasickDoubleArrayTrie is 5 times faster. When it comes to Chinese, AhoCorasickDoubleArrayTrie is 9 times faster. This test is conducted under i7 2.0GHz, -Xms512m -Xmx512m -Xmn256m. Feel free to re-run this test in TestAhoCorasickDoubleArrayTrie, the test data is ready for you.

Thanks

This project is inspired by aho-corasick and darts-clone-java. Many thanks!

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].