All Projects → takawitter → Trie4j

takawitter / Trie4j

Licence: apache-2.0
PATRICIA, Double Array, LOUDS Trie implementations for Java

Programming Languages

java
68154 projects - #9 most used programming language

Labels

Projects that are alternatives of or similar to Trie4j

Bitcask
🔑A high performance Key/Value store written in Go with a predictable read/write performance and high throughput. Uses a Bitcask on-disk layout (LSM+WAL) similar to Riak.
Stars: ✭ 654 (+370.5%)
Mutual labels:  trie
Efrt
neato compression for key-value data
Stars: ✭ 58 (-58.27%)
Mutual labels:  trie
Data Structures
Data-Structures using C++.
Stars: ✭ 121 (-12.95%)
Mutual labels:  trie
Kua
⚡️ Lightning fast URL routing in Python (trie router)
Stars: ✭ 18 (-87.05%)
Mutual labels:  trie
Qp Trie Rs
An idiomatic and fast QP-trie implementation in pure Rust.
Stars: ✭ 47 (-66.19%)
Mutual labels:  trie
Flash
Golang Keyword extraction/replacement Datastructure using Tries instead of regexes
Stars: ✭ 79 (-43.17%)
Mutual labels:  trie
Hat Trie
C++ implementation of a fast and memory efficient HAT-trie
Stars: ✭ 565 (+306.47%)
Mutual labels:  trie
Java Ds Algorithms
Data Structures and Algorithms in Java
Stars: ✭ 125 (-10.07%)
Mutual labels:  trie
Thmap
Concurrent trie-hash map library
Stars: ✭ 51 (-63.31%)
Mutual labels:  trie
Buckets Swift
Swift Collection Data Structures Library
Stars: ✭ 106 (-23.74%)
Mutual labels:  trie
Autosuggest Trie
Minimalistic trie implementation for autosuggest and autocomplete components
Stars: ✭ 22 (-84.17%)
Mutual labels:  trie
Libgenerics
libgenerics is a minimalistic and generic library for C basic data structures.
Stars: ✭ 42 (-69.78%)
Mutual labels:  trie
Tongrams
A C++ library providing fast language model queries in compressed space.
Stars: ✭ 88 (-36.69%)
Mutual labels:  trie
Nano Sql
Universal database layer for the client, server & mobile devices. It's like Lego for databases.
Stars: ✭ 717 (+415.83%)
Mutual labels:  trie
Gse
Go efficient multilingual NLP and text segmentation; support english, chinese, japanese and other. Go 高性能多语言 NLP 和分词
Stars: ✭ 1,695 (+1119.42%)
Mutual labels:  trie
Pyahocorasick
Python module (C extension and plain python) implementing Aho-Corasick algorithm
Stars: ✭ 593 (+326.62%)
Mutual labels:  trie
Adaptive Radix Tree
A fast and space efficient Radix tree in Java
Stars: ✭ 76 (-45.32%)
Mutual labels:  trie
Slim
Surprisingly space efficient trie in Golang(11 bits/key; 100 ns/get).
Stars: ✭ 1,705 (+1126.62%)
Mutual labels:  trie
Trienet
.NET Implementations of Trie Data Structures for Substring Search, Auto-completion and Intelli-sense. Includes: patricia trie, suffix trie and a trie implementation using Ukkonen's algorithm.
Stars: ✭ 122 (-12.23%)
Mutual labels:  trie
Completely
Java autocomplete library.
Stars: ✭ 90 (-35.25%)
Mutual labels:  trie

Trie4J - various trie implementation for Java.

Trie4J is the sort of collection of various trie implementation.

You can get the binary using Maven:

<dependency>
    <groupId>com.github.takawitter</groupId>
    <artifactId>trie4j</artifactId>
    <version>0.9.8</version>
</dependency>

or from Central Repository

  • Coming release: nothing planned.
  • 2018/1/24: 0.9.8 released.
  • 2017/10/22: 0.9.7 released.
  • 2017/7/20: 0.9.6 released.
  • 2016/10/28: 0.9.5 released.
  • 2016/10/22: 0.9.4 released.
  • 2016/5/28: 0.9.3 released.

Performance comparison:

with 1.27 million words and 10.04 million chars contained in jawiki-20120220-all-titles-in-ns0.gz .
on MacOS X(10.7), Core i7 2.5GHz, Java 7 Update 21. 2013-5-14.

class notes build(ms) contains(ms) used heap(MB)
java.util.HashSet 404*1 363 126.2
java.util.TreeSet 416*1 235 125.9
PatriciaTrie(src) Simple PATRICIA Trie. 371*1 247 90.8
TailPatriciaTrie(src) SuffixTrieTailBuilder(src) PATRICIA Trie with tail string. 937*1 248 76.8
ConcatTailBuilder(src) 440*1 228 69.2
DoubleArray(src) Simple Double Array Trie. 362*2 98 52.6
TailDoubleArray(src) SuffixTrieTailBuilder(src) Double Array Trie with tail string. 3,111*2 181 28.9
ConcatTailBuilder(src) 2,532*2 154 33.6
TailLOUDSTrie(src) SuffixTrieTailBuilder(src) LOUDS Succinct Trie with tail string. 620*2 554 15.4
ConcatTailBuilder(src) 111*2 537 20.2
ConcatTailBuilder(src) with sbvTI*3 145*2 712 15.7
TailLOUDSPPTrie(src) SuffixTrieTailBuilder(src) LOUDS++ Succinct Trie with tail string. 654*2 571 15.4
ConcatTailBuilder(src) 119*2 552 20.1
ConcatTailBuilder(src) with sbvTI*3 163*2 741 15.6
*1 - build from string array.
*2 - build from other trie(org.trie4j.patricia.simple.PatriciaTrie).
*3 - shrinked index array using succinct bit vector.

Sample codes:

import org.trie4j.doublearray.DoubleArray;
import org.trie4j.louds.TailLOUDSTrie;
import org.trie4j.patricia.PatriciaTrie;

public class Sample {
	public static void main(String[] args) throws Exception{
		PatriciaTrie pat = new PatriciaTrie();
		pat.insert("Hello");
		pat.insert("World");
		pat.insert("Wonder");
		pat.insert("Wonderful!");
		pat.contains("Hello"); // -> true
		pat.predictiveSearch("Wo"); // -> {"Wonder", "Wonderful!", "World"} as Iterable<String>
		
		DoubleArray da = new DoubleArray(pat); // construct DoubleArray from existing Trie
		da.contains("World"); // -> true
		
		TailLOUDSTrie lt = new TailLOUDSTrie(pat); // construct LOUDS succinct Trie with ConcatTailBuilder(default)
		lt.contains("Wonderful!"); // -> true
		lt.commonPrefixSearch("Wonderful!"); // -> {"Wonder", "Wonderful!"} as Iterable<String>
	}
}

additional notes.

These classes are experimental and not contained in trie4j-SNAPSHOT.jar.

  • MultilayerPatriciaTrie: PatriciaTrie which has the pointer to another trie which store reversed tail string. (src)
  • optimizes size using Multilayer Trie but no significant improvement.
  • OptimizedTailDoubleArray: DoubleArray with Tail Array and some optimization (src)
  • feature completed but can't support large data (over several 10 thousants).

additional notes(ja).

2012年2月、1冊の本が発売されました。

"日本語入力を支える技術" 変わり続けるコンピュータと言葉の世界 (WEB+DB PRESS plus) 徳永 拓之 (著)

日本語入力を支える技術

多くのエンジニアがこの本に触発され、各種アルゴリズムの理解を深めたり、一から勉強を始めたり、 また中にはこれを機に様々なライブラリを実装し公開する人も出てきました。trie4jもそういったライブラリの一つで、各種トライ構造にターゲットを絞り、本書やその分野のブログなどを参考に実装されています。

ほとんどのクラスはシンプルな実装になっていますが、一部独自の最適化が入っています。また、各トライが提供するメソッドは、 極力中間オブジェクトを作らないようになっており、オブジェクト生成/破棄によるパフォーマンス低下を起こさないよう実装されています。

下記クラスは実験的実装で、trie4j-SNAPSHOT.jarには含まれません(src.kitchensinkにあります)。

  • 多層パトリシアトライ(MultilayerPatriciaTrie(src))

  • 多層トライの実験結果 - やた@はてな日記 を参考に、接尾辞を格納するトライを内包しサイズを最適化した実装です。また、子を持たないノード、子を一つだけ持つノード、それぞれの終端/非終端版と、様々な種類のノードを用意して 使い分けることで、極力無駄なメモリを使わないようにしています。但しパトリシアトライのままなので、あまり効率が上がっていません。

  • TAIL配列付き最適化ダブルアレイ(OptimizedTailDoubleArray(src))

    • 未使用領域の開放やcheck配列をshortにした。実装は完了していますが、大規模なデータ(数万レコード超)には対応できません。
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].