All Projects → almondtools → stringbench

almondtools / stringbench

Licence: Apache-2.0 license
String matching algorithm benchmark

Programming Languages

java
68154 projects - #9 most used programming language
Batchfile
5799 projects

Projects that are alternatives of or similar to stringbench

multi string replace
A fast multiple string replace library for ruby. Uses a C implementation of the Aho–Corasick Algorithm based on https://github.com/morenice/ahocorasick while adding support for on the fly multiple string replacement. Faster alternative to String.gsub when dealing with non-regex (exact match) use cases
Stars: ✭ 16 (-48.39%)
Mutual labels:  string-matching, string-search
godot-size-benchmarks
Benchmarks to compare Godot binary sizes with different build-time options
Stars: ✭ 36 (+16.13%)
Mutual labels:  benchmark
scATAC-benchmarking
Benchmarking computational single cell ATAC-seq methods
Stars: ✭ 137 (+341.94%)
Mutual labels:  benchmark
Hetero-Mark
A Benchmark Suite for Heterogeneous System Computation
Stars: ✭ 41 (+32.26%)
Mutual labels:  benchmark
Edge-Detection-project
Tiny Image in Javascript - Edge Detection Algorithms
Stars: ✭ 27 (-12.9%)
Mutual labels:  benchmark
gtestx
A C++ benchmark extension for gtest
Stars: ✭ 19 (-38.71%)
Mutual labels:  benchmark
touchstone
Smart benchmarking of pull requests with statistical confidence
Stars: ✭ 33 (+6.45%)
Mutual labels:  benchmark
beda
Beda is a golang library for detecting how similar a two string
Stars: ✭ 34 (+9.68%)
Mutual labels:  string-matching
user-agent-parser-benchmarks
PHP User Agent Parser Benchmarks
Stars: ✭ 29 (-6.45%)
Mutual labels:  benchmark
bench
⏱️ Reliable performance measurement for Go programs. All in one design.
Stars: ✭ 33 (+6.45%)
Mutual labels:  benchmark
FastFuzzyStringMatcherDotNet
A BK tree implementation for fast fuzzy string matching
Stars: ✭ 23 (-25.81%)
Mutual labels:  string-matching
perf
Linux Perf subsystem bindings for Go
Stars: ✭ 19 (-38.71%)
Mutual labels:  benchmark
CellBench
R package for benchmarking single cell analysis methods
Stars: ✭ 21 (-32.26%)
Mutual labels:  benchmark
eCommerceSearchBench
E-commerce search benchmark is the first end-to-end application benchmark for e-commerce search system with personalized recommendations.This work is joint with Prof. Jianfeng Zhan (http://www.benchcouncil.org/zjf.html) 's team, who is also the chair of International Open Benchmark Council (BenchCouncil, http://www.benchcouncil.org/).
Stars: ✭ 29 (-6.45%)
Mutual labels:  benchmark
Visual-Tracking-Development
Recent Trackers
Stars: ✭ 93 (+200%)
Mutual labels:  benchmark
benchdiff
No description or website provided.
Stars: ✭ 41 (+32.26%)
Mutual labels:  benchmark
Language-Arena
C++ vs D vs Go benchmark
Stars: ✭ 19 (-38.71%)
Mutual labels:  benchmark
yjit-bench
Set of benchmarks for the YJIT CRuby JIT compiler
Stars: ✭ 38 (+22.58%)
Mutual labels:  benchmark
text-style-transfer-benchmark
Text style transfer benchmark
Stars: ✭ 56 (+80.65%)
Mutual labels:  benchmark
benchmark-http
No description or website provided.
Stars: ✭ 15 (-51.61%)
Mutual labels:  benchmark

StringBench

Build Status

This project compares different java string matching implementations with a performance test.

Latest Benchmarks

Benchmark quality

  • We use jmh benchmarking.
  • The benchmark is not (yet) tested on a clean machine
  • The benchmark has few iterations (making its results less significant)

Subject of this Benchmark

This benchmark is designed for following objectives:

Comparing different algorithms

It is quite useful to know which algorithm to use in which scenario.

Comparing different implementation

Yet a flawed implementation is easier identified if compared with other implementations (that deviate strongly in performance).

Minor deviations between the performance of different implementations of the same algorithm should be ignored

  • some are caused by the adaptor benchmark code
  • some frameworks use a sophisticated API and therefore do some more work

The comparison also tests the deviations of the algorithms if applied to different text sources. At this time the benchmark contains

  • Searching in a string (of type java.lang.String)
  • Searching in a text file (not necessarily converting it to String or byte[])

Some implementations strongly deviate in performance if compared in this way.

Recommendation

Use this benchmark only to select the algorithm of your choice and then select the implementation that is most suiting your requirements.

An Overview of libraries

Java API

Java provides two ways to search for strings:

Both algorithms are very stable, passing all benchmarks. The naive algorithm will not perform well on large texts/patterns. The boyer-moore is actively challenged by other implementations below.

StringSearchAlgorithms (SC)

My own library StringSearchAlgorithms provides many algorithms for single and multiple patterns along with some experimental features (e.g. regex search), providing the algorithms:

  • BNDM
  • BOM
  • Horspool
  • Knuth-Morris-Pratt
  • Sunday
  • Aho-Corasick-
  • Set Backward Oracle Matching
  • Set Horspool
  • Wu-Manber

I am continuously improving the design and trying to keep the test coverage near 100%. It is actively maintained and passes all tests of the benchmark.

ByteSeek (BS)

byteseek is a library for efficiently matching patterns of bytes and search for those patterns, providing the algorithms:

  • Horspool (and variants)
  • Sunday
  • Set Horspool (and variants)
  • Wu-Manber

byteseek provides a well designed API, is actively maintained and passes all tests of the benchmark for single patterns. For multiple patterns tests do not pass yet, but the maintainer works on it.

StringSearch (SS)

StringSearch is a popular string searching library for single pattern search (and also wildcard search) claiming to do high performance string search. It does not pass the benchmark tests and is therefore excluded from the benchmark. String search for simple patterns takes minutes where if String.indexOf takes few seconds. The maintainer refused to clarify this issue.

AhoCorasick (AC)

AhoCorasick is a popular one-algorithm library that implements the Aho-Corasick algorithm. It does not pass the benchmark tests and is therefore excluded from the benchmark. Yet waiting very long for a maintainer statement.

AhoCorasickDoubleArrayTrie (ACDA)

AhoCorasick is a very fast implementation of the Aho-Corasick algorithm. It passed the benchmark without modification. The memory consumption does not exceed the memory given by the incubation tests, yet we cannot say how it does compare to the other implementations.

Participating

If you want another framework participating in this benchmarks, meet following conditions:

  • the framework must be available in a public maven repository
  • provide a Benchmark implementation (extending SinglePatternMatcherBenchmark or MultiPatternMatcherBenchmark)
  • the benchmarked algorithms have to pass the tests without deviations

Interpretation of the results of 2015-11-22

  • Participating: SDK, SC, SS
  • Single Pattern
    • Pattern.compile/Matcher.find works fine for small alphabets and small patterns
    • SS BNDM is the best for small alphabets and longer patterns
    • SC Horspool is best for large alphabets and longer patterns
    • differences are minimal
  • Multi Pattern
    • SC AhoCorasick performs best for small patterns and small alphabets
    • SC WuManber performs best for long patterns with small alphabets
    • SC SetBackwardOracleMatching performs best for long patterns with middle sized alphabets
    • SC SetHorspool performs best for long patterns with large alphabets

Interpretation of the results of 2015-12-06

  • Participating: SDK, SC, SS, AC
  • (AC) AhoCorassic is remarkably fast with many patterns, but in this setup it is not reliable: The results for patterns/samples deviate from those returned by other algorithms (even the naive brute force ones). Therefore it will be excluded from the benchmark until the source of this deviation is clarified
  • It gets obvious that the benchmark setup is not fair: Patterns are generated as substrings from the sample, which ensures that at least one match could be found. Brute force algorithms based on backtracking will match this pattern very quick. In scenarios with large alphabets such brute force algorithms will probably backtrack only small distance. Real random data would probably penalize brute force algorithms compared to the current setup.
  • Single Pattern
    • (SDK) Simple String.indexOf dominates the region of small patterns with small alphabet
    • (SC) ShiftAnd performs good for small alphabet size and smaller patterns
    • (SC) BNDM performs good for small alphabet size and larger patterns
    • (SC) Horspool/(SC) Sunday/(SDK) Pattern.compile/Matcher.find cover the region of large alphabets and large patterns
  • Multi Pattern
    • (SDK) Simple String.indexOf dominates the region of few patterns of small pattern size
    • (SC) SetHorspool performs best for few patterns of large alphabet and size
    • (SC) AhoCorasick performs best for more patterns of small alphabet and size
    • (SC) WuManber performs best for more patterns of medium alphabet and size
    • (SC) SetBackwardOracleMatching performs best for more patterns with large alphabets

Interpretation of the results of 2015-12-13

  • Participating: SDK, SC, SS
  • The benchmark is yet not optimal. It does search all non overlapping matches in a text. This makes it easier for the naive search, because it can skip the search to the point after the last match. The other algorithms collect all matches and filter out the overlapping ones. Certainly this is a good reason to optimize the non-overlapping matching algorithms. Yet it is also a reason to optimize the benchmark that each algorithm is also tested for overlapping matches.
  • Single Pattern
    • (SDK) Simple String.indexOf dominates the region of small patterns with small alphabet
    • (SC) ShiftAnd performs good for small alphabet size and smaller patterns
    • (SC) BNDM performs good for small alphabet size and larger patterns
    • (SC) Horspool/(SC) Sunday/(SDK) Pattern.compile/Matcher.find cover the region of large alphabets and large patterns
  • Multi Pattern
    • (SDK) Simple String.indexOf dominates the region of few patterns of small pattern size
    • (SC) SetHorspool performs good for few patterns of large alphabet
    • (SC) AhoCorasick performs good for more patterns of small pattern size
    • (SC) WuManber performs good for more patterns of medium alphabet and size
    • (SC) SetBackwardOracleMatching performs best for more patterns with large pattern size

Interpretation of the results of 2016-08-13

  • Participating: SDK, SC, BS
  • Excluded: AC, SS (api problems)
  • The benchmark has a bias towards String.indexOf (is reported as issue)
  • The new benchmark byteseek looks very slow, but we found out that the main performance overhead was produced in the test setup
  • The main picture is for Single Pattern:
    • small patterns or long alphabets => SDK indexOf
    • small alphabets/small pattern size => SC ShiftAnd
    • small alphabets/medium pattern size => SC BNDM
    • small alphabets/large pattern size => SC BOM
    • the region between => Boyer-Moore variants (SC Sunday, SC Horspool, SDK Regex)
  • for Multiple Patterns:
    • small patterns and small alphabets = SC Aho-Corasick
    • few small patterns and large alphabets = SC Set Horspool
    • few medium patterns and medium alphabets = SC Wu-Manber
    • many large patterns and large alphabets = SC Set BOM

Interpretation of the results of 2016-10-31

  • Participating: SDK, SC, BS
  • Excluded: AC, SS, RA (api problems)
  • Byteseek is better optimized for files and streams, StringSearchAlgorithms is better for strings
  • Finding Single Patterns in strings:
    • small patterns or long alphabets => SDK indexOf
    • small alphabets/small pattern size => SC ShiftAnd
    • small alphabets/medium pattern size => SC BNDM
    • small alphabets/large pattern size => SC BOM
    • the region between => Boyer-Moore variants (SC Sunday (Relaxed), SC Horspool (Relaxed), SDK Regex)
  • Finding Single Patterns in files:
    • very small patterns and small alphabets => SDK indexOf
    • small alphabets/medium pattern size => SC BNDM
    • small alphabets/large pattern size => SC BOM
    • large alphabets => Boyer Moore variants (BS BoyerMooreHorspool, BS HorspoolfinalFlag, BS Sunday)
  • Finding Multiple Patterns (either in strings and files):
    • small patterns and small alphabets = SC Aho-Corasick
    • small patterns and large alphabets = SC Set Horspool (Relaxed)
    • few medium patterns and medium alphabets = SC Wu-Manber
    • many large patterns and large alphabets = SC Set BOM

Switching Visualization Software

We changed the visualization of best performing benchmarks. Yet we will only maintain the most current benchmarks. Older benchmarks will be available as csv, but not visualized.

Interpretation of the results of 2017-04-04

  • Participating: SDK, SC, BS, ACDA
  • Excluded: AC, SS, RA (api problems)
  • New Participants:
    • AhoCorasickDoubleArrayTrie (ACDA)
  • New Winners
    • AhoCorasickDoubleArrayTrie (ACDA) for large pattern number, small alphabet size, small pattern size

Interpretation of the results of 2019-01-05

  • Participating: SDK, SC, BS, ACDA
  • Excluded: AC, SS, RA (api problems)
  • The multi pattern algorithms were tested on a faster machine
  • New Winners
    • SetHorspool (SC) for large pattern number, large alphabet size, medium pattern size
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].