Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

Created with love in Canada, visit hostnodejs.com today

Feel like to post an Ad? Learn Details

All Projects → fujimotos → polyleven

fujimotos / polyleven

Licence: MIT license

Fast Levenshtein Distance Library for Python 3

Programming Languages

50402 projects - #5 most used programming language

139335 projects - #7 most used programming language

30231 projects

Labels

levenshtein-distance

Projects that are alternatives of or similar to polyleven

The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity

Stars: ✭ 38 (+2.7%)

Mutual labels: levenshtein-distance

Fast approximate strings search & spelling correction

Stars: ✭ 52 (+40.54%)

Mutual labels: levenshtein-distance

spellchecker-wasm

SpellcheckerWasm is an extrememly fast spellchecker for WebAssembly based on SymSpell

Stars: ✭ 46 (+24.32%)

Mutual labels: levenshtein-distance

A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more..

Stars: ✭ 60 (+62.16%)

Mutual labels: levenshtein-distance

Making the quickest and most memory efficient implementation of Levenshtein Distance with SIMD and Threading support

Stars: ✭ 204 (+451.35%)

Mutual labels: levenshtein-distance

seqalign pathing

Rust implementation of sequence alignment / Levenshtein distance by A* acceleration of the DP algorithm

Stars: ✭ 17 (-54.05%)

Mutual labels: levenshtein-distance

📐 A Cython implementation of the affine gap string distance

Stars: ✭ 57 (+54.05%)

Mutual labels: levenshtein-distance

Levenshtein algorithm in C

Stars: ✭ 77 (+108.11%)

Mutual labels: levenshtein-distance

Java String Similarity

Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...

Stars: ✭ 2,403 (+6394.59%)

Mutual labels: levenshtein-distance

Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

Stars: ✭ 2,575 (+6859.46%)

Mutual labels: levenshtein-distance

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm

Stars: ✭ 1,976 (+5240.54%)

Mutual labels: levenshtein-distance

sentences-similarity-cluster

Calculate similarity of sentences & Cluster the result.

Stars: ✭ 14 (-62.16%)

Mutual labels: levenshtein-distance

Text2Text: Cross-lingual natural language processing and generation toolkit

Stars: ✭ 188 (+408.11%)

Mutual labels: levenshtein-distance

customized-symspell

Java port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm

Stars: ✭ 51 (+37.84%)

Mutual labels: levenshtein-distance

FastFuzzyStringMatcherDotNet

A BK tree implementation for fast fuzzy string matching

Stars: ✭ 23 (-37.84%)

Mutual labels: levenshtein-distance

Polyleven -- Fast Pythonic Levenshtein Library

Website:	https://ceptord.net/
Latest Release:	v0.7 (2021-02-23)
License:	MIT License

1. Introduction

polyleven is a Pythonic Levenshtein distance library that:

Is fast independent of input types, and hence can be used for both short (like English words) and long input types (like DNA sequences).
Can be used readily in a manner not covered by restrictive licenses such as GPL, hence can be used freely in private codes.
Supports Python 3.x.

2. How to install

The official package is available on PyPI:

$ pip install polyleven

3. How to use

Polyleven provides a single interface function levenshtein(). You can use this function to measure the similarity of two strings.

>>> from polyleven import levenshtein
>>> levenshtein('aaa', 'ccc')
3

If you only care about distances under a certain threshold, you can pass the max threshold to the third argument.

>>> levenshtein('acc', 'ccc', 1)
1
>>> levenshtein('aaa', 'ccc', 1)
2

In general, you can gain a noticeable speed boost with threshold k < 3.

4. Benchmark

4.1 English Words

To compare Polyleven with other Pythonic edit distance libraries, a million word pairs was generated from SCOWL.

Each library was measured how long it takes to evaluate all of these words. The following table summarises the result:

Function Name	TIME[sec]	SPEED[pairs/s]
edlib	4.763	208216
editdistance	1.943	510450
jellyfish.levenshtein_distance	0.722	1374081
distance.levenshtein	0.623	1591396
Levenshtein.distance	0.500	1982764
polyleven.levenshtein	0.431	2303420

4.2. Longer Inputs

To evaluate the efficiency for longer inputs, I created 5000 pairs of random strings of size 16, 32, 64, 128, 256, 512 and 1024.

Each library was measured how fast it can process these entries. [1]

Library	N=16	N=32	N=64	N=128	N=256	N=512	N=1024
edlib	0.040	0.063	0.094	0.205	0.432	0.908	2.089
editdistance	0.027	0.049	0.086	0.178	0.336	0.740	58.139
jellyfish	0.009	0.032	0.118	0.470	1.874	8.877	42.848
distance	0.007	0.029	0.109	0.431	1.726	6.950	27.998
Levenshtein	0.006	0.022	0.085	0.336	1.328	5.286	21.097
polyleven	0.003	0.005	0.010	0.043	0.149	0.550	2.109

3.3. List of Libraries

Library	Version	URL
edlib	v1.2.1	https://github.com/Martinsos/edlib
editdistance	v0.4	https://github.com/aflc/editdistance
jellyfish	v0.5.6	https://github.com/jamesturk/jellyfish
distance	v0.1.3	https://github.com/doukremt/distance
Levenshtein	v0.12	https://github.com/ztane/python-Levenshtein
polyleven	v0.3	https://github.com/fujimotos/polyleven

[1]	Measured using Python 3.5.3 on Debian Jessie with Intel Core i3-4010U (1.70GHz)

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].

Stars: ✭ 37

Visit Git Page 🔗Visit User Page 🔗Visit Issues Page (2) 🔗