rockymadden / Stringmetric

🎯 String metrics and phonetic algorithms for Scala (e.g. Dice/Sorensen, Hamming, Jaccard, Jaro, Jaro-Winkler, Levenshtein, Metaphone, N-Gram, NYSIIS, Overlap, Ratcliff/Obershelp, Refined NYSIIS, Refined Soundex, Soundex, Weighted Levenshtein).

#stringmetric

String metrics and phonetic algorithms for Scala. The library provides facilities to perform approximate string matching, measurement of string similarity/distance, indexing by word pronunciation, and sounds-like comparisons. In addition to the core library, each metric and algorithm has a command line interface.

Metrics and algorithms

Depending upon

SBT:

libraryDependencies += "com.rockymadden.stringmetric" %% "stringmetric-core" % "0.27.4"

Gradle:

compile 'com.rockymadden.stringmetric:stringmetric-core_2.10:0.27.4'

Maven:

<dependency>
	<groupId>com.rockymadden.stringmetric</groupId>
	<artifactId>stringmetric-core_2.10</artifactId>
	<version>0.27.4</version>
</dependency>

Similarity package

Useful for approximate string matching and measurement of string distance. Most metrics calculate the similarity of two strings as a double between 0 and 1, where 0 means completely different and 1 means completely similar.


Dice / Sorensen Metric:

DiceSorensenMetric(1).compare("night", "nacht") // 0.6
DiceSorensenMetric(1).compare("context", "contact") // 0.7142857142857143

Note you must specify the size of the n-gram you wish to use.
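
To make the role of the n-gram size concrete, here is a minimal, self-contained sketch of a Dice coefficient over character n-grams. It is an illustration under stated assumptions, not the library's implementation: the object name DiceSketch and its helpers are hypothetical, and returning None for inputs too short to form an n-gram is a choice made for this sketch.

```scala
object DiceSketch {
  // Character n-grams of s, kept as a list so repeated n-grams are counted.
  def ngrams(s: String, n: Int): List[String] =
    if (n <= 0 || s.length < n) Nil else s.sliding(n).toList

  // Dice coefficient: 2 * |shared n-grams| / (|n-grams of a| + |n-grams of b|).
  // Returns None when either input is too short to form a single n-gram.
  def dice(a: String, b: String, n: Int): Option[Double] = {
    val ga = ngrams(a, n)
    val gb = ngrams(b, n)
    if (ga.isEmpty || gb.isEmpty) None
    else Some(2.0 * ga.intersect(gb).size / (ga.size + gb.size))
  }
}

// DiceSketch.dice("night", "nacht", 1) // Some(0.6): n, h, t are shared
```

With a size of 1 the comparison is over single characters; larger sizes compare overlapping substrings and are stricter about character order.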


Hamming Metric:

HammingMetric.compare("toned", "roses") // 3
HammingMetric.compare("1011101", "1001001") // 2

Note that this metric is an exception: it returns an integer rather than a double.
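
The integer result is a count of differing positions. A minimal sketch of that idea follows, assuming the distance is treated as undefined for strings of unequal length; HammingSketch is a hypothetical name and this is not the library's code.

```scala
object HammingSketch {
  // Count the positions at which two equal-length strings differ.
  // Returns None for unequal lengths, where the distance is undefined here.
  def hamming(a: String, b: String): Option[Int] =
    if (a.length != b.length) None
    else Some(a.zip(b).count { case (x, y) => x != y })
}

// HammingSketch.hamming("toned", "roses") // Some(3)
```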


Jaccard Metric:

JaccardMetric(1).compare("night", "nacht") // 0.3
JaccardMetric(1).compare("context", "contact") // 0.35714285714285715

Note you must specify the size of the n-gram you wish to use.


Jaro Metric:

JaroMetric.compare("dwayne", "duane") // 0.8222222222222223
JaroMetric.compare("jones", "johnson") // 0.7904761904761904
JaroMetric.compare("fvie", "ten") // 0.0

Jaro-Winkler Metric:

JaroWinklerMetric.compare("dwayne", "duane") // 0.8400000000000001
JaroWinklerMetric.compare("jones", "johnson") // 0.8323809523809523
JaroWinklerMetric.compare("fvie", "ten") // 0.0

Levenshtein Metric:

LevenshteinMetric.compare("sitting", "kitten") // 3
LevenshteinMetric.compare("cake", "drake") // 2

Note that this metric is an exception: it returns an integer rather than a double.
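
The integer here is the minimum number of single-character insertions, deletions, and substitutions. The sketch below shows the usual two-row dynamic programming formulation; LevenshteinSketch is a hypothetical name, and the library may well compute the value differently.

```scala
object LevenshteinSketch {
  def levenshtein(a: String, b: String): Int = {
    // prev(j) is the distance between the prefix of a handled so far and b.take(j).
    var prev = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i // turning a.take(i) into the empty string takes i deletions
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(
          math.min(prev(j) + 1, curr(j - 1) + 1), // deletion, insertion
          prev(j - 1) + cost                      // match or substitution
        )
      }
      prev = curr
    }
    prev(b.length)
  }
}

// LevenshteinSketch.levenshtein("sitting", "kitten") // 3
```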


N-Gram Metric:

NGramMetric(1).compare("night", "nacht") // 0.6
NGramMetric(2).compare("night", "nacht") // 0.25
NGramMetric(2).compare("context", "contact") // 0.5

Note you must specify the size of the n-gram you wish to use.


Overlap Metric:

OverlapMetric(1).compare("night", "nacht") // 0.6
OverlapMetric(1).compare("context", "contact") // 0.7142857142857143

Note you must specify the size of the n-gram you wish to use.


Ratcliff/Obershelp Metric:

RatcliffObershelpMetric.compare("aleksander", "alexandre") // 0.7368421052631579
RatcliffObershelpMetric.compare("pennsylvania", "pencilvaneya") // 0.6666666666666666

Weighted Levenshtein Metric:

WeightedLevenshteinMetric(10, 0.1, 1).compare("book", "back") // 2
WeightedLevenshteinMetric(10, 0.1, 1).compare("hosp", "hospital") // 0.4
WeightedLevenshteinMetric(10, 0.1, 1).compare("hospital", "hosp") // 40

Note you must specify the weight of each operation, in this order: delete, insert, substitute. While a double is returned, it can fall outside the range of 0 to 1, depending on the weights used.
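
The asymmetry between the last two examples follows from the weights: turning "hosp" into "hospital" takes four cheap inserts, while the reverse takes four expensive deletes. A minimal sketch of a weighted edit distance that behaves this way is shown below. WeightedLevenshteinSketch and its parameter names are hypothetical, and the code only assumes the delete, insert, substitute ordering described above.

```scala
object WeightedLevenshteinSketch {
  // Cost of turning a into b, with a separate weight per operation.
  def distance(delete: Double, insert: Double, substitute: Double)(a: String, b: String): Double = {
    // d(i)(j) is the cheapest way to turn the first i chars of a into the first j chars of b.
    val d = Array.ofDim[Double](a.length + 1, b.length + 1)
    for (i <- 1 to a.length) d(i)(0) = i * delete
    for (j <- 1 to b.length) d(0)(j) = j * insert
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val swap = if (a(i - 1) == b(j - 1)) 0.0 else substitute
      d(i)(j) = math.min(
        math.min(d(i - 1)(j) + delete, d(i)(j - 1) + insert),
        d(i - 1)(j - 1) + swap
      )
    }
    d(a.length)(b.length)
  }
}

// WeightedLevenshteinSketch.distance(10, 0.1, 1)("hospital", "hosp") // 40.0: four deletes
```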


Phonetic package

Useful for indexing by word pronunciation and performing sounds-like comparisons. All metrics return a boolean value indicating whether the two strings sound the same, per the algorithm used. Every metric has an algorithm counterpart, which provides the means to perform indexing by word pronunciation.


Metaphone Metric:

MetaphoneMetric.compare("merci", "mercy") // true
MetaphoneMetric.compare("dumb", "gum") // false

Metaphone Algorithm:

MetaphoneAlgorithm.compute("dumb") // tm
MetaphoneAlgorithm.compute("knuth") // n0

NYSIIS Metric:

NysiisMetric.compare("ham", "hum") // true
NysiisMetric.compare("dumb", "gum") // false

NYSIIS Algorithm:

NysiisAlgorithm.compute("macintosh") // mcant
NysiisAlgorithm.compute("knuth") // nnat

Refined NYSIIS Metric:

RefinedNysiisMetric.compare("ham", "hum") // true
RefinedNysiisMetric.compare("dumb", "gum") // false

Refined NYSIIS Algorithm:

RefinedNysiisAlgorithm.compute("macintosh") // mcantas
RefinedNysiisAlgorithm.compute("westerlund") // wastarlad

Refined Soundex Metric:

RefinedSoundexMetric.compare("robert", "rupert") // true
RefinedSoundexMetric.compare("robert", "rubin") // false

Refined Soundex Algorithm:

RefinedSoundexAlgorithm.compute("hairs") // h093
RefinedSoundexAlgorithm.compute("lambert") // l7081096

Soundex Metric:

SoundexMetric.compare("robert", "rupert") // true
SoundexMetric.compare("robert", "rubin") // false

Soundex Algorithm:

SoundexAlgorithm.compute("rupert") // r163
SoundexAlgorithm.compute("lukasiewicz") // l222
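
The relationship between an algorithm and its metric, described at the start of this section, can be sketched as follows: the algorithm maps a word to a code, and the metric reports whether two codes are equal. The example uses a plain American Soundex written for this illustration. SoundexSketch is a hypothetical name, and edge-case handling (non-letters are dropped, empty input yields None) is an assumption rather than the library's documented behavior.

```scala
object SoundexSketch {
  // Consonant groups and their Soundex digits; vowels, h, w, and y have no digit.
  private val groups = Seq(
    "bfpv" -> '1', "cgjkqsxz" -> '2', "dt" -> '3', "l" -> '4', "mn" -> '5', "r" -> '6'
  )

  private def code(c: Char): Option[Char] =
    groups.collectFirst { case (letters, digit) if letters.indexOf(c) >= 0 => digit }

  // Algorithm: first letter, then up to three digits, padded with zeros.
  def compute(word: String): Option[String] = {
    val letters = word.toLowerCase.filter(c => c >= 'a' && c <= 'z')
    if (letters.isEmpty) None
    else {
      val sb = new StringBuilder
      sb.append(letters.head)
      var last = code(letters.head)
      for (c <- letters.tail if c != 'h' && c != 'w') { // h and w do not split a run
        val current = code(c) // None for vowels, which do split a run
        current match {
          case Some(digit) if current != last => sb.append(digit)
          case _ => // repeated digit or vowel: nothing to emit
        }
        last = current
      }
      Some((sb.toString + "000").take(4))
    }
  }

  // Metric: two words sound alike when their codes match.
  def compare(a: String, b: String): Option[Boolean] =
    for (x <- compute(a); y <- compute(b)) yield x == y
}

// SoundexSketch.compute("rupert")          // Some("r163")
// SoundexSketch.compare("robert", "rubin") // Some(false)
```

Storing the computed code alongside each record is what makes indexing by pronunciation possible: lookups compare codes rather than raw strings.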

Convenience objects

StringAlgorithm:

StringAlgorithm.computeWithMetaphone("abcdef")
StringAlgorithm.computeWithNysiis("abcdef")

StringMetric:

StringMetric.compareWithJaccard(1)("abcdef", "abcxyz")
StringMetric.compareWithJaroWinkler("abcdef", "abcxyz")

Decorating

It is possible to decorate algorithms and metrics with additional functionality, which you can mix and match. Decorations include:

  • withMemoization: Computations and comparisons are cached. Future calls made with identical arguments will be looked up, rather than computed.

  • withTransform: Transform arguments prior to computation/comparison. A handful of pre-built transforms are located in the transform module.


Non-decorated:

MetaphoneAlgorithm.compute("abcdef")
MetaphoneMetric.compare("abcdef", "abcxyz")

Using memoization:

(MetaphoneAlgorithm withMemoization).compute("abcdef")

Using a transform so that we only examine alphabetical characters:

(MetaphoneAlgorithm withTransform filterAlpha).compute("abcdef")
(MetaphoneMetric withTransform filterAlpha).compare("abcdef", "abcxyz")

Using a functionally composed transform so that we only examine alphabetical characters and ignore case:

val composedTransform = (filterAlpha andThen ignoreAlphaCase)

(MetaphoneAlgorithm withTransform composedTransform).compute("abcdef")
(MetaphoneMetric withTransform composedTransform).compare("abcdef", "abcxyz")

Making your own transform:

val myTransform: StringTransform = (ca) => ca.filter(_ == 'x')

(MetaphoneAlgorithm withTransform myTransform).compute("abcdef")
(MetaphoneMetric withTransform myTransform).compare("abcdef", "abcxyz")

Using memoization and a transform:

((MetaphoneAlgorithm withMemoization) withTransform filterAlpha).compute("abcdef")
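
For readers curious how such decorations compose, the sketch below models them as plain function wrappers. It is a conceptual illustration only: DecorateSketch, keepLetters, and lowerCase are hypothetical names, and the Transform type is an assumption based on the custom transform example above, which filters a sequence of characters.

```scala
import scala.collection.mutable

object DecorateSketch {
  // Assumed shape of a transform: characters in, characters out.
  type Transform = Array[Char] => Array[Char]

  // Hypothetical stand-ins for pre-built transforms.
  val keepLetters: Transform = _.filter(_.isLetter)
  val lowerCase: Transform = _.map(_.toLower)

  // Wrap f so that its argument is transformed before f sees it.
  def withTransform[B](f: String => B, t: Transform): String => B =
    s => f(new String(t(s.toCharArray)))

  // Wrap f so that repeated arguments are looked up instead of recomputed.
  def withMemoization[B](f: String => B): String => B = {
    val cache = mutable.Map.empty[String, B]
    s => cache.getOrElseUpdate(s, f(s))
  }
}
```

Because each wrapper returns an ordinary function, they can be stacked in any order, for example `withMemoization(withTransform(f, keepLetters andThen lowerCase))`, where `f` is any `String => B` function.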

Building the CLIs

$ git clone https://github.com/rockymadden/stringmetric.git
$ cd stringmetric
$ sbt clean package
$ ./project/build.sh
$ ./target/cli/jarometric abc xyz

Using the CLIs

Get help:

$ metaphonemetric --help
Compares two strings to determine if they are phonetically similarly, per the Metaphone algorithm.

Syntax:
  metaphonemetric [Options] string1 string2...

Options:
  -h, --help
    Outputs description, syntax, and options.

Get comparison value with metrics:

$ jarowinklermetric dog dawg
0.75

Get representation value with phonetic algorithms:

$ metaphonealgorithm dog
tk

License

The MIT License (MIT)

Copyright (c) 2013 Rocky Madden (http://rockymadden.com/)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.