
hyperjumptech / beda

Licence: other
Beda is a Golang library for detecting how similar two strings are

Programming Languages

go
31211 projects - #10 most used programming language
Makefile
30231 projects

Projects that are alternatives of or similar to beda

stance
Learned string similarity for entity names using optimal transport.
Stars: ✭ 27 (-20.59%)
Mutual labels:  string-distance, string-matching, string-similarity
strutil
Golang metrics for calculating string similarity and other string utility functions
Stars: ✭ 114 (+235.29%)
Mutual labels:  string-distance, string-matching, string-similarity
Levenshtein
The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity
Stars: ✭ 38 (+11.76%)
Mutual labels:  string-matching, string-similarity
strsim
string similarity based on Dice's coefficient in go
Stars: ✭ 39 (+14.71%)
Mutual labels:  string-matching, string-similarity
fuzzywuzzy
Fuzzy string matching for PHP
Stars: ✭ 60 (+76.47%)
Mutual labels:  string-distance, string-matching
simplematch
Minimal, super readable string pattern matching for python.
Stars: ✭ 147 (+332.35%)
Mutual labels:  string-matching
vmo
Python Modules of Variable Markov Oracle
Stars: ✭ 23 (-32.35%)
Mutual labels:  string-matching
Quickenshtein
Making the quickest and most memory efficient implementation of Levenshtein Distance with SIMD and Threading support
Stars: ✭ 204 (+500%)
Mutual labels:  string-distance
vbml
Way to check, match and resist.
Stars: ✭ 27 (-20.59%)
Mutual labels:  string-matching
FastFuzzyStringMatcherDotNet
A BK tree implementation for fast fuzzy string matching
Stars: ✭ 23 (-32.35%)
Mutual labels:  string-matching
TeamReference
Team reference for Competitive Programming. Algorithms implementations very used in the ACM-ICPC contests. Latex template to build your own team reference.
Stars: ✭ 29 (-14.71%)
Mutual labels:  string-matching
hyperdiff
Find common, removed and added element between two collections.
Stars: ✭ 14 (-58.82%)
Mutual labels:  difference
node-red-contrib-string
Provides a string manipulation node with a chainable UI based on the concise and lightweight stringjs.com.
Stars: ✭ 15 (-55.88%)
Mutual labels:  string-matching
string-similarity-js
Lightweight string similarity function for javascript
Stars: ✭ 29 (-14.71%)
Mutual labels:  string-similarity
effcee
Effcee is a C++ library for stateful pattern matching of strings, inspired by LLVM's FileCheck
Stars: ✭ 76 (+123.53%)
Mutual labels:  string-matching
algos
A collection of algorithms in rust
Stars: ✭ 16 (-52.94%)
Mutual labels:  string-matching
speech-recognition-evaluation
Evaluate results from ASR/Speech-to-Text quickly
Stars: ✭ 25 (-26.47%)
Mutual labels:  difference
fuzzy-match
Library and command line utility to do approximate string matching of a source against a bitext index and get matched source and target.
Stars: ✭ 31 (-8.82%)
Mutual labels:  string-matching
AnyDiff
A CSharp (C#) diff library that allows you to diff two objects and get a list of the differences back.
Stars: ✭ 80 (+135.29%)
Mutual labels:  difference
seqalign
Collection of sequence alignment algorithms.
Stars: ✭ 20 (-41.18%)
Mutual labels:  string-distance

BEDA


Get BEDA

go get github.com/hyperjumptech/beda

Introduction

BEDA is a Golang library for detecting differences or similarities between two words or strings. Sometimes you want to detect whether a string is "the same as" or "somehow similar to" another string. Suppose your system needs to detect when a user puts a bad word in their username, or to forbid users from using unwanted words in their posts. You would need to implement a not-so-easy algorithm to do this.

BEDA contains implementations of algorithms for detecting word differences. They are:

  1. Levenshtein Distance : A string metric for measuring the difference between two sequences. Wikipedia
  2. Trigram or n-gram : A contiguous sequence of n items from a given sample of text or speech. Wikipedia
  3. Jaro & Jaro Winkler Distance : A string metric measuring an edit distance between two sequences. Wikipedia

BEDA is an Indonesian word for "different".

Usage

package main

import (
	"fmt"

	"github.com/hyperjumptech/beda"
)

func main() {
	sd := beda.NewStringDiff("The First String", "The Second String")
	lDist := sd.LevenshteinDistance()     // int: number of edits
	tDiff := sd.TrigramCompare()          // float32: closer to 1 means more similar
	jDiff := sd.JaroDistance()            // float32: closer to 1 means more similar
	jwDiff := sd.JaroWinklerDistance(0.1) // float32: 0.1 is the prefix scaling factor p

	fmt.Printf("Levenshtein Distance is %d \n", lDist)
	fmt.Printf("Trigram Compare is %f \n", tDiff)
	fmt.Printf("Jaro Distance is %f \n", jDiff)
	fmt.Printf("Jaro Winkler Distance is %f \n", jwDiff)
}

Algorithms and APIs

String comparison is not so easy. There are several algorithms for it, and each of them yields a different result, so one may suit a given purpose better than another.

To understand which algorithm best fits your string comparison needs, and when to use it, please read "String similarity algorithms compared". Read it through; it will help you a lot.

type StringDiff struct {
	S1 string
	S2 string
}

Levenshtein Distance

LevenshteinDistance is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other, so the result is a non-negative integer. The result is sensitive to string length: longer strings can accumulate larger distances, which makes raw values hard to compare across pairs of different lengths.

Reading :

API :

func LevenshteinDistance(s1, s2 string) int
func (sd *StringDiff) LevenshteinDistance() int

s1 is the first string to compare
s2 is the second string to compare
The closer the return value is to 0, the more similar the two words are.

Example :

sd := beda.NewStringDiff("abcd", "bc")
lDist := sd.LevenshteinDistance()
fmt.Printf("Distance is %d \n", lDist) // prints : Distance is 2

or

fmt.Printf("Distance is %d \n", beda.LevenshteinDistance("abcd", "bc"))
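
To make the definition concrete, here is a minimal, standalone sketch of the classic dynamic-programming computation of Levenshtein distance. It is an illustration only and not BEDA's source code; the names `levenshtein`, `min3` and `similarity` are hypothetical. The `similarity` helper shows one common way (an assumption here, not a BEDA API) to reduce the sensitivity to string length by dividing the distance by the length of the longer string.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the minimum number of single-rune insertions,
// deletions and substitutions needed to turn a into b.
// Hypothetical helper for illustration; not part of BEDA.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1) // distances for the previous row
	curr := make([]int, len(rb)+1) // distances for the current row
	for j := range prev {
		prev[j] = j // turning "" into rb[:j] takes j insertions
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // turning ra[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// similarity maps the distance into [0, 1] by dividing by the length
// of the longer string (an assumed normalization, one of several in use).
func similarity(a, b string) float64 {
	longest := len([]rune(a))
	if n := len([]rune(b)); n > longest {
		longest = n
	}
	if longest == 0 {
		return 1
	}
	return 1 - float64(levenshtein(a, b))/float64(longest)
}

func main() {
	fmt.Println(levenshtein("abcd", "bc"))         // 2
	fmt.Println(levenshtein("kitten", "sitting"))  // 3
	fmt.Printf("%.2f\n", similarity("abcd", "bc")) // 0.50
}
```

Working on runes rather than bytes keeps multi-byte characters from being counted as several edits.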

Damerau-Levenshtein Distance

(From Wikipedia) Damerau-Levenshtein Distance is a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.

The Damerau–Levenshtein distance differs from the classical Levenshtein distance by including transpositions among its allowable operations in addition to the three classical single-character edit operations (insertions, deletions and substitutions).

Reading :

API :

func DamerauLevenshteinDistance(s1, s2 string) int
func (sd *StringDiff) DamerauLevenshteinDistance(deleteCost, insertCost, replaceCost, swapCost int) int

func DamerauLevenshteinDistance takes 2 arguments:
s1 is the first string to compare
s2 is the second string to compare
The closer the return value is to 0, the more similar the two words are. This function uses the default value of 1 for deleteCost, insertCost, replaceCost and swapCost.

func (sd *StringDiff) DamerauLevenshteinDistance takes 4 arguments:
deleteCost is the multiplier factor for a delete operation
insertCost is the multiplier factor for an insert operation
replaceCost is the multiplier factor for a replace operation
swapCost is the multiplier factor for a swap operation
The multipliers let us weight how much each operation contributes to the distance.

Example :

sd := beda.NewStringDiff("abcd", "bc")
lDist := sd.DamerauLevenshteinDistance(1,1,1,1)
fmt.Printf("Distance is %d \n", lDist) // prints : Distance is 2

or

fmt.Printf("Distance is %d \n", beda.DamerauLevenshteinDistance("abcd", "bc"))
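
The effect of the cost multipliers can be seen in a small standalone sketch. It implements the restricted form of Damerau-Levenshtein (often called optimal string alignment) with per-operation weights. This is an assumption-laden illustration: the README does not say which variant BEDA implements, and the `costs` type and `osaDistance` function are hypothetical names, not BEDA APIs.

```go
package main

import "fmt"

// costs holds the weight of each edit operation.
// Hypothetical type for this sketch; not part of BEDA.
type costs struct{ del, ins, rep, swap int }

// osaDistance computes a weighted optimal-string-alignment distance:
// Levenshtein plus transposition of two adjacent runes.
func osaDistance(a, b string, c costs) int {
	ra, rb := []rune(a), []rune(b)
	d := make([][]int, len(ra)+1)
	for i := range d {
		d[i] = make([]int, len(rb)+1)
		d[i][0] = i * c.del // delete every rune of ra[:i]
	}
	for j := 0; j <= len(rb); j++ {
		d[0][j] = j * c.ins // insert every rune of rb[:j]
	}
	for i := 1; i <= len(ra); i++ {
		for j := 1; j <= len(rb); j++ {
			best := d[i-1][j-1] // match, or replace below
			if ra[i-1] != rb[j-1] {
				best += c.rep
			}
			if v := d[i-1][j] + c.del; v < best {
				best = v
			}
			if v := d[i][j-1] + c.ins; v < best {
				best = v
			}
			// Adjacent transposition: "th" <-> "ht".
			if i > 1 && j > 1 && ra[i-1] == rb[j-2] && ra[i-2] == rb[j-1] {
				if v := d[i-2][j-2] + c.swap; v < best {
					best = v
				}
			}
			d[i][j] = best
		}
	}
	return d[len(ra)][len(rb)]
}

func main() {
	unit := costs{del: 1, ins: 1, rep: 1, swap: 1}
	fmt.Println(osaDistance("abcd", "bc", unit))       // 2
	fmt.Println(osaDistance("martha", "marhta", unit)) // 1 (one swap)

	// Making swaps expensive pushes the algorithm back to two replacements.
	pricey := costs{del: 1, ins: 1, rep: 1, swap: 3}
	fmt.Println(osaDistance("martha", "marhta", pricey)) // 2
}
```

With unit costs, a transposed pair counts as one edit instead of the two that plain Levenshtein would report; raising `swap` removes that advantage.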

TriGram Compare

TrigramCompare is a case of n-gram comparison. An n-gram is a contiguous sequence of n (three, in this case) items from a given sample. In our case, each string being compared is a sample and each character is an item.

Reading:

API :

func TrigramCompare(s1, s2 string) float32
func (sd *StringDiff) TrigramCompare() float32

s1 is the first string to compare
s2 is the second string to compare
The closer the result is to 1 (one), the more similar the two words are in their 3-gram sequences; 1 means the trigrams are 100% identical.

Example :

sd := beda.NewStringDiff("martha", "marhta")
diff := sd.TrigramCompare()
fmt.Printf("Differences is %f \n", diff) 

or

fmt.Printf("Distance is %f \n", beda.TrigramCompare("martha", "marhta"))
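
As an illustration of the idea, the sketch below splits each string into its set of character trigrams and scores the overlap with the Jaccard index (shared trigrams divided by all distinct trigrams). The README does not state which formula BEDA uses to turn trigram overlap into a score, so the Jaccard choice and the names `trigrams` and `trigramSimilarity` are assumptions for this example, and the numbers may differ from what `beda.TrigramCompare` returns.

```go
package main

import "fmt"

// trigrams returns the set of 3-rune substrings of s.
// Hypothetical helper for illustration; not part of BEDA.
func trigrams(s string) map[string]struct{} {
	r := []rune(s)
	set := make(map[string]struct{})
	for i := 0; i+3 <= len(r); i++ {
		set[string(r[i:i+3])] = struct{}{}
	}
	return set
}

// trigramSimilarity scores two strings by the Jaccard index of their
// trigram sets: |shared| / |union|. Result is in [0, 1].
func trigramSimilarity(a, b string) float64 {
	ta, tb := trigrams(a), trigrams(b)
	shared := 0
	for g := range ta {
		if _, ok := tb[g]; ok {
			shared++
		}
	}
	union := len(ta) + len(tb) - shared
	if union == 0 {
		// Both strings are shorter than three runes, so there is
		// nothing to compare; fall back to exact equality.
		if a == b {
			return 1
		}
		return 0
	}
	return float64(shared) / float64(union)
}

func main() {
	fmt.Println(len(trigrams("martha"))) // 4: mar, art, rth, tha
	fmt.Printf("%.4f\n", trigramSimilarity("martha", "marhta")) // 0.1429
	fmt.Printf("%.4f\n", trigramSimilarity("martha", "martha")) // 1.0000
}
```

Note how a single transposition ("th" to "ht") destroys three of the four trigrams, which is why trigram scores tend to drop faster than edit-distance scores on short words.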

Jaro Distance

JaroDistance measures how similar two words are, based on the number of characters they have in common within a limited window and the number of transpositions among those matching characters.

API :

func JaroDistance(s1, s2 string) float32
func (sd *StringDiff) JaroDistance() float32

s1 is the first string to compare
s2 is the second string to compare
The closer the result is to 1 (one), the more similar the two words are; 1 means they are 100% similar.

Example :

sd := beda.NewStringDiff("martha", "marhta")
diff := sd.JaroDistance()
fmt.Printf("Differences is %f \n", diff) 

or

fmt.Printf("Distance is %f \n", beda.JaroDistance("martha", "marhta"))
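
For readers who want to see what goes into the score, here is a standalone sketch of the textbook Jaro similarity: count characters that match within a window of `max(len)/2 - 1` positions, count how many of those matches are out of order, and average three ratios. It is not BEDA's source code, the function name `jaro` is hypothetical, and BEDA's result may differ slightly in rounding because it returns `float32`.

```go
package main

import "fmt"

// jaro returns the textbook Jaro similarity of a and b in [0, 1].
// Hypothetical helper for illustration; not part of BEDA.
func jaro(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	if len(ra) == 0 && len(rb) == 0 {
		return 1
	}
	if len(ra) == 0 || len(rb) == 0 {
		return 0
	}
	longest := len(ra)
	if len(rb) > longest {
		longest = len(rb)
	}
	window := longest/2 - 1 // how far apart two matching runes may be
	if window < 0 {
		window = 0
	}
	matchedA := make([]bool, len(ra))
	matchedB := make([]bool, len(rb))
	matches := 0
	for i := range ra {
		lo, hi := i-window, i+window+1
		if lo < 0 {
			lo = 0
		}
		if hi > len(rb) {
			hi = len(rb)
		}
		for j := lo; j < hi; j++ {
			if !matchedB[j] && ra[i] == rb[j] {
				matchedA[i], matchedB[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}
	// Walk the matched runes of both strings in order; each position
	// where they disagree is half a transposition.
	k, half := 0, 0
	for i := range ra {
		if !matchedA[i] {
			continue
		}
		for !matchedB[k] {
			k++
		}
		if ra[i] != rb[k] {
			half++
		}
		k++
	}
	m, t := float64(matches), float64(half/2)
	return (m/float64(len(ra)) + m/float64(len(rb)) + (m-t)/m) / 3
}

func main() {
	fmt.Printf("%.4f\n", jaro("martha", "marhta")) // 0.9444
}
```

For "martha" and "marhta" all six characters match and one pair ("th" and "ht") is transposed, giving (1 + 1 + 5/6) / 3.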

Jaro Winkler Distance

JaroWinklerDistance uses a prefix scale which gives more favourable ratings to strings that match from the beginning for a set prefix length.

Reading :

API :

func JaroWinklerDistance(s1, s2 string) float32
func (sd *StringDiff) JaroWinklerDistance(p float32) float32

s1 is the first string to compare
s2 is the second string to compare
p is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. The standard value for this constant in Winkler's work is p = 0.1

The closer the result is to 1 (one), the more similar the two words are; 1 means they are 100% similar.

Example :

sd := beda.NewStringDiff("martha", "marhta")
diff := sd.JaroWinklerDistance(0.1)
fmt.Printf("Differences is %f \n", diff)

or

fmt.Printf("Distance is %f \n", beda.JaroWinklerDistance("martha", "marhta"))
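
The role of p can be shown with a short standalone sketch of the Winkler adjustment. Given a Jaro score j and a common prefix of length l, the usual formula is j + l * p * (1 - j), with l capped at 4 characters in Winkler's definition. The sketch takes the Jaro score as an input to stay small; the names `commonPrefixLen` and `winkler` are hypothetical, and whether BEDA applies the same cap of 4 is not stated in this README.

```go
package main

import "fmt"

// commonPrefixLen returns the number of leading runes a and b share,
// capped at 4 as in Winkler's definition.
// Hypothetical helper for illustration; not part of BEDA.
func commonPrefixLen(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	n := 0
	for n < len(ra) && n < len(rb) && n < 4 && ra[n] == rb[n] {
		n++
	}
	return n
}

// winkler boosts a Jaro score for strings that share a prefix:
// jw = j + l*p*(1-j), where l is the capped common prefix length.
func winkler(jaroScore float64, a, b string, p float64) float64 {
	l := float64(commonPrefixLen(a, b))
	return jaroScore + l*p*(1-jaroScore)
}

func main() {
	// 17/18 (about 0.9444) is the textbook Jaro score for this pair.
	j := 17.0 / 18.0
	fmt.Println(commonPrefixLen("martha", "marhta"))          // 3
	fmt.Printf("%.4f\n", winkler(j, "martha", "marhta", 0.1)) // 0.9611
}
```

With the cap of 4 and p = 0.1 the boost can never push the score above 1; larger values of p (above 0.25) could.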

Tasks and Help Wanted

Yes. We need contributors to make BEDA even better and more useful to the open source community.

If you really want to help us, simply fork the project and open a pull request. Please read our Contribution Manual and Code of Conduct.
