
MrPowers / ceja

Licence: other
PySpark phonetic and string matching algorithms


ceja

PySpark phonetic, stemming, and string matching algorithms. Use the power of PySpark to run these algorithms on massive datasets!

Installation and basic usage

Run pip install ceja to install the library.

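Import the library with import ceja. You can then call functions such as ceja.nysiis and ceja.jaro_winkler_similarity. The examples below assume an active SparkSession named spark and that col has been imported from pyspark.sql.functions.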

Public interface summary

  • Phonetic algorithms
    • nysiis
    • metaphone
    • match_rating_codex
  • Stemming
    • porter_stem
  • String similarity
    • damerau_levenshtein_distance
    • hamming_distance
    • jaro_similarity
    • jaro_winkler_similarity
    • match_rating_comparison

Phonetic algorithms

NYSIIS

data = [
    ("jellyfish",),
    ("li",),
    ("luisa",),
    (None,)
]
df = spark.createDataFrame(data, ["word"])
actual_df = df.withColumn("word_nysiis", ceja.nysiis(col("word")))
actual_df.show()
+---------+-----------+
|     word|word_nysiis|
+---------+-----------+
|jellyfish|      JALYF|
|       li|          L|
|    luisa|        LAS|
|     null|       null|
+---------+-----------+
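
The null row above comes back as null instead of raising an error. A minimal pure-Python sketch of that null-in, null-out behavior follows. The null_safe helper and the shout function are hypothetical names used only for illustration; they are not part of ceja's API, and this is not necessarily how ceja handles nulls internally.

```python
from functools import wraps


def null_safe(fn):
    """Return a version of fn that yields None when any argument is None."""
    @wraps(fn)
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return fn(*args)
    return wrapper


# Hypothetical example: a plain string function that would fail on None.
@null_safe
def shout(word):
    return word.upper()


print(shout("li"))   # LI
print(shout(None))   # None
```

A function wrapped this way could be registered as a PySpark UDF so that null cells stay null in the output column.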

Metaphone

data = [
    ("jellyfish",),
    ("li",),
    ("luisa",),
    ("Klumpz",),
    ("Clumps",),
    (None,)
]
df = spark.createDataFrame(data, ["word"])
actual_df = df.withColumn("word_metaphone", ceja.metaphone(col("word")))
actual_df.show()
+---------+--------------+
|     word|word_metaphone|
+---------+--------------+
|jellyfish|          JLFX|
|       li|             L|
|    luisa|            LS|
|   Klumpz|         KLMPS|
|   Clumps|         KLMPS|
|     null|          null|
+---------+--------------+

Match rating codex

data = [
    ("jellyfish",),
    ("li",),
    ("luisa",),
    (None,)
]
df = spark.createDataFrame(data, ["word"])
actual_df = df.withColumn("word_match_rating_codex", ceja.match_rating_codex(col("word")))
actual_df.show()
+---------+-----------------------+
|     word|word_match_rating_codex|
+---------+-----------------------+
|jellyfish|                 JLYFSH|
|       li|                      L|
|    luisa|                     LS|
|     null|                   null|
+---------+-----------------------+
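
To make the codex values above easier to interpret, here is a pure-Python sketch of the commonly described Match Rating Approach encoding rules: drop vowels unless they start the word, drop the second of two identical adjacent consonants, and keep only the first three and last three characters when more than six remain. The function name match_rating_codex_sketch is an assumption for illustration; this is not ceja's own implementation.

```python
VOWELS = "AEIOU"


def match_rating_codex_sketch(word):
    """Encode a word with the Match Rating Approach rules described above."""
    if word is None:
        return None
    codex = []
    prev = None
    for i, ch in enumerate(word.upper()):
        if ch in VOWELS:
            keep = i == 0          # a vowel survives only as the first letter
        else:
            keep = ch != prev      # skip a consonant that repeats its neighbor
        if ch != " " and keep:
            codex.append(ch)
        prev = ch
    if len(codex) > 6:
        codex = codex[:3] + codex[-3:]  # first three plus last three characters
    return "".join(codex)


for w in ["jellyfish", "li", "luisa", None]:
    print(w, match_rating_codex_sketch(w))
```

For the inputs in the table this sketch yields JLYFSH, L, LS, and None.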

Stemming algorithms

Porter stem

data = [
    ("chocolates",),
    ("chocolatey",),
    ("choco",),
    (None,)
]
df = spark.createDataFrame(data, ["word"])
actual_df = df.withColumn("word_porter_stem", ceja.porter_stem(col("word")))
actual_df.show()
+----------+----------------+
|      word|word_porter_stem|
+----------+----------------+
|chocolates|          chocol|
|chocolatey|      chocolatei|
|     choco|           choco|
|      null|            null|
+----------+----------------+

Similarity algorithms

Damerau-Levenshtein distance

data = [
    ("jellyfish", "smellyfish"),
    ("li", "lee"),
    ("luisa", "bruna"),
    (None, None)
]
df = spark.createDataFrame(data, ["word1", "word2"])
actual_df = df.withColumn("damerau_levenshtein_distance", ceja.damerau_levenshtein_distance(col("word1"), col("word2")))
actual_df.show()
+---------+----------+----------------------------+
|    word1|     word2|damerau_levenshtein_distance|
+---------+----------+----------------------------+
|jellyfish|smellyfish|                           2|
|       li|       lee|                           2|
|    luisa|     bruna|                           4|
|     null|      null|                        null|
+---------+----------+----------------------------+
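
The distances above count the insertions, deletions, substitutions, and adjacent transpositions needed to turn one word into the other. The sketch below implements the optimal string alignment variant (sometimes called restricted Damerau-Levenshtein) in pure Python. It is a simplified stand-in, not ceja's implementation, and it can differ from the unrestricted algorithm on some inputs, although it agrees with the table for these word pairs. The name osa_distance is an assumption for illustration.

```python
def osa_distance(s1, s2):
    """Optimal string alignment distance between two strings (None-safe)."""
    if s1 is None or s2 is None:
        return None
    rows, cols = len(s1) + 1, len(s2) + 1
    # d[i][j] holds the distance between s1[:i] and s2[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ab" -> "ba".
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("jellyfish", "smellyfish"))  # 2
print(osa_distance("li", "lee"))                # 2
print(osa_distance("luisa", "bruna"))           # 4
```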

Hamming distance

data = [
    ("jellyfish", "smellyfish"),
    ("li", "lee"),
    ("luisa", "bruna"),
    (None, None)
]
df = spark.createDataFrame(data, ["word1", "word2"])
actual_df = df.withColumn("hamming_distance", ceja.hamming_distance(col("word1"), col("word2")))
actual_df.show()
+---------+----------+----------------+
|    word1|     word2|hamming_distance|
+---------+----------+----------------+
|jellyfish|smellyfish|               9|
|       li|       lee|               2|
|    luisa|     bruna|               4|
|     null|      null|            null|
+---------+----------+----------------+
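
Note that the word pairs above have different lengths, yet a distance is still returned. The values in the table are consistent with counting every position where the characters differ and then adding the length difference. The pure-Python sketch below reproduces the table under that assumption; the function name is hypothetical and this is not ceja's own code.

```python
def hamming_distance_sketch(s1, s2):
    """Count differing positions, treating extra trailing characters as differences."""
    if s1 is None or s2 is None:
        return None
    # zip stops at the shorter string, so compare the overlapping part first.
    mismatches = sum(a != b for a, b in zip(s1, s2))
    # Each character beyond the shorter string counts as one more difference.
    return mismatches + abs(len(s1) - len(s2))


print(hamming_distance_sketch("jellyfish", "smellyfish"))  # 9
print(hamming_distance_sketch("li", "lee"))                # 2
print(hamming_distance_sketch("luisa", "bruna"))           # 4
```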

Jaro similarity

data = [
    ("jellyfish", "smellyfish"),
    ("li", "lee"),
    ("luisa", "bruna"),
    ("hi", "colombia"),
    (None, None)
]
df = spark.createDataFrame(data, ["word1", "word2"])
actual_df = df.withColumn("jaro_similarity", ceja.jaro_similarity(col("word1"), col("word2")))
actual_df.show()
+---------+----------+---------------+
|    word1|     word2|jaro_similarity|
+---------+----------+---------------+
|jellyfish|smellyfish|      0.8962963|
|       li|       lee|      0.6111111|
|    luisa|     bruna|            0.6|
|       hi|  colombia|            0.0|
|     null|      null|           null|
+---------+----------+---------------+
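
The scores above follow the standard Jaro definition: with m matching characters (found within a window of half the longer length minus one) and t transpositions, the similarity is (m/|s1| + m/|s2| + (m - t)/m) / 3. The sketch below is a pure-Python rendering of that textbook formula, not ceja's implementation; the function name is an assumption, and returning 0.0 for empty strings is a convention chosen for this sketch.

```python
def jaro_similarity_sketch(s1, s2):
    """Textbook Jaro similarity in [0.0, 1.0]; returns None if either input is None."""
    if s1 is None or s2 is None:
        return None
    if not s1 or not s2:
        return 0.0  # convention chosen for this sketch
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up out of order count as half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1)
            + matches / len(s2)
            + (matches - transpositions) / matches) / 3


for a, b in [("jellyfish", "smellyfish"), ("li", "lee"),
             ("luisa", "bruna"), ("hi", "colombia")]:
    print(a, b, round(jaro_similarity_sketch(a, b), 7))
```

Rounded to seven decimal places, the sketch agrees with the table for all four non-null pairs.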

Jaro-Winkler similarity

data = [
    ("jellyfish", "smellyfish"),
    ("li", "lee"),
    ("luisa", "bruna"),
    (None, None)
]
df = spark.createDataFrame(data, ["word1", "word2"])
actual_df = df.withColumn("jaro_winkler_similarity", ceja.jaro_winkler_similarity(col("word1"), col("word2")))
actual_df.show()
+---------+----------+-----------------------+
|    word1|     word2|jaro_winkler_similarity|
+---------+----------+-----------------------+
|jellyfish|smellyfish|              0.8962963|
|       li|       lee|              0.6111111|
|    luisa|     bruna|                    0.6|
|     null|      null|                   null|
+---------+----------+-----------------------+
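
These values equal the plain Jaro scores for the same pairs. Jaro-Winkler adds a bonus of prefix_length * 0.1 * (1 - jaro) for a shared prefix of up to four characters, and "jellyfish"/"smellyfish" and "luisa"/"bruna" share no prefix. "li" and "lee" do share "l", so the unchanged score would be consistent with a boost threshold that applies the bonus only when the Jaro score exceeds 0.7. That threshold is an assumption here, not something confirmed from ceja's source. The sketch below shows the adjustment step under that assumption; the function name and parameter names are hypothetical.

```python
def winkler_adjust(jaro_score, s1, s2, scaling=0.1, max_prefix=4,
                   boost_threshold=0.7):
    """Apply the Winkler prefix bonus to an already computed Jaro score."""
    # Assumed behavior: scores at or below the threshold are left untouched.
    if jaro_score <= boost_threshold:
        return jaro_score
    # Count the shared prefix, capped at max_prefix characters.
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return jaro_score + prefix * scaling * (1 - jaro_score)


print(winkler_adjust(0.6111111, "li", "lee"))                # 0.6111111
print(winkler_adjust(0.8962963, "jellyfish", "smellyfish"))  # 0.8962963
```

For a pair with a shared prefix and a high Jaro score, such as "martha" and "marhta" (Jaro 17/18), the same function raises the score to roughly 0.9611.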

Match rating comparison

data = [
    ("mat", "matt"),
    ("there", "their"),
    ("luisa", "bruna"),
    (None, None)
]
df = spark.createDataFrame(data, ["word1", "word2"])
actual_df = df.withColumn("match_rating_comparison", ceja.match_rating_comparison(col("word1"), col("word2")))
actual_df.show()
+-----+-----+-----------------------+
|word1|word2|match_rating_comparison|
+-----+-----+-----------------------+
|  mat| matt|                   true|
|there|their|                   true|
|luisa|bruna|                  false|
| null| null|                   null|
+-----+-----+-----------------------+

Contributing

Contributions are welcome and encouraged. Feel free to open issues or send pull requests.

If you make a lot of good contributions, you'll be granted push access to the repo.

The most valuable contributions would be implementing these functions as native Spark functions.
