MrPowers / spark-stringmetric

Licence: MIT License

Spark functions to run popular phonetic and string matching algorithms

spark-stringmetric

String similarity functions and phonetic algorithms for Spark.

See ceja if you're using PySpark.

Project Setup

Add the following dependencies to your build.sbt file.

libraryDependencies += "org.apache.commons" % "commons-text" % "1.1"

// Spark 3
libraryDependencies += "com.github.mrpowers" %% "spark-stringmetric" % "0.4.0"

// Spark 2
libraryDependencies += "com.github.mrpowers" %% "spark-stringmetric" % "0.3.0"

Artifacts are published for both Scala 2.11 and Scala 2.12.

SimilarityFunctions

cosine_distance
fuzzy_score
hamming
jaccard_similarity
jaro_winkler

How to import the functions.

import com.github.mrpowers.spark.stringmetric.SimilarityFunctions._

Here's an example of how to use the jaccard_similarity function.

Suppose we have the following sourceDF:

+-------+-------+
|  word1|  word2|
+-------+-------+
|  night|  nacht|
|context|contact|
|   null|  nacht|
|   null|   null|
+-------+-------+

Let's run the jaccard_similarity function.

val actualDF = sourceDF.withColumn(
  "w1_w2_jaccard",
  jaccard_similarity(col("word1"), col("word2"))
)

We can run actualDF.show() to view the w1_w2_jaccard column that's been appended to the DataFrame.

+-------+-------+-------------+
|  word1|  word2|w1_w2_jaccard|
+-------+-------+-------------+
|  night|  nacht|         0.43|
|context|contact|         0.57|
|   null|  nacht|         null|
|   null|   null|         null|
+-------+-------+-------------+
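
To make the numbers in the w1_w2_jaccard column easier to interpret, here is a minimal plain-Scala sketch of a character-set Jaccard similarity. It assumes the column is computed the way Apache Commons Text's JaccardSimilarity does it (distinct characters, result rounded to two decimals); that is an assumption based on the commons-text dependency above, not something this README states. The names JaccardSketch and jaccard are hypothetical and are not part of spark-stringmetric.

```scala
object JaccardSketch {
  // Hypothetical helper, not the library's API.
  // Jaccard similarity over the distinct characters of two strings,
  // rounded to two decimals. Returns None if either input is null,
  // mirroring the null rows in the table above.
  def jaccard(left: String, right: String): Option[Double] =
    for {
      l <- Option(left)
      r <- Option(right)
    } yield {
      val a = l.toSet
      val b = r.toSet
      val union = a union b
      // Assumption: two empty strings score 0.0 in this sketch.
      if (union.isEmpty) 0.0
      else math.round((a intersect b).size.toDouble / union.size * 100) / 100.0
    }

  def main(args: Array[String]): Unit = {
    println(jaccard("night", "nacht"))     // Some(0.43): 3 shared chars out of 7 distinct
    println(jaccard("context", "contact")) // Some(0.57): 4 shared chars out of 7 distinct
    println(jaccard(null, "nacht"))        // None
  }
}
```

For night and nacht, the shared characters are n, h, and t, and the union has seven distinct characters, so the score is 3/7, which rounds to 0.43.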

PhoneticAlgorithms

double_metaphone
nysiis
refined_soundex

How to import the functions.

import com.github.mrpowers.spark.stringmetric.PhoneticAlgorithms._

Here's an example of how to use the refined_soundex function.

Suppose we have the following sourceDF:

+-----+
|word1|
+-----+
|night|
|  cat|
| null|
+-----+

Let's run the refined_soundex function.

val actualDF = sourceDF.withColumn(
  "word1_refined_soundex",
  refined_soundex(col("word1"))
)

We can run actualDF.show() to view the word1_refined_soundex column that's been appended to the DataFrame.

+-----+---------------------+
|word1|word1_refined_soundex|
+-----+---------------------+
|night|               N80406|
|  cat|                 C306|
| null|                 null|
+-----+---------------------+
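
The codes in the table can be reproduced by hand. The plain-Scala sketch below assumes refined_soundex follows the same scheme as Apache Commons Codec's RefinedSoundex (keep the first letter, then append one digit per letter, collapsing consecutive repeated digits); this README does not confirm which implementation the library uses. RefinedSoundexSketch and refinedSoundex are hypothetical names, not part of spark-stringmetric.

```scala
import java.util.Locale

object RefinedSoundexSketch {
  // Digit assigned to each letter A..Z (the US English mapping
  // used by Apache Commons Codec's RefinedSoundex).
  private val Mapping = "01360240043788015936020505"

  // Hypothetical helper, not the library's API.
  // Returns None for null input, mirroring the null row in the table above.
  def refinedSoundex(word: String): Option[String] =
    Option(word).map { w =>
      // Keep only ASCII letters, upper-cased.
      val letters = w.toUpperCase(Locale.ENGLISH).filter(c => c >= 'A' && c <= 'Z')
      if (letters.isEmpty) ""
      else {
        // One digit per letter, including the first letter.
        val codes = letters.map(c => Mapping(c - 'A'))
        // Collapse runs of the same digit into a single digit.
        val deduped = codes
          .foldLeft(List.empty[Char]) { (acc, d) =>
            if (acc.headOption.contains(d)) acc else d :: acc
          }
          .reverse
        s"${letters.head}${deduped.mkString}"
      }
    }

  def main(args: Array[String]): Unit = {
    println(refinedSoundex("night")) // Some(N80406)
    println(refinedSoundex("cat"))   // Some(C306)
    println(refinedSoundex(null))    // None
  }
}
```

For night, the letters N, I, G, H, T map to 8, 0, 4, 0, 6, which gives N80406 once the leading letter is prepended.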

API Documentation

Here is the latest API documentation.

Release

Create GitHub tag
Build documentation with sbt ghpagesPushSite
Publish JAR

Run sbt to open the sbt shell.

At the sbt prompt (>), run ; + publishSigned; sonatypeBundleRelease to create the JAR files and release them to Maven. These commands are made available by the sbt-sonatype plugin.

After running the release command, you'll be prompted to enter your GPG passphrase.

The Sonatype credentials should be stored in the ~/.sbt/sonatype_credentials file in this format:

realm=Sonatype Nexus Repository Manager
host=oss.sonatype.org
user=$USERNAME
password=$PASSWORD

Post Maven release steps

Create a GitHub release/tag
Publish the updated documentation
