All Projects → MrPowers → spark-stringmetric

MrPowers / spark-stringmetric

Licence: MIT License
Spark functions to run popular phonetic and string matching algorithms

Programming Languages

scala
5932 projects

Projects that are alternatives of or similar to spark-stringmetric

strutil
Golang metrics for calculating string similarity and other string utility functions
Stars: ✭ 114 (+123.53%)
Mutual labels:  jaro-winkler, jaccard-similarity, hamming-distance
ceja
PySpark phonetic and string matching algorithms
Stars: ✭ 24 (-52.94%)
Mutual labels:  jaro-winkler, nysiis, hamming-distance
stringdistance
A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more..
Stars: ✭ 60 (+17.65%)
Mutual labels:  jaro-winkler, jaccard-similarity, hamming-distance
tika-similarity
Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features.
Stars: ✭ 92 (+80.39%)
Mutual labels:  jaccard-similarity, cosine-distance
stringosim
String similarity functions, String distance's, Jaccard, Levenshtein, Hamming, Jaro-Winkler, Q-grams, N-grams, LCS - Longest Common Subsequence, Cosine similarity...
Stars: ✭ 47 (-7.84%)
Mutual labels:  jaro-winkler, cosine-distance
Hyperspace
An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
Stars: ✭ 246 (+382.35%)
Mutual labels:  spark
Text-Similarity
A text similarity computation using minhashing and Jaccard distance on reuters dataset
Stars: ✭ 15 (-70.59%)
Mutual labels:  jaccard-similarity
Dpark
Python clone of Spark, a MapReduce alike framework in Python
Stars: ✭ 2,668 (+5131.37%)
Mutual labels:  spark
eddie
No description or website provided.
Stars: ✭ 18 (-64.71%)
Mutual labels:  jaro-winkler
Video Stream Analytics
Stars: ✭ 240 (+370.59%)
Mutual labels:  spark
lsh-semantic-similarity
Locality Sensitive Hashing for semantic similarity (Python 3.x)
Stars: ✭ 16 (-68.63%)
Mutual labels:  jaccard-similarity
Data Accelerator
Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.
Stars: ✭ 247 (+384.31%)
Mutual labels:  spark
Koalas
Koalas: pandas API on Apache Spark
Stars: ✭ 3,044 (+5868.63%)
Mutual labels:  spark
Neo4j Spark Connector
Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs
Stars: ✭ 245 (+380.39%)
Mutual labels:  spark
simetric
String similarity metrics for Elixir
Stars: ✭ 59 (+15.69%)
Mutual labels:  jaro-winkler
Recommendationsystem
Book recommender system using collaborative filtering based on Spark
Stars: ✭ 244 (+378.43%)
Mutual labels:  spark
edits.cr
Edit distance algorithms inc. Jaro, Damerau-Levenshtein, and Optimal Alignment
Stars: ✭ 16 (-68.63%)
Mutual labels:  jaro-winkler
Spark Jobserver
REST job server for Apache Spark
Stars: ✭ 2,748 (+5288.24%)
Mutual labels:  spark
visualize-data-with-python
A Jupyter notebook using some standard techniques for data science and data engineering to analyze data for the 2017 flooding in Houston, TX.
Stars: ✭ 60 (+17.65%)
Mutual labels:  spark
Every Single Day I Tldr
A daily digest of the articles or videos I've found interesting, that I want to share with you.
Stars: ✭ 249 (+388.24%)
Mutual labels:  spark

spark-stringmetric

CI

String similarity functions and phonetic algorithms for Spark.

See ceja if you're using PySpark.

Project Setup

Update your build.sbt file to import the libraries.

libraryDependencies += "org.apache.commons" % "commons-text" % "1.1"

// Spark 3
libraryDependencies += "com.github.mrpowers" %% "spark-stringmetric" % "0.4.0"

// Spark 2
libraryDependencies += "com.github.mrpowers" %% "spark-stringmetric" % "0.3.0"

You can find the spark-daria Scala 2.11 versions here and the Scala 2.12 versions here.

SimilarityFunctions

  • cosine_distance
  • fuzzy_score
  • hamming
  • jaccard_similarity
  • jaro_winkler

How to import the functions.

import com.github.mrpowers.spark.stringmetric.SimilarityFunctions._

Here's an example on how to use the jaccard_similarity function.

Suppose we have the following sourceDF:

+-------+-------+
|  word1|  word2|
+-------+-------+
|  night|  nacht|
|context|contact|
|   null|  nacht|
|   null|   null|
+-------+-------+

Let's run the jaccard_similarity function.

val actualDF = sourceDF.withColumn(
  "w1_w2_jaccard",
  jaccard_similarity(col("word1"), col("word2"))
)

We can run actualDF.show() to view the w1_w2_jaccard column that's been appended to the DataFrame.

+-------+-------+-------------+
|  word1|  word2|w1_w2_jaccard|
+-------+-------+-------------+
|  night|  nacht|         0.43|
|context|contact|         0.57|
|   null|  nacht|         null|
|   null|   null|         null|
+-------+-------+-------------+

PhoneticAlgorithms

  • double_metaphone
  • nysiis
  • refined_soundex

How to import the functions.

import com.github.mrpowers.spark.stringmetric.PhoneticAlgorithms._

Here's an example on how to use the refined_soundex function.

Suppose we have the following sourceDF:

+-----+
|word1|
+-----+
|night|
|  cat|
| null|
+-----+

Let's run the refined_soundex function.

val actualDF = sourceDF.withColumn(
  "word1_refined_soundex",
  refined_soundex(col("word1"))
)

We can run actualDF.show() to view the word1_refined_soundex column that's been appended to the DataFrame.

+-----+---------------------+
|word1|word1_refined_soundex|
+-----+---------------------+
|night|               N80406|
|  cat|                 C306|
| null|                 null|
+-----+---------------------+

API Documentation

Here is the latest API documentation.

Release

  1. Create GitHub tag

  2. Build documentation with sbt ghpagesPushSite

  3. Publish JAR

Run sbt to open the SBT console.

Run > ; + publishSigned; sonatypeBundleRelease to create the JAR files and release them to Maven. These commands are made available by the sbt-sonatype plugin.

After running the release command, you'll be prompted to enter your GPG passphrase.

The Sonatype credentials should be stored in the ~/.sbt/sonatype_credentials file in this format:

realm=Sonatype Nexus Repository Manager
host=oss.sonatype.org
user=$USERNAME
password=$PASSWORD

Post Maven release steps

  • Create a GitHub release/tag
  • Publish the updated documentation
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].