478 open source projects that are alternatives to or similar to splink

Resources for tackling record linkage / deduplication / data matching problems

Stars: ✭ 67 (-62.98%)

Mutual labels: record-linkage, entity-resolution, deduplication, data-matching

PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.

Stars: ✭ 96 (-46.96%)

Mutual labels: record-linkage, entity-resolution, deduplication, data-matching

zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML

Stars: ✭ 655 (+261.88%)

Mutual labels: entity-resolution, fuzzy-matching, deduplication

Talisman

Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.

Stars: ✭ 584 (+222.65%)

Mutual labels: fuzzy-matching, deduplication

Data Matching Software

A list of free data matching and record linkage software.

Stars: ✭ 206 (+13.81%)

Mutual labels: fuzzy-matching, deduplication

stance

Learned string similarity for entity names using optimal transport.

Stars: ✭ 27 (-85.08%)

Mutual labels: record-linkage, entity-resolution

Libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.

Stars: ✭ 3,312 (+1729.83%)

Mutual labels: record-linkage, deduplication

snowman

Welcome to Snowman App – a Data Matching Benchmark Platform.

Stars: ✭ 25 (-86.19%)

Mutual labels: entity-resolution, data-matching

Merge-Machine

Merge Dirty Data with Clean Reference Tables

Stars: ✭ 35 (-80.66%)

Mutual labels: record-linkage, entity-resolution

Dedupe

🆔 A Python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.

Stars: ✭ 3,241 (+1690.61%)

Mutual labels: record-linkage, entity-resolution

Spark Lucenerdd

Spark RDD with Lucene's query and entity linkage capabilities

Stars: ✭ 114 (-37.02%)

Mutual labels: spark, deduplication

fuzzy-match

Library and command line utility to do approximate string matching of a source against a bitext index and get matched source and target.

Stars: ✭ 31 (-82.87%)

Mutual labels: fuzzy-matching

spaczz

Fuzzy matching and more functionality for spaCy.

Stars: ✭ 215 (+18.78%)

Mutual labels: fuzzy-matching

levenshtein.c

Levenshtein algorithm in C

Stars: ✭ 77 (-57.46%)

Mutual labels: fuzzy-matching

Spark Jobserver

REST job server for Apache Spark

Stars: ✭ 2,748 (+1418.23%)

Mutual labels: spark

fuzzy-search

A collection of algorithms for fuzzy search like in Sublime Text.

Stars: ✭ 49 (-72.93%)

Mutual labels: fuzzy-matching

cargo-limit

Cargo with less noise: warnings are skipped until errors are fixed, Neovim integration, etc.

Stars: ✭ 105 (-41.99%)

Mutual labels: deduplication

Spark Fast Tests

Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)

Stars: ✭ 249 (+37.57%)

Mutual labels: spark

Hyperspace

An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.

Stars: ✭ 246 (+35.91%)

Mutual labels: spark

stringdistance

A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more.

Stars: ✭ 60 (-66.85%)

Mutual labels: fuzzy-matching

Dpark

Python clone of Spark, a MapReduce-like framework

Stars: ✭ 2,668 (+1374.03%)

Mutual labels: spark

Video Stream Analytics

Stars: ✭ 240 (+32.6%)

Mutual labels: spark

visualize-data-with-python

A Jupyter notebook using some standard techniques for data science and data engineering to analyze data for the 2017 flooding in Houston, TX.

Stars: ✭ 60 (-66.85%)

Mutual labels: spark

Hadoop Docker

A Docker-based Hadoop development and test environment, including Hadoop, Hive, HBase, and Spark

Stars: ✭ 238 (+31.49%)

Mutual labels: spark

zpaqfranz

Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix

Stars: ✭ 86 (-52.49%)

Mutual labels: deduplication

Azure Event Hubs

☁️ Cloud-scale telemetry ingestion from any stream of data with Azure Event Hubs

Stars: ✭ 233 (+28.73%)

Mutual labels: spark

tsa4

R code for Time Series Analysis and Its Applications, Ed 4

Stars: ✭ 108 (-40.33%)

Mutual labels: em-algorithm

fish-fzy

fzy integration with fish. Search history, navigate directories and more. Blazingly fast.

Stars: ✭ 18 (-90.06%)

Mutual labels: fuzzy-matching

nomenklatura

Framework and command-line tools for integrating FollowTheMoney data streams from multiple sources

Stars: ✭ 158 (-12.71%)

Mutual labels: deduplication

whatis

WhatIs.this: simple entity resolution through Wikipedia

Stars: ✭ 18 (-90.06%)

Mutual labels: entity-resolution

Koalas

Koalas: pandas API on Apache Spark

Stars: ✭ 3,044 (+1581.77%)

Mutual labels: spark

yadf

Yet Another Dupes Finder

Stars: ✭ 32 (-82.32%)

Mutual labels: deduplication

Every Single Day I Tldr

A daily digest of the articles or videos I've found interesting, that I want to share with you.

Stars: ✭ 249 (+37.57%)

Mutual labels: spark

spark-lucenerdd-examples

Examples of spark-lucenerdd

Stars: ✭ 15 (-91.71%)

Mutual labels: record-linkage

Data Accelerator

Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.

Stars: ✭ 247 (+36.46%)

Mutual labels: spark

machine learning

Stars: ✭ 29 (-83.98%)

Mutual labels: em-algorithm

Neo4j Spark Connector

Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs

Stars: ✭ 245 (+35.36%)

Mutual labels: spark

Installations mac ubuntu windows

Installations for Data Science. Anaconda, RStudio, Spark, TensorFlow, AWS (Amazon Web Services).

Stars: ✭ 231 (+27.62%)

Mutual labels: spark

Recommendationsystem

Book recommender system using collaborative filtering based on Spark

Stars: ✭ 244 (+34.81%)

Mutual labels: spark

Yoyo-leaf

Yoyo-leaf is an awesome command-line fuzzy finder.

Stars: ✭ 49 (-72.93%)

Mutual labels: fuzzy-matching

Spark.fish

▁▂▄▆▇█▇▆▄▂▁

Stars: ✭ 229 (+26.52%)

Mutual labels: spark

Ruby Spark

Ruby wrapper for Apache Spark

Stars: ✭ 221 (+22.1%)

Mutual labels: spark

Mastering Spark Sql Book

The Internals of Spark SQL

Stars: ✭ 234 (+29.28%)

Mutual labels: spark

fuzzychinese

A small package to fuzzy match Chinese words

Stars: ✭ 50 (-72.38%)

Mutual labels: fuzzy-matching

Mydatascienceportfolio

Applying Data Science and Machine Learning to Solve Real World Business Problems

Stars: ✭ 227 (+25.41%)

Mutual labels: spark

fuzzy-matcher

Fuzzy Matching Library for Rust

Stars: ✭ 140 (-22.65%)

Mutual labels: fuzzy-matching

Spark Workshop

Apache Spark™ and Scala Workshops

Stars: ✭ 224 (+23.76%)

Mutual labels: spark

fuzzywuzzy

Fuzzy string matching for PHP

Stars: ✭ 60 (-66.85%)

Mutual labels: fuzzy-matching

conciliator

OpenRefine reconciliation services for VIAF, ORCID, and Open Library + framework for creating more.

Stars: ✭ 95 (-47.51%)

Mutual labels: entity-resolution

Sagemaker Spark

A Spark library for Amazon SageMaker.

Stars: ✭ 219 (+20.99%)

Mutual labels: spark

IntraArchiveDeduplicator

Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation for fuzzy image searching.

Stars: ✭ 87 (-51.93%)

Mutual labels: deduplication

mail-deduplicate

📧 CLI to deduplicate mails from mail boxes.

Stars: ✭ 134 (-25.97%)

Mutual labels: deduplication

Spark Excel

A Spark plugin for reading Excel files via Apache POI

Stars: ✭ 216 (+19.34%)

Mutual labels: spark

Gimel

Big Data Processing Framework - Unified Data API or SQL on Any Storage

Stars: ✭ 216 (+19.34%)

Mutual labels: spark

Neural-Scam-Artist

Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

Stars: ✭ 18 (-90.06%)

Mutual labels: deduplication

Sparkrdma

RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark

Stars: ✭ 215 (+18.78%)

Mutual labels: spark

Hydro Serving

MLOps Platform

Stars: ✭ 213 (+17.68%)

Mutual labels: spark

machine-learning

Python machine learning applications in image processing, recommender system, matrix completion, netflix problem and algorithm implementations including Co-clustering, Funk SVD, SVD++, Non-negative Matrix Factorization, Koren Neighborhood Model, Koren Integrated Model, Dawid-Skene, Platt-Burges, Expectation Maximization, Factor Analysis, ISTA, F…

Stars: ✭ 91 (-49.72%)

Mutual labels: em-algorithm

deduplication

Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in JavaScript. Ships with extensive tests, a fuzz test and a benchmark.

Stars: ✭ 59 (-67.4%)

Mutual labels: deduplication

Example Spark

Spark, Spark Streaming and Spark SQL unit testing strategies

Stars: ✭ 205 (+13.26%)

Mutual labels: spark

1-60 of 478 similar projects
