moj-analytical-services / splink

Licence: MIT License
Implementation of Fellegi-Sunter's canonical model of record linkage in Apache Spark, including EM algorithm to estimate parameters

Programming Languages

Python, Roff

Projects that are alternatives of or similar to splink

record-linkage-resources
Resources for tackling record linkage / deduplication / data matching problems
Stars: ✭ 67 (-62.98%)
Mutual labels:  record-linkage, entity-resolution, deduplication, data-matching
entity-embed
PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
Stars: ✭ 96 (-46.96%)
Mutual labels:  record-linkage, entity-resolution, deduplication, data-matching
zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
Stars: ✭ 655 (+261.88%)
Mutual labels:  entity-resolution, fuzzy-matching, deduplication
Data Matching Software
A list of free data matching and record linkage software.
Stars: ✭ 206 (+13.81%)
Mutual labels:  fuzzy-matching, deduplication
Talisman
Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
Stars: ✭ 584 (+222.65%)
Mutual labels:  fuzzy-matching, deduplication
snowman
Welcome to Snowman App – a Data Matching Benchmark Platform.
Stars: ✭ 25 (-86.19%)
Mutual labels:  entity-resolution, data-matching
stance
Learned string similarity for entity names using optimal transport.
Stars: ✭ 27 (-85.08%)
Mutual labels:  record-linkage, entity-resolution
Libpostal
A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
Stars: ✭ 3,312 (+1729.83%)
Mutual labels:  record-linkage, deduplication
Dedupe
🆔 A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
Stars: ✭ 3,241 (+1690.61%)
Mutual labels:  record-linkage, entity-resolution
Spark Lucenerdd
Spark RDD with Lucene's query and entity linkage capabilities
Stars: ✭ 114 (-37.02%)
Mutual labels:  spark, deduplication
Merge-Machine
Merge Dirty Data with Clean Reference Tables
Stars: ✭ 35 (-80.66%)
Mutual labels:  record-linkage, entity-resolution
whatis
WhatIs.this: simple entity resolution through Wikipedia
Stars: ✭ 18 (-90.06%)
Mutual labels:  entity-resolution
fish-fzy
fzy integration with fish. Search history, navigate directories and more. Blazingly fast.
Stars: ✭ 18 (-90.06%)
Mutual labels:  fuzzy-matching
fuzzy-match
Library and command line utility to do approximate string matching of a source against a bitext index and get matched source and target.
Stars: ✭ 31 (-82.87%)
Mutual labels:  fuzzy-matching
yadf
Yet Another Dupes Finder
Stars: ✭ 32 (-82.32%)
Mutual labels:  deduplication
spark-lucenerdd-examples
Examples of spark-lucenerdd
Stars: ✭ 15 (-91.71%)
Mutual labels:  record-linkage
fuzzy-search
A collection of algorithms for fuzzy search like in Sublime Text.
Stars: ✭ 49 (-72.93%)
Mutual labels:  fuzzy-matching
cargo-limit
Cargo with less noise: warnings are skipped until errors are fixed, Neovim integration, etc.
Stars: ✭ 105 (-41.99%)
Mutual labels:  deduplication
visualize-data-with-python
A Jupyter notebook using some standard techniques for data science and data engineering to analyze data for the 2017 flooding in Houston, TX.
Stars: ✭ 60 (-66.85%)
Mutual labels:  spark
ml
machine learning
Stars: ✭ 29 (-83.98%)
Mutual labels:  em-algorithm


Note to new users:

Version 3 of Splink, which will make the library simpler and more intuitive to use, is in development. It also removes the need for PySpark for smaller data linkages of up to around 1 million records. You can try it by installing a pre-release, or via the new demos here. For new users, it may make sense to work with the new version, because it is quicker to learn. However, note that the new code is not yet fully tested.

Splink: Probabilistic record linkage and deduplication at scale

splink implements Fellegi-Sunter's canonical model of record linkage in Apache Spark, including the EM algorithm to estimate parameters of the model.
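
To make the model concrete, here is a small, self-contained sketch of the Fellegi-Sunter arithmetic (this is not splink's API, and the m and u values are invented for illustration). For each comparison, m is the probability of agreement given the records match, and u is the probability of agreement given they do not; the ratio m/u acts as a Bayes factor, and log2(m/u) is that comparison's match weight:

```python
import math

# Invented m and u values for two comparisons, e.g. surname and
# date of birth agreeing. m = P(agree | match), u = P(agree | non-match).
comparisons = [(0.9, 0.01), (0.8, 0.05)]

prior = 1 / 1000                      # assumed P(match) for a random pair
prior_odds = prior / (1 - prior)

posterior_odds = prior_odds
for m, u in comparisons:
    posterior_odds *= m / u           # multiply in each Bayes factor

match_probability = posterior_odds / (1 + posterior_odds)
total_match_weight = sum(math.log2(m / u) for m, u in comparisons)

print(f"total match weight: {total_match_weight:.2f}")   # ~10.49
print(f"match probability:  {match_probability:.3f}")    # ~0.590
```

In splink itself, the m and u parameters are not supplied by hand but estimated from the data via the EM algorithm.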

It:

  • Works at much greater scale than current open source implementations (100 million records+).

  • Runs quickly - with runtimes of less than an hour.

  • Has a highly transparent methodology; match scores can be easily explained both graphically and in words.

  • Is highly accurate.

It is assumed that users of Splink are familiar with probabilistic record linkage theory, and with the Fellegi-Sunter model in particular. A series of interactive articles explores the theory behind Splink.

The statistical model behind splink is the same as that used in the R fastLink package. The fastLink package is accompanied by an academic paper that describes this model, which is the best place to start for users wanting to understand the theory of how splink works.

Data Matching, a book by Peter Christen, is another excellent resource.

Installation

splink is a Python package. It uses the Spark Python API to execute data linking jobs in a Spark cluster. It has been tested on Apache Spark 2.3, 2.4 and 3.1.

Install splink using:

```
pip install splink
```

Note that Splink requires pyspark and a working Spark installation. These are not specified as explicit dependencies because it is assumed users have an existing pyspark setup they wish to use.
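
As a rough end-to-end sketch, a minimal deduplication job with the Splink 2.x API looks something like the following. The file path, column names and blocking rule are invented for illustration; the splink_demos notebooks are the authoritative reference:

```python
from pyspark.sql import SparkSession
from splink import Splink

spark = SparkSession.builder.appName("splink-example").getOrCreate()

# Hypothetical input: a single DataFrame to deduplicate,
# with a unique_id column identifying each record.
df = spark.read.parquet("path/to/records.parquet")

settings = {
    "link_type": "dedupe_only",
    # Compare only pairs that agree on surname, to limit the comparison space
    "blocking_rules": ["l.surname = r.surname"],
    "comparison_columns": [
        {"col_name": "first_name"},
        {"col_name": "dob"},
    ],
}

linker = Splink(settings, df, spark)

# Pairwise comparisons scored with match probabilities; the model's
# m and u parameters are estimated by expectation maximisation.
df_scored = linker.get_scored_comparisons()
```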

Interactive demo

You can run demos of splink in an interactive Jupyter notebook by clicking the button below:

Binder

Documentation

The best documentation is currently the series of demonstration notebooks in the splink_demos repo.

Other tools in the Splink family

Splink Graph

splink_graph is a graph utility library for use in Apache Spark. It computes graph metrics on the outputs of data linking; the repo is here. It is useful for the following (a toy sketch follows the list):

  • Quality assurance of linkage results and identifying false positive links
  • Computing quality metrics associated with groups (clusters) of linked records
  • Automatically identifying possible false positive links in clusters
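
As a toy illustration of the cluster metrics idea (this is not splink_graph's API, which runs in Spark; the sketch below uses networkx on an invented edge list of record pairs scored above some match threshold):

```python
import networkx as nx

# Invented pairs of record ids whose match probability exceeded a threshold
linked_pairs = [("a", "b"), ("b", "c"), ("d", "e")]

g = nx.Graph()
g.add_edges_from(linked_pairs)

# Each connected component is a cluster of linked records.
for cluster in nx.connected_components(g):
    sub = g.subgraph(cluster)
    density = nx.density(sub)  # low density can flag possible false positive links
    print(f"cluster={sorted(cluster)} size={sub.number_of_nodes()} density={density:.2f}")
```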

Splink Comparison Viewer

splink_comparison_viewer produces interactive dashboards that help you rapidly understand and quality assure the outputs of record linkage. A tutorial video is available here.

Splink Cluster Studio

splink_cluster_studio creates an interactive html dashboard from Splink output that allows you to visualise and analyse a sample of clusters from your record linkage. The repo is here.

Splink Synthetic Data

This code generates realistic test datasets for record linkage, using the WikiData Query Service.

It has been used to test the performance and accuracy of various Splink models.

Interactive settings editor with autocomplete

We also provide an interactive splink settings editor and example settings here.

Starting parameter generation tools

A tool to generate custom m and u probabilities can be found here.
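
For illustration, starting m and u probabilities can also be written directly into a comparison column of the settings dictionary. The keys below follow our reading of the Splink 2.x settings schema, and the values are invented; probabilities are given per comparison level (here level 0 = disagree, level 1 = agree) and each list sums to 1:

```python
# A hedged sketch of a comparison column with explicit starting parameters
comparison_column = {
    "col_name": "dob",
    "num_levels": 2,
    "m_probabilities": [0.05, 0.95],  # P(level | records match)
    "u_probabilities": [0.90, 0.10],  # P(level | records do not match)
}
```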

Blog

You can read a short blog post about splink here.

Videos

You can find an introductory video showcasing Splink's features and running through a demo of its functionality here.

How to make changes to Splink

(Steps 5 onwards for repo admins only)

  1. Raise a new issue, or target an existing one
  2. Create a new branch (usually off master); external contributors should fork the repo
  3. Make changes, commit and push to GitHub
  4. Make a pull request, referencing the issue
  5. Wait for tests to pass
  6. Review the pull request
  7. Bump the Splink version in pyproject.toml and update CHANGELOG.md as part of the pull request
  8. Merge
  9. Create a tagged release on GitHub; this will trigger auto-publishing to PyPI

Acknowledgements

We are very grateful to ADR UK (Administrative Data Research UK) for providing funding for this work as part of the Data First project.

We are also very grateful to colleagues at the UK's Office for National Statistics for their expert advice and peer review of this work.
