ropeladder / record-linkage-resources

Licence: other
Resources for tackling record linkage / deduplication / data matching problems

Projects that are alternatives to or similar to record-linkage-resources

splink
Implementation of Fellegi-Sunter's canonical model of record linkage in Apache Spark, including EM algorithm to estimate parameters
Stars: ✭ 181 (+170.15%)
Mutual labels:  record-linkage, entity-resolution, deduplication, data-matching
entity-embed
PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
Stars: ✭ 96 (+43.28%)
Mutual labels:  record-linkage, entity-resolution, deduplication, data-matching
Dedupe
🆔 A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
Stars: ✭ 3,241 (+4737.31%)
Mutual labels:  record-linkage, entity-resolution
snowman
Welcome to Snowman App – a Data Matching Benchmark Platform.
Stars: ✭ 25 (-62.69%)
Mutual labels:  entity-resolution, data-matching
Libpostal
A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
Stars: ✭ 3,312 (+4843.28%)
Mutual labels:  record-linkage, deduplication
stance
Learned string similarity for entity names using optimal transport.
Stars: ✭ 27 (-59.7%)
Mutual labels:  record-linkage, entity-resolution
Merge-Machine
Merge Dirty Data with Clean Reference Tables
Stars: ✭ 35 (-47.76%)
Mutual labels:  record-linkage, entity-resolution
zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
Stars: ✭ 655 (+877.61%)
Mutual labels:  entity-resolution, deduplication
mail-deduplicate
📧 CLI to deduplicate mails from mail boxes.
Stars: ✭ 134 (+100%)
Mutual labels:  deduplication
whatis
WhatIs.this: simple entity resolution through Wikipedia
Stars: ✭ 18 (-73.13%)
Mutual labels:  entity-resolution
deduplication
Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in Javascript. Ships with extensive tests, a fuzz test and a benchmark.
Stars: ✭ 59 (-11.94%)
Mutual labels:  deduplication
zpaqfranz
Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
Stars: ✭ 86 (+28.36%)
Mutual labels:  deduplication
acid-store
A library for secure, deduplicated, transactional, and verifiable data storage
Stars: ✭ 48 (-28.36%)
Mutual labels:  deduplication
Neural-Scam-Artist
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
Stars: ✭ 18 (-73.13%)
Mutual labels:  deduplication
dduper
Fast block-level out-of-band BTRFS deduplication tool.
Stars: ✭ 108 (+61.19%)
Mutual labels:  deduplication
spark-lucenerdd-examples
Examples of spark-lucenerdd
Stars: ✭ 15 (-77.61%)
Mutual labels:  record-linkage
Frost
A backup program that does deduplication, compression, encryption
Stars: ✭ 25 (-62.69%)
Mutual labels:  deduplication
dedupsqlfs
Deduplicating filesystem via Python3, FUSE and SQLite
Stars: ✭ 24 (-64.18%)
Mutual labels:  deduplication
IntraArchiveDeduplicator
Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation for fuzzy image searching.
Stars: ✭ 87 (+29.85%)
Mutual labels:  deduplication
yadf
Yet Another Dupes Finder
Stars: ✭ 32 (-52.24%)
Mutual labels:  deduplication

Record Linkage Resources

Resources for tackling record linkage (also known as deduplication, data matching, entity resolution)

Note: If you're looking for file deduplication software, you're in the wrong place! This page focuses on deduplicating datasets.

Also note: this page is not about deduplication software used in backup and storage either.

Record linkage attempts to identify duplicate records in messy data. It is a thorny problem that crops up wherever data describes real-world entities (most often people): in census and statistical bureaus, medical organizations, the social sciences, and of course commercial business.

For example, are these records the same person? Record linkage is how you make the computer decide, and decide quickly.

Name               Address            Phone
Bill Smith         123 N. Main St.    555-1235
Smith, William K.  123 Main           -
W. K. Smith        North Main Street  222-555-1234
Bill Schmidt       1230 Main St.      542-1235
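
To make the question concrete, here is a minimal sketch of one naive way to score the four records above against each other, using only the Python standard library. It is an illustration, not how any tool on this list works: the field weights, the normalization rules, and the helper names (`normalize`, `similarity`, `record_score`, `score_pairs`) are all assumptions chosen for this example, and the missing phone number is represented as an empty string.

```python
from difflib import SequenceMatcher
from itertools import combinations

# The example records from the table above; "" stands for a missing value.
records = [
    {"name": "Bill Smith", "address": "123 N. Main St.", "phone": "555-1235"},
    {"name": "Smith, William K.", "address": "123 Main", "phone": ""},
    {"name": "W. K. Smith", "address": "North Main Street", "phone": "222-555-1234"},
    {"name": "Bill Schmidt", "address": "1230 Main St.", "phone": "542-1235"},
]

# Hypothetical field weights, picked only for illustration.
WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}


def normalize(text):
    """Lowercase, replace punctuation with spaces, and sort the tokens
    so that "Smith, William" and "William Smith" compare as equal."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(sorted(cleaned.split()))


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def record_score(r1, r2, weights=WEIGHTS):
    """Weighted average of per-field similarities.
    Fields missing in either record are skipped rather than penalized."""
    total = 0.0
    used = 0.0
    for field, weight in weights.items():
        if r1[field] and r2[field]:
            total += weight * similarity(r1[field], r2[field])
            used += weight
    return total / used if used else 0.0


def score_pairs(rows):
    """Score every pair of records, highest score first."""
    scored = [
        (record_score(a, b), i, j)
        for (i, a), (j, b) in combinations(enumerate(rows), 2)
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    for score, i, j in score_pairs(records):
        print(f"{score:.2f}  {records[i]['name']!r} vs {records[j]['name']!r}")
```

A sketch like this also shows why the problem is thorny: plain string similarity knows nothing about nicknames (Bill vs. William) or abbreviations (N. vs. North), and comparing every pair grows quadratically with the number of records. Dedicated tools typically address both issues, for instance with better comparison functions and with blocking to limit which pairs are compared.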

Background

Documents

Talks

Books

Free software

(last updated, stars)

Python

Java

R

Spark

Other

Commercial software and solutions

For SAS

Data Cleaning

Name Parsers

Python

JavaScript

Papers

Organizations

Misc

To Do

  • list compatible data sources for software (CSV, SQL DB, JSON, data frame, etc...)
  • GUI or not?
  • list algorithms and techniques for software (deterministic, probabilistic, graph, etc...)

Suggestions / contributions welcome! I am not an expert on record linkage; this is simply a list of things I've found while working on a difficult deduplication problem for Thicket.io.