J535D165 / Data Matching Software

A list of free data matching and record linkage software.

Data Matching software

This is a list of (Fuzzy) Data Matching software. The software in this list is open source and/or freely available.

The term data matching is used to indicate the procedure of bringing together information from two or more records that are believed to belong to the same entity. Data matching has two applications: (1) to match data across multiple datasets (linkage) and (2) to match data within a dataset (deduplication). See the Wikipedia page about data matching for more information.
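
The two applications differ mainly in which record pairs are compared. The sketch below illustrates this with plain name strings, using the standard library's difflib as a stand-in similarity measure. The function names, the sample names and the 0.85 threshold are illustrative assumptions and are not taken from any tool in this list.

```python
from difflib import SequenceMatcher
from itertools import combinations, product


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(left, right, threshold=0.85):
    """Linkage: compare every record in `left` with every record in `right`."""
    return [(a, b) for a, b in product(left, right)
            if similarity(a, b) >= threshold]


def find_duplicates(records, threshold=0.85):
    """Deduplication: compare every unordered pair within a single dataset."""
    return [(a, b) for a, b in combinations(records, 2)
            if similarity(a, b) >= threshold]


# Hypothetical sample data.
customers = ["Jonathan Smith", "Maria Garcia"]
patients = ["Jonathon Smith", "Wei Chen"]
mailing_list = ["Maria Garcia", "Maria  Garcia", "Wei Chen"]

linked = link(customers, patients)
duplicates = find_duplicates(mailing_list)
print(linked)
print(duplicates)
```

Comparing all pairs is only practical for small inputs. The tools listed below typically reduce the number of candidate pairs first and use richer comparison and classification methods.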

Similar terms: record linkage, data matching, deduplication, fuzzy matching, entity resolution

Overview

The table below gives a compact overview of the properties of data matching software. The properties evaluated are Application Programming Interface (API), Graphical User Interface (GUI), Linking, Deduplication, Supervised Learning, Unsupervised Learning and Active Learning.

Software | API | GUI | Link | Dedup | Supervised Learning | Unsupervised Learning | Active Learning
AtyImo | PySpark
Dedupe | Python
fastLink | R
FEBRL | Python
FRIL | Java
FuzzyMatcher | Python
JedAI | Java
PRIL | SQL
Python Record Linkage Toolkit | Python
RecordLinkage (R) | R
RELAIS
ReMaDDer
Splink | PySpark
The Link King

✅ Yes/Implemented ❌ No/Not implemented ❔ Unknown

Software

This section describes data matching software. The software is alphabetically ordered.

AtyImo

AtyImo implements a mixture of deterministic and probabilistic routines for data linkage. It was initially developed in 2013 as a linkage tool for a joint Brazil–U.K. project that aims to build a large population-based cohort with data from more than 100 million participants and to produce disease-specific data for epidemiological research. MIT Python Spark

Dedupe

Dedupe is a Python library for fuzzy matching, deduplication and entity resolution on structured data. The library uses active learning to match record pairs, which is useful when no training data is available. A companion command-line tool, csvdedupe, deduplicates CSV files. Dedupeio also offers commercial products for data matching. [source code] MIT Python PyPI
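
The idea behind active learning can be sketched as uncertainty sampling: ask a person to label the record pair the current model is least sure about, then refit. The example below is a simplified illustration and not Dedupe's implementation. It assumes each candidate pair has already been reduced to a single similarity score, uses a plain threshold as the "model", and simulates the human labeler with a hypothetical `ask` function.

```python
def fit_threshold(labeled):
    """Midpoint between the lowest-scoring match and the highest-scoring non-match.

    `labeled` is a list of (score, is_match) tuples and must contain at least
    one match and one non-match.
    """
    match_scores = [s for s, is_match in labeled if is_match]
    non_match_scores = [s for s, is_match in labeled if not is_match]
    return (min(match_scores) + max(non_match_scores)) / 2


def active_learn(pool, seed, ask, rounds):
    """Repeatedly query the unlabeled score closest to the current threshold."""
    labeled = list(seed)
    pool = list(pool)
    queried = []
    for _ in range(rounds):
        if not pool:
            break
        current = fit_threshold(labeled)
        # Uncertainty sampling: the pair nearest the decision boundary.
        query = min(pool, key=lambda s: abs(s - current))
        pool.remove(query)
        labeled.append((query, ask(query)))
        queried.append(query)
    return fit_threshold(labeled), queried


# Hypothetical similarity scores of unlabeled candidate pairs.
pool = [0.2, 0.35, 0.5, 0.62, 0.7, 0.9]
# Two labeled examples to start from: one clear non-match, one clear match.
seed = [(0.05, False), (0.97, True)]


def ask(score):
    """Simulated human labeler; a real one would inspect the two records."""
    return score >= 0.6


threshold, queried = active_learn(pool, seed, ask, rounds=3)
print(queried, threshold)
```

Only three labels are requested here, and all of them fall near the decision boundary, which is where a label carries the most information.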

fastLink

fastLink implements a Fellegi-Sunter probabilistic record linkage model that allows for missing data and the inclusion of auxiliary information. It can merge two datasets under the Fellegi-Sunter model, using the Expectation-Maximization algorithm to estimate the model parameters. fastLink offers a programming API in R. (Enamorado, Fifield & Imai, 2017) [source code] GPL-3.0 R CRAN
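
To illustrate the estimation step, the sketch below runs the Expectation-Maximization algorithm for a basic Fellegi-Sunter model with binary agreement vectors and conditionally independent fields. It is a minimal illustration, not fastLink's implementation: it ignores missing data and auxiliary information, and the function name, starting values and sample counts are assumptions made for the example.

```python
def clamp(x, eps=1e-6):
    """Keep a probability strictly inside (0, 1) to avoid division by zero."""
    return min(max(x, eps), 1 - eps)


def estimate_fellegi_sunter(vectors, n_iter=100):
    """Estimate match prevalence p and per-field m/u probabilities with EM.

    `vectors` holds one tuple of 0/1 agreement indicators per record pair.
    m[j] = P(field j agrees | match), u[j] = P(field j agrees | non-match).
    """
    n, k = len(vectors), len(vectors[0])
    p, m, u = 0.1, [0.9] * k, [0.1] * k  # assumed starting values
    posteriors = [0.0] * n
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        for i, g in enumerate(vectors):
            like_m, like_u = p, 1 - p
            for j in range(k):
                like_m *= m[j] if g[j] else 1 - m[j]
                like_u *= u[j] if g[j] else 1 - u[j]
            posteriors[i] = like_m / (like_m + like_u)
        # M-step: re-estimate parameters from the weighted agreement counts.
        total = sum(posteriors)
        p = clamp(total / n)
        m = [clamp(sum(w * g[j] for w, g in zip(posteriors, vectors)) / total)
             for j in range(k)]
        u = [clamp(sum((1 - w) * g[j] for w, g in zip(posteriors, vectors))
                   / (n - total))
             for j in range(k)]
    return p, m, u, posteriors


# Hypothetical agreement patterns on three fields (e.g. name, birth date, zip).
vectors = (
    [(1, 1, 1)] * 30 + [(1, 1, 0)] * 5 + [(1, 0, 1)] * 5 + [(0, 1, 1)] * 5
    + [(0, 0, 0)] * 100 + [(1, 0, 0)] * 15 + [(0, 1, 0)] * 15 + [(0, 0, 1)] * 15
)

p, m, u, posteriors = estimate_fellegi_sunter(vectors)
print("estimated share of matches:", round(p, 3))
print("m:", [round(x, 3) for x in m])
print("u:", [round(x, 3) for x in u])
```

No labeled pairs are needed: the parameters are estimated from the agreement patterns alone, which is what makes this approach unsupervised.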

FEBRL

Febrl (Freely Extensible Biomedical Record Linkage) is a training tool suitable for users to learn and experiment with record linkage techniques, as well as for practitioners to conduct linkages with data sets containing up to several hundred thousand records. Febrl implements a large number of algorithms and offers a Python programming interface as well as a simple GUI. It offers neither unsupervised nor active learning algorithms. The software is no longer actively maintained. (Christen, 2008) [source code] Python

FRIL

FRIL (Fine-grained Records Integration and Linkage tool) is a free tool that enables record linkage through a GUI. The tool implements automatic weight estimation through the EM algorithm and offers several techniques for making record pairs. FRIL was developed at Emory University and is no longer maintained. [source code] Java
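
One common technique for making record pairs is blocking: only records that share a key are paired, which avoids comparing every record with every other. The sketch below shows the idea on a single dataset. It is not FRIL's code, and the source does not say which techniques FRIL offers; the function name, the sample records and the choice of zip code as blocking key are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def blocked_pairs(records, key):
    """Return index pairs of records that share the same blocking key."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[key(record)].append(idx)
    pairs = []
    for ids in blocks.values():
        # Pair up records within a block only.
        pairs.extend(combinations(ids, 2))
    return pairs


# Hypothetical sample records.
people = [
    {"name": "Ann Lee", "zip": "30322"},
    {"name": "Anne Lee", "zip": "30322"},
    {"name": "Bob Ray", "zip": "10001"},
    {"name": "Rob Ray", "zip": "10001"},
    {"name": "Cy Poe", "zip": "94110"},
]

pairs = blocked_pairs(people, key=lambda r: r["zip"])
all_pairs = list(combinations(range(len(people)), 2))
print(len(pairs), "candidate pairs instead of", len(all_pairs))
```

The trade-off is that true matches whose blocking keys differ (for example, after a typo in the zip code) are never compared, so the key should be a field that is usually recorded reliably.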

FuzzyMatcher

A Python package that allows the user to fuzzy match two pandas dataframes based on one or more fields in common. The functionality is limited at the moment. [source code] MIT Python PyPI

JedAI

The Java gEneric DAta Integration (JedAI) Toolkit is an entity resolution tool developed by a group of universities. JedAI offers a graphical user interface. [source code] Apache License 2.0 Java

PRIL

PRIL (Point-of-contact Interactive Record Linkage) is a record linkage program with a GUI. PRIL can be used to link datasets about individuals. (Rentsch CT, Kabudula CW, Catlett J et al., 2017) [source code] MIT SQLPL

Python Record Linkage Toolkit

The Python Record Linkage Toolkit is a library to link records in or between data sources. The toolkit provides most of the tools needed for record linkage and deduplication. The package is developed for research and for linking small or medium-sized files. [source code] GPL-3.0 Python PyPI

RecordLinkage (R)

An R package that provides functions for linking and deduplicating data sets. Both supervised and unsupervised classification algorithms are available. Record pairs can be compared with a limited set of algorithms. The package is published on CRAN. GPL-3.0 R CRAN

RELAIS

RELAIS (REcord Linkage At IStat) is a toolkit providing a set of techniques for dealing with record linkage projects. IStat is the main producer of official statistics in Italy. EUPL-1.1 R/Java

ReMaDDer

ReMaDDer is free, unsupervised fuzzy data matching software with a GUI. ReMaDDer can perform fully automatic fuzzy record matching without human expert intervention, while attaining the accuracy of human clerical review. NOTE: The software is free, but it is not open source and requires an internet connection to work.

Splink

Splink is a Python/PySpark package that implements Fellegi-Sunter's canonical model of record linkage in Apache Spark. It uses the Expectation Maximisation algorithm to estimate the parameters of the model. It can link and deduplicate very large datasets (tens of millions of records) with runtimes of less than an hour. [source code] MIT Python Spark PyPI

The Link King

The Link King’s graphical user interface (GUI) makes record linkage and unduplication easy for beginning and advanced users. The software requires a SAS license. SAS

Outdated / no longer available

BigMatch (by the U.S. Census Bureau)

A record linkage tool, developed by the U.S. Census Bureau, for matching a very large file against a moderate-size file. Several papers about this program are available. (BigMatch, 2007)

Contributing

Do you know an open source and/or free data matching tool? Please open an issue or submit a pull request. The same holds for missing or incomplete information.

This project was initiated by the author of the Python Record Linkage Toolkit, @J535D165. The aim is to maintain a list and comparison of data matching software.

This list is licensed under CC-BY-SA 3.0.
