
entrepreneur-interet-general / Merge-Machine

Licence: other
Merge Dirty Data with Clean Reference Tables

Programming Languages

python

Projects that are alternatives of or similar to Merge-Machine

Dedupe
🆔 A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
Stars: ✭ 3,241 (+9160%)
Mutual labels:  record-linkage, entity-resolution
record-linkage-resources
Resources for tackling record linkage / deduplication / data matching problems
Stars: ✭ 67 (+91.43%)
Mutual labels:  record-linkage, entity-resolution
entity-embed
PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
Stars: ✭ 96 (+174.29%)
Mutual labels:  record-linkage, entity-resolution
splink
Implementation of Fellegi-Sunter's canonical model of record linkage in Apache Spark, including EM algorithm to estimate parameters
Stars: ✭ 181 (+417.14%)
Mutual labels:  record-linkage, entity-resolution
stance
Learned string similarity for entity names using optimal transport.
Stars: ✭ 27 (-22.86%)
Mutual labels:  record-linkage, entity-resolution
OpenScraper
An open source webapp for scraping: towards a public service for webscraping
Stars: ✭ 80 (+128.57%)
Mutual labels:  entrepreneur-interet-general
snowman
Welcome to Snowman App – a Data Matching Benchmark Platform.
Stars: ✭ 25 (-28.57%)
Mutual labels:  entity-resolution
zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
Stars: ✭ 655 (+1771.43%)
Mutual labels:  entity-resolution
Libpostal
A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
Stars: ✭ 3,312 (+9362.86%)
Mutual labels:  record-linkage
spark-lucenerdd-examples
Examples of spark-lucenerdd
Stars: ✭ 15 (-57.14%)
Mutual labels:  record-linkage
conciliator
OpenRefine reconciliation services for VIAF, ORCID, and Open Library + framework for creating more.
Stars: ✭ 95 (+171.43%)
Mutual labels:  entity-resolution
whatis
WhatIs.this: simple entity resolution through Wikipedia
Stars: ✭ 18 (-48.57%)
Mutual labels:  entity-resolution
graph-explorer
Graph Explorer
Stars: ✭ 27 (-22.86%)
Mutual labels:  entrepreneur-interet-general

The Magical CSV Merge Machine

What does it do?

A Python 3 library to link a dirty CSV file with a clean reference table. It is meant to be as generic as possible and includes a labeller to learn optimal parameters for each matching scenario.

How to install?

Manual install

Non-Python Requirements

This library relies on Elasticsearch. We used version 5.6.7 for development and recommend Elasticsearch 5.X. Instructions here.

PIP3 install

pip3 install merge-machine

From source (recommended for the time being):

git clone https://github.com/entrepreneur-interet-general/Merge-Machine.git
cd Merge-Machine
pip3 install -e .

How to use?

General use example (install the package first...)

See examples/example.py.

Resource creation example (to use advanced analyzers)

See examples/gen_resource_example.py.

Guidelines

See HOW_TO.md.

How does it work?

The reference table is indexed in Elasticsearch with multiple analyzers (language-specific, integers, n_grams...). The labeller then proposes training samples from the source and tries to match each one to rows of the reference file. With each user confirmation (match / not a match), it updates its belief about which Elasticsearch queries perform best for matching. When labelling is over, the "best query" (a weighted combination of multiple ES queries with different analyzers on different fields) is run for each row of the source to try to find a match in the ES-indexed reference table.
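
The idea can be illustrated with a minimal sketch. It is not the library's actual implementation: the class and function names, the smoothed-precision scoring rule, the column and field names, and the convention of one sub-field per analyzer (e.g. official_name.n_grams) are all assumptions made for illustration. The sketch keeps a score per query template, updates it from the user's match / not-a-match answers, and combines the best templates into a weighted Elasticsearch bool query body. It only builds the query dictionary, so it runs without an Elasticsearch server.

```python
from collections import defaultdict


class QueryScore:
    """Running tally of how often one query template proposed a confirmed match."""

    def __init__(self):
        self.hits = 0
        self.trials = 0

    def update(self, is_match):
        """Record one user answer (True = match, False = not a match)."""
        self.trials += 1
        self.hits += int(is_match)

    @property
    def precision(self):
        # Laplace smoothing: an untested template scores 0.5 rather than 0 or 1.
        return (self.hits + 1) / (self.trials + 2)


def build_best_query(scores, row, top_k=2):
    """Combine the top_k templates into one weighted Elasticsearch bool query."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1].precision, reverse=True)
    should = []
    for (source_col, ref_field, analyzer), score in ranked[:top_k]:
        should.append({
            "match": {
                # Assumed naming: one sub-field per analyzer on each reference field.
                f"{ref_field}.{analyzer}": {
                    "query": row[source_col],
                    "boost": round(score.precision, 3),
                }
            }
        })
    return {"query": {"bool": {"should": should}}}


# Hypothetical templates: (source column, reference field, analyzer).
scores = defaultdict(QueryScore)
answers = [
    (("name", "official_name", "french"), True),
    (("name", "official_name", "french"), True),
    (("name", "official_name", "french"), False),
    (("name", "official_name", "n_grams"), True),
    (("name", "official_name", "n_grams"), True),
    (("city", "locality", "french"), False),
]
for template, is_match in answers:
    scores[template].update(is_match)

# Query body for one (made-up) source row, weighted by the learned scores.
best_query = build_best_query(scores, {"name": "univ paris sud", "city": "orsay"})
```

In this sketch the templates with the highest smoothed precision get the largest boost, so a reference row that satisfies the most reliable queries is ranked first. The real labeller may score and combine queries differently.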

How to contribute?

Feel free to report bugs via issues and make pull requests...

Credits

This library was developed by Léo Bouloc over 10 months in 2017 at the French Ministry of Research and Higher Education in the context of the "Entrepreneur d'Intérêt Général" program funded by the French Government.

See also

This library was developed as a component of a larger matching service:

  • ONLINE SERVICE COMING SOON !!!
  • code

Check out:

Other similar libraries include:

  • match_id (Identity record linking)
  • dedupe (Record linking and deduping)