vintasoftware / entity-embed

Licence: MIT License
PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.

Projects that are alternatives of or similar to entity-embed

record-linkage-resources
Resources for tackling record linkage / deduplication / data matching problems
Stars: ✭ 67 (-30.21%)
Mutual labels:  record-linkage, entity-resolution, deduplication, data-matching
splink
Implementation of Fellegi-Sunter's canonical model of record linkage in Apache Spark, including EM algorithm to estimate parameters
Stars: ✭ 181 (+88.54%)
Mutual labels:  record-linkage, entity-resolution, deduplication, data-matching
stance
Learned string similarity for entity names using optimal transport.
Stars: ✭ 27 (-71.87%)
Mutual labels:  record-linkage, entity-resolution
Jodie
A PyTorch implementation of ACM SIGKDD 2019 paper "Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks"
Stars: ✭ 172 (+79.17%)
Mutual labels:  embeddings, representation-learning
Dedupe
🆔 A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
Stars: ✭ 3,241 (+3276.04%)
Mutual labels:  record-linkage, entity-resolution
Decagon
Graph convolutional neural network for multirelational link prediction
Stars: ✭ 268 (+179.17%)
Mutual labels:  embeddings, representation-learning
Graph 2d cnn
Code and data for the paper 'Classifying Graphs as Images with Convolutional Neural Networks' (new title: 'Graph Classification with 2D Convolutional Neural Networks')
Stars: ✭ 67 (-30.21%)
Mutual labels:  embeddings, representation-learning
TCE
This repository contains the code implementation used in the paper Temporally Coherent Embeddings for Self-Supervised Video Representation Learning (TCE).
Stars: ✭ 51 (-46.87%)
Mutual labels:  embeddings, representation-learning
zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
Stars: ✭ 655 (+582.29%)
Mutual labels:  entity-resolution, deduplication
image embeddings
Using efficientnet to provide embeddings for retrieval
Stars: ✭ 107 (+11.46%)
Mutual labels:  embeddings, representation-learning
Libpostal
A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
Stars: ✭ 3,312 (+3350%)
Mutual labels:  record-linkage, deduplication
snowman
Welcome to Snowman App – a Data Matching Benchmark Platform.
Stars: ✭ 25 (-73.96%)
Mutual labels:  entity-resolution, data-matching
graphml-tutorials
Tutorials for Machine Learning on Graphs
Stars: ✭ 125 (+30.21%)
Mutual labels:  embeddings, representation-learning
Merge-Machine
Merge Dirty Data with Clean Reference Tables
Stars: ✭ 35 (-63.54%)
Mutual labels:  record-linkage, entity-resolution
dduper
Fast block-level out-of-band BTRFS deduplication tool.
Stars: ✭ 108 (+12.5%)
Mutual labels:  deduplication
Recommender-Systems-with-Collaborative-Filtering-and-Deep-Learning-Techniques
Implemented User Based and Item based Recommendation System along with state of the art Deep Learning Techniques
Stars: ✭ 41 (-57.29%)
Mutual labels:  embeddings
gan tensorflow
Automatic feature engineering using Generative Adversarial Networks using TensorFlow.
Stars: ✭ 48 (-50%)
Mutual labels:  representation-learning
text2text
Text2Text: Cross-lingual natural language processing and generation toolkit
Stars: ✭ 188 (+95.83%)
Mutual labels:  embeddings
Text and Audio classification with Bert
Text Classification in Turkish Texts with Bert
Stars: ✭ 34 (-64.58%)
Mutual labels:  embeddings
spark-lucenerdd-examples
Examples of spark-lucenerdd
Stars: ✭ 15 (-84.37%)
Mutual labels:  record-linkage

Entity Embed

Entity Embed allows you to transform entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.

Using Entity Embed, you can train a deep learning model to transform records into vectors in an N-dimensional embedding space. Thanks to a contrastive loss, those vectors are organized so that similar records stay close together and dissimilar records stay far apart in this embedding space. Embedding records enables scalable Approximate Nearest Neighbors (ANN) search, which means finding thousands of candidate duplicate pairs of records per second per CPU.
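To make the idea of "close vectors become candidate pairs" concrete, here is a minimal sketch in plain Python. It is not Entity Embed's API: the 2-D vectors are hand-written stand-ins for learned embeddings, the names `cosine_similarity`, `candidate_pairs`, `k` and `sim_threshold` are hypothetical, and an exact brute-force neighbor search is used in place of a real ANN index.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidate_pairs(vectors, k=1, sim_threshold=0.9):
    """Pair each record with its top-k most similar records.

    Exact brute-force search (quadratic in the number of records);
    an ANN index would replace this loop in a scalable setting.
    """
    pairs = set()
    for id_a, vec_a in vectors.items():
        scored = sorted(
            (
                (cosine_similarity(vec_a, vec_b), id_b)
                for id_b, vec_b in vectors.items()
                if id_b != id_a
            ),
            reverse=True,
        )
        for sim, id_b in scored[:k]:
            if sim >= sim_threshold:
                # Sort the ids so (a, b) and (b, a) count as one pair.
                pairs.add(tuple(sorted((id_a, id_b))))
    return pairs


# Hypothetical embeddings: a trained model would produce these vectors.
embeddings = {
    "acme_inc": [0.98, 0.20],
    "acme_incorporated": [0.97, 0.24],
    "globex": [0.10, 0.99],
    "globex_corp": [0.14, 0.98],
    "initech": [-0.70, 0.71],
}

print(sorted(candidate_pairs(embeddings)))
```

With these toy vectors, the two near-duplicate names pair up with each other, while `initech` has no neighbor above the threshold and yields no candidate pair.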

Entity Embed achieves Recall of ~0.99 with Pair-Entity ratio below 100 on a variety of datasets. Entity Embed aims for high recall at the expense of precision. Therefore, this library is suited for the Blocking/Indexing stage of an Entity Resolution pipeline. A scalable and noise-tolerant Blocking procedure is often the main bottleneck for performance and quality in Entity Resolution pipelines, so this library aims to solve that. Note that the ANN search on embedded records returns several candidate pairs that must be filtered to find the best matching pairs, possibly with a pairwise classifier (an example for that is available).
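The metrics mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the library's evaluation code: the function name `evaluate_blocking` is hypothetical, and the Pair-Entity ratio is assumed here to mean the number of candidate pairs divided by the number of records.

```python
def evaluate_blocking(candidate_pairs, true_pairs, record_count):
    """Compute recall, precision and pair-entity ratio for a blocking step.

    Pairs are treated as unordered, so (1, 2) and (2, 1) are the same pair.
    """
    candidates = {frozenset(pair) for pair in candidate_pairs}
    truth = {frozenset(pair) for pair in true_pairs}
    found = candidates & truth
    return {
        # Share of true duplicate pairs that survived blocking.
        "recall": len(found) / len(truth) if truth else 0.0,
        # Share of candidate pairs that are true duplicates.
        "precision": len(found) / len(candidates) if candidates else 0.0,
        # Assumed definition: candidate pairs per record.
        "pair_entity_ratio": len(candidates) / record_count,
    }


# Toy data: 6 records, 3 true duplicate pairs, 4 candidate pairs.
metrics = evaluate_blocking(
    candidate_pairs=[(1, 2), (3, 4), (1, 3), (2, 4)],
    true_pairs=[(1, 2), (3, 4), (5, 6)],
    record_count=6,
)
print(metrics)
```

A low Pair-Entity ratio means the later pairwise filtering step has fewer candidates to inspect, which is why it is reported alongside recall.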

Entity Embed is based on and is a special case of the AutoBlock model described by Amazon.

⚠️ Warning: this project is under heavy development.

Embedding Space Example

Documentation

https://entity-embed.readthedocs.io

Requirements

System

  • MacOS or Linux (tested on latest MacOS and Ubuntu via GitHub Actions).
  • Entity Embed can train and run on a powerful laptop. Tested on a system with 32 GB of RAM, an RTX 2070 Mobile (8 GB VRAM), and an i7-10750H (12 threads). With batch sizes smaller than 32 and few field types, it's possible to train and run even with 2 GB of VRAM.

Libraries

See requirements.txt for the full list of library dependencies.

Installation

pip install entity-embed

For Conda users

If you're using Conda, you must install PyTorch beforehand to have proper CUDA support. Inside the Conda environment, please run the following command before installing Entity Embed using pip:

conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge
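After installing, a quick way to confirm that PyTorch can see a CUDA device is a short check like the one below. This is a sketch that is not part of the project; the helper name `describe_torch_cuda` is hypothetical, and it relies only on standard PyTorch attributes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available`, `torch.cuda.get_device_name`).

```python
def describe_torch_cuda():
    """Return a one-line summary of the local PyTorch/CUDA setup."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"

    if torch.cuda.is_available():
        return (
            f"PyTorch {torch.__version__} built with CUDA {torch.version.cuda}, "
            f"device 0: {torch.cuda.get_device_name(0)}"
        )
    return f"PyTorch {torch.__version__} without a usable CUDA device"


print(describe_torch_cuda())
```

If the summary reports no usable CUDA device inside a Conda environment, the PyTorch install step above is the first thing to re-check.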

Examples

Run:

pip install -r requirements-examples.txt

Then check the example Jupyter Notebooks:

Colab

Please check notebooks/google-colab/.

Releases

See CHANGELOG.md.

Credits

This project is maintained by open-source contributors and Vinta Software.

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

Commercial Support

Vinta Software is always looking for exciting work, so if you need any commercial support, feel free to get in touch: [email protected]

References

  • Zhang, W., Wei, H., Sisman, B., Dong, X. L., Faloutsos, C., & Page, D. (2020, January). AutoBlock: A hands-off blocking framework for entity matching. In Proceedings of the 13th International Conference on Web Search and Data Mining (pp. 744-752).
  • Dai, X., Yan, X., Zhou, K., Wang, Y., Yang, H., & Cheng, J. (2020, July). Convolutional Embedding for Edit Distance. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 599-608).

Citations

If you use Entity Embed in your research, please consider citing it.

BibTeX entry:

@software{entity-embed,
  title = {{Entity Embed}: Scalable Entity Resolution using Approximate Nearest Neighbors.},
  author = {Juvenal, Flávio and Vieira, Renato},
  url = {https://github.com/vintasoftware/entity-embed},
  version = {0.0.6},
  date = {2021-07-16},
  year = {2021}
}