RLTK: Record Linkage ToolKit
============================

.. begin-intro
.. image:: https://img.shields.io/badge/license-MIT-blue.svg
    :target: https://raw.githubusercontent.com/usc-isi-i2/rltk/master/LICENSE
    :alt: License

.. image:: https://api.travis-ci.org/usc-isi-i2/rltk.svg?branch=master
    :target: https://travis-ci.org/usc-isi-i2/rltk
    :alt: Travis

.. image:: https://badge.fury.io/py/rltk.svg
    :target: https://badge.fury.io/py/rltk
    :alt: pypi

.. image:: https://readthedocs.org/projects/rltk/badge/?version=latest
    :target: http://rltk.readthedocs.io/en/latest
    :alt: Documents

The Record Linkage ToolKit (RLTK) is a general-purpose, open-source record linkage platform for building Python programs that link records referring to the same underlying entity. Record linkage is an important problem in domains ranging from social networks to bibliographic data and biomedicine. Current open platforms for record linkage either scale poorly, even to moderately sized datasets, or are hard to use, even for experts. RLTK attempts to address both issues.

RLTK supports a full, scalable record linkage pipeline, including multi-core algorithms for blocking, profiling data, computing a wide variety of features, and training and applying machine learning classifiers based on Python's scikit-learn library. An end-to-end RLTK pipeline can be jump-started with only a few lines of code. RLTK is also designed to be extensible and customizable, giving users fine-grained control over many of the individual components. New features, such as a custom string similarity, are easy to add.
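
The stages named above (blocking, feature computation, classification) can be sketched in plain Python using only the standard library. This is an illustration of the general idea, not RLTK's API: the names ``block_key``, ``name_similarity`` and ``link``, the sample records, and the 0.8 threshold are all hypothetical, and a fixed threshold stands in for a trained classifier.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(record):
    # Hypothetical blocking key: first letter of the lower-cased name.
    return record["name"].strip().lower()[:1]


def name_similarity(a, b):
    # Stand-in feature: difflib's ratio in [0, 1].
    # Any custom string similarity could be swapped in here.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(left, right, threshold=0.8):
    # Blocking: index right-hand records by key so that only
    # records sharing a block are ever compared.
    blocks = defaultdict(list)
    for r in right:
        blocks[block_key(r)].append(r)

    matches = []
    for l in left:
        for r in blocks.get(block_key(l), []):
            score = name_similarity(l["name"], r["name"])
            # A fixed threshold stands in for a trained classifier.
            if score >= threshold:
                matches.append((l["id"], r["id"]))
    return matches


left = [{"id": "a1", "name": "Jonathan Smith"},
        {"id": "a2", "name": "Maria Garcia"}]
right = [{"id": "b1", "name": "Jonathon Smith"},
         {"id": "b2", "name": "Mario Rossi"}]

print(link(left, right))  # [('a1', 'b1')]
```

Blocking keeps the number of comparisons well below the full cross product, which is why it comes first in the pipeline; the similarity function is the piece a user would most often replace.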

RLTK is being built by the `Center on Knowledge Graphs <http://usc-isi-i2.github.io/>`_ at `USC/ISI <https://isi.edu/>`_, with funding from multiple projects under the DARPA LORELEI and MEMEX programs and the IARPA CAUSE program. RLTK is under active maintenance. We expect to keep adding new features and state-of-the-art record linkage algorithms, and to keep supporting adopters as they integrate the platform into their applications.

Getting Started
---------------

Installation (make sure prerequisites are installed)::

    pip install -U rltk

Example::

    >>> import rltk
    >>> rltk.levenshtein_distance('abc', 'abd')
    1
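
The result is 1 because a single substitution (``c`` to ``d``) turns one string into the other. For readers unfamiliar with the metric, here is a minimal standard-library sketch of the classic dynamic-programming definition of Levenshtein distance. It is not RLTK's implementation, and the name ``edit_distance`` is hypothetical.

```python
def edit_distance(s, t):
    # prev[j] is the distance between the prefix of s processed so far and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs from s
                curr[j - 1] + 1,     # insert ct into s
                prev[j - 1] + cost,  # substitute cs with ct (or match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("abc", "abd"))  # 1
```

Only two rows of the table are kept at a time, so memory grows with the length of the second string rather than with the product of both lengths.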

Try RLTK Online
---------------

* `Stable version <https://mybinder.org/v2/gh/usc-isi-i2/rltk/master>`_
* `Development version <https://mybinder.org/v2/gh/usc-isi-i2/rltk/dev>`_

.. end-intro

Datasets & Experiments
----------------------

* `rltk-experimentation <https://github.com/usc-isi-i2/rltk-experimentation>`_

Documentation
-------------

* `Tutorials <http://rltk.readthedocs.io>`_
* `API Reference <http://rltk.readthedocs.io/en/latest/modules.html>`_