478 open source projects that are alternatives or similar to splink

record-linkage-resources
Resources for tackling record linkage / deduplication / data matching problems
Stars: ✭ 67 (-62.98%)
entity-embed
PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
Stars: ✭ 96 (-46.96%)
zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
Stars: ✭ 655 (+261.88%)
Talisman
Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
Stars: ✭ 584 (+222.65%)
Mutual labels:  fuzzy-matching, deduplication
Data Matching Software
A list of free data matching and record linkage software.
Stars: ✭ 206 (+13.81%)
Mutual labels:  fuzzy-matching, deduplication
stance
Learned string similarity for entity names using optimal transport.
Stars: ✭ 27 (-85.08%)
Libpostal
A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
Stars: ✭ 3,312 (+1729.83%)
Mutual labels:  record-linkage, deduplication
snowman
Welcome to Snowman App – a Data Matching Benchmark Platform.
Stars: ✭ 25 (-86.19%)
Mutual labels:  entity-resolution, data-matching
Merge-Machine
Merge Dirty Data with Clean Reference Tables
Stars: ✭ 35 (-80.66%)
Dedupe
🆔 A Python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
Stars: ✭ 3,241 (+1690.61%)
Spark Lucenerdd
Spark RDD with Lucene's query and entity linkage capabilities
Stars: ✭ 114 (-37.02%)
Mutual labels:  spark, deduplication
fuzzy-match
Library and command line utility to do approximate string matching of a source against a bitext index and get matched source and target.
Stars: ✭ 31 (-82.87%)
Mutual labels:  fuzzy-matching
spaczz
Fuzzy matching and more functionality for spaCy.
Stars: ✭ 215 (+18.78%)
Mutual labels:  fuzzy-matching
levenshtein.c
Levenshtein algorithm in C
Stars: ✭ 77 (-57.46%)
Mutual labels:  fuzzy-matching
Spark Jobserver
REST job server for Apache Spark
Stars: ✭ 2,748 (+1418.23%)
Mutual labels:  spark
fuzzy-search
A collection of algorithms for fuzzy search like in Sublime Text.
Stars: ✭ 49 (-72.93%)
Mutual labels:  fuzzy-matching
cargo-limit
Cargo with less noise: warnings are skipped until errors are fixed, Neovim integration, etc.
Stars: ✭ 105 (-41.99%)
Mutual labels:  deduplication
Spark Fast Tests
Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)
Stars: ✭ 249 (+37.57%)
Mutual labels:  spark
Hyperspace
An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
Stars: ✭ 246 (+35.91%)
Mutual labels:  spark
stringdistance
A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more.
Stars: ✭ 60 (-66.85%)
Mutual labels:  fuzzy-matching
Dpark
Python clone of Spark, a MapReduce-like framework in Python
Stars: ✭ 2,668 (+1374.03%)
Mutual labels:  spark
Video Stream Analytics
Stars: ✭ 240 (+32.6%)
Mutual labels:  spark
visualize-data-with-python
A Jupyter notebook using some standard techniques for data science and data engineering to analyze data for the 2017 flooding in Houston, TX.
Stars: ✭ 60 (-66.85%)
Mutual labels:  spark
Hadoop Docker
Hadoop development and test environment built on Docker, including Hadoop, Hive, HBase, and Spark
Stars: ✭ 238 (+31.49%)
Mutual labels:  spark
zpaqfranz
Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
Stars: ✭ 86 (-52.49%)
Mutual labels:  deduplication
Azure Event Hubs
☁️ Cloud-scale telemetry ingestion from any stream of data with Azure Event Hubs
Stars: ✭ 233 (+28.73%)
Mutual labels:  spark
tsa4
R code for Time Series Analysis and Its Applications, Ed 4
Stars: ✭ 108 (-40.33%)
Mutual labels:  em-algorithm
fish-fzy
fzy integration with fish. Search history, navigate directories and more. Blazingly fast.
Stars: ✭ 18 (-90.06%)
Mutual labels:  fuzzy-matching
nomenklatura
Framework and command-line tools for integrating FollowTheMoney data streams from multiple sources
Stars: ✭ 158 (-12.71%)
Mutual labels:  deduplication
whatis
WhatIs.this: simple entity resolution through Wikipedia
Stars: ✭ 18 (-90.06%)
Mutual labels:  entity-resolution
Koalas
Koalas: pandas API on Apache Spark
Stars: ✭ 3,044 (+1581.77%)
Mutual labels:  spark
yadf
Yet Another Dupes Finder
Stars: ✭ 32 (-82.32%)
Mutual labels:  deduplication
Every Single Day I Tldr
A daily digest of the articles or videos I've found interesting, that I want to share with you.
Stars: ✭ 249 (+37.57%)
Mutual labels:  spark
spark-lucenerdd-examples
Examples of spark-lucenerdd
Stars: ✭ 15 (-91.71%)
Mutual labels:  record-linkage
Data Accelerator
Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.
Stars: ✭ 247 (+36.46%)
Mutual labels:  spark
ml
machine learning
Stars: ✭ 29 (-83.98%)
Mutual labels:  em-algorithm
Neo4j Spark Connector
Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs
Stars: ✭ 245 (+35.36%)
Mutual labels:  spark
Installations mac ubuntu windows
Installations for Data Science. Anaconda, RStudio, Spark, TensorFlow, AWS (Amazon Web Services).
Stars: ✭ 231 (+27.62%)
Mutual labels:  spark
Recommendationsystem
Book recommender system using collaborative filtering based on Spark
Stars: ✭ 244 (+34.81%)
Mutual labels:  spark
Yoyo-leaf
Yoyo-leaf is an awesome command-line fuzzy finder.
Stars: ✭ 49 (-72.93%)
Mutual labels:  fuzzy-matching
Spark.fish
▁▂▄▆▇█▇▆▄▂▁
Stars: ✭ 229 (+26.52%)
Mutual labels:  spark
Ruby Spark
Ruby wrapper for Apache Spark
Stars: ✭ 221 (+22.1%)
Mutual labels:  spark
Mastering Spark Sql Book
The Internals of Spark SQL
Stars: ✭ 234 (+29.28%)
Mutual labels:  spark
fuzzychinese
A small package to fuzzy match Chinese words
Stars: ✭ 50 (-72.38%)
Mutual labels:  fuzzy-matching
Mydatascienceportfolio
Applying Data Science and Machine Learning to Solve Real World Business Problems
Stars: ✭ 227 (+25.41%)
Mutual labels:  spark
fuzzy-matcher
Fuzzy Matching Library for Rust
Stars: ✭ 140 (-22.65%)
Mutual labels:  fuzzy-matching
Spark Workshop
Apache Spark™ and Scala Workshops
Stars: ✭ 224 (+23.76%)
Mutual labels:  spark
fuzzywuzzy
Fuzzy string matching for PHP
Stars: ✭ 60 (-66.85%)
Mutual labels:  fuzzy-matching
conciliator
OpenRefine reconciliation services for VIAF, ORCID, and Open Library + framework for creating more.
Stars: ✭ 95 (-47.51%)
Mutual labels:  entity-resolution
Sagemaker Spark
A Spark library for Amazon SageMaker.
Stars: ✭ 219 (+20.99%)
Mutual labels:  spark
IntraArchiveDeduplicator
Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation for fuzzy image searching.
Stars: ✭ 87 (-51.93%)
Mutual labels:  deduplication
mail-deduplicate
📧 CLI to deduplicate mails from mail boxes.
Stars: ✭ 134 (-25.97%)
Mutual labels:  deduplication
Spark Excel
A Spark plugin for reading Excel files via Apache POI
Stars: ✭ 216 (+19.34%)
Mutual labels:  spark
Gimel
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Stars: ✭ 216 (+19.34%)
Mutual labels:  spark
Neural-Scam-Artist
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
Stars: ✭ 18 (-90.06%)
Mutual labels:  deduplication
Sparkrdma
RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Stars: ✭ 215 (+18.78%)
Mutual labels:  spark
Hydro Serving
MLOps Platform
Stars: ✭ 213 (+17.68%)
Mutual labels:  spark
machine-learning
Python machine learning applications in image processing, recommender system, matrix completion, netflix problem and algorithm implementations including Co-clustering, Funk SVD, SVD++, Non-negative Matrix Factorization, Koren Neighborhood Model, Koren Integrated Model, Dawid-Skene, Platt-Burges, Expectation Maximization, Factor Analysis, ISTA, F…
Stars: ✭ 91 (-49.72%)
Mutual labels:  em-algorithm
deduplication
Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in JavaScript. Ships with extensive tests, a fuzz test and a benchmark.
Stars: ✭ 59 (-67.4%)
Mutual labels:  deduplication
Example Spark
Spark, Spark Streaming and Spark SQL unit testing strategies
Stars: ✭ 205 (+13.26%)
Mutual labels:  spark