record-linkage-resources: Resources for tackling record linkage / deduplication / data matching problems
Stars: ✭ 67 (-62.98%)
entity-embed: PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
Stars: ✭ 96 (-46.96%)
zingg: Scalable identity resolution, entity resolution, data mastering and deduplication using ML
Stars: ✭ 655 (+261.88%)
Talisman: Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
Stars: ✭ 584 (+222.65%)
stance: Learned string similarity for entity names using optimal transport.
Stars: ✭ 27 (-85.08%)
Libpostal: A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
Stars: ✭ 3,312 (+1729.83%)
snowman: Welcome to Snowman App – a Data Matching Benchmark Platform.
Stars: ✭ 25 (-86.19%)
Merge-Machine: Merge Dirty Data with Clean Reference Tables
Stars: ✭ 35 (-80.66%)
Dedupe: 🆔 A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
Stars: ✭ 3,241 (+1690.61%)
Spark Lucenerdd: Spark RDD with Lucene's query and entity linkage capabilities
Stars: ✭ 114 (-37.02%)
fuzzy-match: Library and command line utility to do approximate string matching of a source against a bitext index and get matched source and target.
Stars: ✭ 31 (-82.87%)
spaczz: Fuzzy matching and more functionality for spaCy.
Stars: ✭ 215 (+18.78%)
fuzzy-search: A collection of algorithms for fuzzy search like in Sublime Text.
Stars: ✭ 49 (-72.93%)
cargo-limit: Cargo with less noise: warnings are skipped until errors are fixed, Neovim integration, etc.
Stars: ✭ 105 (-41.99%)
Spark Fast Tests: Apache Spark testing helpers (dependency-free; works with Scalatest, uTest, and MUnit)
Stars: ✭ 249 (+37.57%)
Hyperspace: An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
Stars: ✭ 246 (+35.91%)
stringdistance: A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more.
Stars: ✭ 60 (-66.85%)
Dpark: Python clone of Spark, a MapReduce-like framework in Python
Stars: ✭ 2,668 (+1374.03%)
visualize-data-with-python: A Jupyter notebook using some standard techniques for data science and data engineering to analyze data for the 2017 flooding in Houston, TX.
Stars: ✭ 60 (-66.85%)
Hadoop Docker: A Docker-based Hadoop development and test environment, including Hadoop, Hive, HBase, and Spark
Stars: ✭ 238 (+31.49%)
zpaqfranz: Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
Stars: ✭ 86 (-52.49%)
Azure Event Hubs: ☁️ Cloud-scale telemetry ingestion from any stream of data with Azure Event Hubs
Stars: ✭ 233 (+28.73%)
tsa4: R code for Time Series Analysis and Its Applications, Ed 4
Stars: ✭ 108 (-40.33%)
fish-fzy: fzy integration with fish. Search history, navigate directories and more. Blazingly fast.
Stars: ✭ 18 (-90.06%)
nomenklatura: Framework and command-line tools for integrating FollowTheMoney data streams from multiple sources
Stars: ✭ 158 (-12.71%)
whatis: WhatIs.this: simple entity resolution through Wikipedia
Stars: ✭ 18 (-90.06%)
Koalas: pandas API on Apache Spark
Stars: ✭ 3,044 (+1581.77%)
yadf: Yet Another Dupes Finder
Stars: ✭ 32 (-82.32%)
Every Single Day I Tldr: A daily digest of the articles or videos I've found interesting and want to share with you.
Stars: ✭ 249 (+37.57%)
Data Accelerator: Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy-to-use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.
Stars: ✭ 247 (+36.46%)
ml: machine learning
Stars: ✭ 29 (-83.98%)
Neo4j Spark Connector: Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs
Stars: ✭ 245 (+35.36%)
Recommendationsystem: Book recommender system using collaborative filtering based on Spark
Stars: ✭ 244 (+34.81%)
Yoyo-leaf: Yoyo-leaf is an awesome command-line fuzzy finder.
Stars: ✭ 49 (-72.93%)
Ruby Spark: Ruby wrapper for Apache Spark
Stars: ✭ 221 (+22.1%)
fuzzychinese: A small package to fuzzy match Chinese words
Stars: ✭ 50 (-72.38%)
Mydatascienceportfolio: Applying Data Science and Machine Learning to Solve Real World Business Problems
Stars: ✭ 227 (+25.41%)
fuzzy-matcher: Fuzzy Matching Library for Rust
Stars: ✭ 140 (-22.65%)
Spark Workshop: Apache Spark™ and Scala Workshops
Stars: ✭ 224 (+23.76%)
fuzzywuzzy: Fuzzy string matching for PHP
Stars: ✭ 60 (-66.85%)
conciliator: OpenRefine reconciliation services for VIAF, ORCID, and Open Library, plus a framework for creating more.
Stars: ✭ 95 (-47.51%)
Sagemaker Spark: A Spark library for Amazon SageMaker.
Stars: ✭ 219 (+20.99%)
IntraArchiveDeduplicator: Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation for fuzzy image searching.
Stars: ✭ 87 (-51.93%)
mail-deduplicate: 📧 CLI to deduplicate mails from mailboxes.
Stars: ✭ 134 (-25.97%)
Spark Excel: A Spark plugin for reading Excel files via Apache POI
Stars: ✭ 216 (+19.34%)
Gimel: Big Data Processing Framework - Unified Data API or SQL on Any Storage
Stars: ✭ 216 (+19.34%)
Neural-Scam-Artist: Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
Stars: ✭ 18 (-90.06%)
Sparkrdma: RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Stars: ✭ 215 (+18.78%)
machine-learning: Python machine learning applications in image processing, recommender systems, matrix completion, the Netflix problem, and algorithm implementations including Co-clustering, Funk SVD, SVD++, Non-negative Matrix Factorization, Koren Neighborhood Model, Koren Integrated Model, Dawid-Skene, Platt-Burges, Expectation Maximization, Factor Analysis, ISTA, F…
Stars: ✭ 91 (-49.72%)
deduplication: Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in JavaScript. Ships with extensive tests, a fuzz test and a benchmark.
Stars: ✭ 59 (-67.4%)
Example Spark: Spark, Spark Streaming and Spark SQL unit testing strategies
Stars: ✭ 205 (+13.26%)