zinggAI / zingg

License: AGPL-3.0
Scalable identity resolution, entity resolution, data mastering and deduplication using ML

Programming Languages

java
68154 projects - #9 most used programming language
HTML
75241 projects
python
139335 projects - #7 most used programming language
scala
5932 projects
shell
77523 projects
Batchfile
5799 projects

Projects that are alternatives of or similar to zingg

splink
Implementation of Fellegi-Sunter's canonical model of record linkage in Apache Spark, including EM algorithm to estimate parameters
Stars: ✭ 181 (-72.37%)
Mutual labels:  entity-resolution, fuzzy-matching, deduplication
record-linkage-resources
Resources for tackling record linkage / deduplication / data matching problems
Stars: ✭ 67 (-89.77%)
Mutual labels:  entity-resolution, deduplication
naas
⚙️ Schedule notebooks, run them like APIs, expose securely your assets: Jupyter as a viable ⚡️ Production environment
Stars: ✭ 219 (-66.56%)
Mutual labels:  etl, data-transformation
yadf
Yet Another Dupes Finder
Stars: ✭ 32 (-95.11%)
Mutual labels:  dedupe, deduplication
entity-embed
PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
Stars: ✭ 96 (-85.34%)
Mutual labels:  entity-resolution, deduplication
Data Matching Software
A list of free data matching and record linkage software.
Stars: ✭ 206 (-68.55%)
Mutual labels:  fuzzy-matching, deduplication
Talisman
Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
Stars: ✭ 584 (-10.84%)
Mutual labels:  fuzzy-matching, deduplication
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (-94.05%)
Mutual labels:  etl, datalake
Restic
Fast, secure, efficient backup program
Stars: ✭ 15,105 (+2206.11%)
Mutual labels:  dedupe, deduplication
Dedupe
🆔 A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
Stars: ✭ 3,241 (+394.81%)
Mutual labels:  dedupe, entity-resolution
dduper
Fast block-level out-of-band BTRFS deduplication tool.
Stars: ✭ 108 (-83.51%)
Mutual labels:  dedupe, deduplication
gallia-core
A schema-aware Scala library for data transformation
Stars: ✭ 44 (-93.28%)
Mutual labels:  etl, data-transformation
DQCS
Data quality control system
Stars: ✭ 34 (-94.81%)
Mutual labels:  etl, dataquality
mail-deduplicate
📧 CLI to deduplicate mails from mail boxes.
Stars: ✭ 134 (-79.54%)
Mutual labels:  dedupe, deduplication
fuzzychinese
A small package to fuzzy match chinese words
Stars: ✭ 50 (-92.37%)
Mutual labels:  fuzzy-matching
django-data-migration
Data migration framework for Django that migrates legacy data into your new django app
Stars: ✭ 18 (-97.25%)
Mutual labels:  etl
FlutterIOT
Visit our website for more Mobile and Web applications
Stars: ✭ 66 (-89.92%)
Mutual labels:  ml
apiary-data-lake
Terraform scripts for deploying Apiary Data Lake
Stars: ✭ 15 (-97.71%)
Mutual labels:  datalake
neptune-client
📒 Experiment tracking tool and model registry
Stars: ✭ 348 (-46.87%)
Mutual labels:  ml
DeepBump
Normal & height maps generation from single pictures
Stars: ✭ 185 (-71.76%)
Mutual labels:  ml

The Problem

Real world data contains multiple records belonging to the same customer. These records can be in single or multiple systems and they have variations across fields, which makes it hard to combine them together, especially with growing data volumes. This hurts customer analytics - establishing lifetime value, loyalty programs, or marketing channels is impossible when the base data is not linked. No AI algorithm for segmentation can produce the right results when there are multiple copies of the same customer lurking in the data. No warehouse can live up to its promise if the dimension tables have duplicates.

With a modern data stack and DataOps, we have established patterns for the E and L in ELT for building data warehouses, datalakes and deltalakes. However, the T, getting data ready for analytics, still needs a lot of effort. Modern tools like dbt are actively and successfully addressing this. What is also needed is a quick and scalable way to build the single source of truth of core business entities after Extraction and before or after Loading.

With Zingg, the analytics engineer and the data scientist can quickly integrate data silos and build unified views at scale!

Besides probabilistic matching, also known as fuzzy matching, Zingg also does deterministic matching, which is useful in identity resolution and householding applications.
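
As a rough illustration of what deterministic matching means (a sketch, not Zingg's implementation), records are linked only when a chosen identifier agrees exactly after light normalization. The field names, records, and helper functions below are hypothetical.

```python
from collections import defaultdict

def normalize(value):
    """Lowercase and strip whitespace so trivially different keys compare equal."""
    return value.strip().lower()

def deterministic_match(records, key):
    """Group record ids whose normalized `key` field is identical."""
    groups = defaultdict(list)
    for rec in records:
        if rec.get(key):  # skip records with a missing or empty key
            groups[normalize(rec[key])].append(rec["id"])
    # Only keys shared by two or more records represent a match.
    return [ids for ids in groups.values() if len(ids) > 1]

# Hypothetical sample data: records 1 and 2 share an email apart from case and spacing.
records = [
    {"id": 1, "name": "Jane Doe", "email": "jane@example.com"},
    {"id": 2, "name": "J. Doe", "email": " Jane@Example.com "},
    {"id": 3, "name": "John Roe", "email": "john@example.com"},
]
matches = deterministic_match(records, "email")
```

Here `matches` groups records 1 and 2 together even though their names differ, which is the kind of rule that suits identity resolution and householding, where a trusted identifier exists.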

Why Zingg

Zingg is an ML based tool for entity resolution. The following features set Zingg apart from other tools and libraries:

  • Ability to handle any entity like customer, patient, supplier, product, etc.
  • Ability to connect to disparate data sources: local and cloud file systems in any format, enterprise applications, and relational, NoSQL and cloud databases and warehouses
  • Ability to scale to large volumes of data. See why this is important and Zingg performance numbers
  • Interactive training data builder using active learning that builds models to high accuracy from frugally small training samples. It shows record pairs on the CLI and asks the user to mark each as yes, no, or can't say.
  • Ability to define domain specific functions to improve matching
  • Out of the box support for English as well as Chinese, Thai, Japanese, Hindi and other languages

Zingg is useful for

  • Building unified and trusted views of customers and suppliers across multiple systems
  • Large Scale Entity Resolution for AML, KYC and other fraud and compliance scenarios
  • Deduplication and data quality
  • Identity Resolution
  • Integrating data silos during mergers and acquisitions
  • Data enrichment from external sources
  • Establishing customer households

Demo

See Zingg in action here

Getting Started

The easiest way to get started with Zingg is through Docker and by running the prebuilt models.

docker pull zingg/zingg:0.3.4
docker run -it zingg/zingg:0.3.4 bash
./scripts/zingg.sh --phase match --conf examples/febrl/config.json

Check the step by step guide for more details.

Connectors

Zingg connects to, reads from, and writes to most on-premise and cloud data sources. Zingg runs on any private or cloud-based Spark service.

Zingg can read and write to Snowflake, Cassandra, S3, Azure, Elastic, major RDBMS and any Spark supported data sources. Zingg also works with all major file formats including Parquet, Avro, JSON, XLSX, CSV & TSV. This is done through the Zingg pipe abstraction.

Key Zingg Concepts

Zingg learns 2 models on the data:

  1. Blocking Model

One fundamental problem with scaling data mastering is that the number of comparisons increases quadratically as the number of input records increases.

Zingg learns a clustering/blocking model which indexes near similar records. This means that Zingg does not compare every record with every other record. Typical Zingg comparisons are 0.05-1% of the possible problem space.
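
The effect of blocking on the number of comparisons can be sketched in a few lines. This is only an illustration under stated assumptions: Zingg learns its blocking model from data, whereas the sketch uses a fixed, hand-written key (the first letter of a hypothetical `surname` field), and all names below are made up for the example.

```python
from collections import defaultdict

def all_pairs_count(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def blocked_pairs(records, block_key):
    """Yield candidate pairs only from records sharing the same block key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    for members in blocks.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                yield members[i], members[j]

def surname_initial(rec):
    # Hypothetical fixed blocking key; Zingg learns its blocking model instead.
    return rec["surname"][:1].lower()

records = [
    {"id": 1, "surname": "Smith"},
    {"id": 2, "surname": "Smyth"},
    {"id": 3, "surname": "Jones"},
    {"id": 4, "surname": "Johns"},
    {"id": 5, "surname": "Brown"},
]
total = all_pairs_count(len(records))                     # every record against every other
candidates = list(blocked_pairs(records, surname_initial))  # only within-block pairs
```

With five records the exhaustive approach needs 10 comparisons while the blocked one needs 2. The gap widens quickly: `all_pairs_count(1_000_000)` is 499,999,500,000, which is why pruning the comparison space matters at scale.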

  2. Similarity Model

The similarity model helps Zingg predict which record pairs match. Similarity is run only on records within the same block/cluster to scale the problem to larger datasets. The similarity model is a classifier which predicts similarity between records that are not exactly the same, but could belong together.
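
To make the idea of a pairwise similarity classifier concrete, here is a minimal sketch. It is not Zingg's model: it swaps the trained classifier for a hand-weighted average of per-field string similarities from Python's standard `difflib`, and the fields, weights, and threshold are hypothetical values chosen for the example.

```python
from difflib import SequenceMatcher

def field_similarity(a, b):
    """String similarity in [0, 1] using difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_features(rec_a, rec_b, fields):
    """One similarity feature per compared field."""
    return [field_similarity(rec_a[f], rec_b[f]) for f in fields]

def match_score(features, weights):
    """Weighted average of features; stands in for a trained classifier's output."""
    return sum(w * x for w, x in zip(weights, features)) / sum(weights)

FIELDS = ["name", "city"]
WEIGHTS = [2.0, 1.0]  # hypothetical, hand-picked rather than learned
THRESHOLD = 0.8       # hypothetical decision boundary

a = {"name": "Jonathan Smith", "city": "Sydney"}
b = {"name": "Jonathon Smith", "city": "Sydney"}
c = {"name": "Maria Garcia", "city": "Perth"}

score_ab = match_score(pair_features(a, b, FIELDS), WEIGHTS)  # near-duplicate pair
score_ac = match_score(pair_features(a, c, FIELDS), WEIGHTS)  # unrelated pair
```

Records `a` and `b` differ by one character yet score above the threshold, while `a` and `c` fall below it. A learned model plays the same role but derives the weighting from labeled pairs rather than from hand-set constants.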

To build these models, training data is needed. Zingg comes with an interactive learner to rapidly build training sets.
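
One common way an interactive learner decides which pairs to show is uncertainty sampling: ask about the pairs the current model is least sure of. The source does not say which strategy Zingg uses, so the sketch below is an assumption for illustration only. The pair ids, scores, and helper names are hypothetical, and treating "cant say" answers as excluded from training is also an assumption.

```python
def most_uncertain(scored_pairs, k):
    """Return the k pairs whose match score is closest to 0.5."""
    return sorted(scored_pairs, key=lambda item: abs(item[1] - 0.5))[:k]

def record_label(training_set, pair_id, answer):
    """Store a user's answer; 'cant say' pairs are kept out of the training set."""
    labels = {"yes": 1, "no": 0}
    if answer in labels:
        training_set[pair_id] = labels[answer]
    return training_set

# Hypothetical (pair id, current model score) tuples.
scored_pairs = [("p1", 0.97), ("p2", 0.52), ("p3", 0.08), ("p4", 0.41)]
to_review = most_uncertain(scored_pairs, 2)

training_set = {}
record_label(training_set, "p2", "yes")       # user confirms a match
record_label(training_set, "p4", "cant say")  # undecided answer adds no label
```

Pairs `p2` and `p4` are selected because their scores sit nearest the decision boundary, so each answer is likely to be more informative than labeling an obvious match or non-match. This is how a small number of labels can go a long way.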

Pretrained models

Zingg comes with pretrained models for the Febrl dataset under the models folder.

The Story

What is the backstory behind Zingg?

Documentation

Check the detailed Zingg documentation

Community

Be part of the conversation in the Zingg Community Slack

Reporting bugs and contributing

Want to report a bug or request a feature? Let us know on Slack, or open an issue

Want to commit code? Please check the contributing documentation.

Book Office Hours

If you want to schedule a 30-min call with our team to help you understand if Zingg is the right technology for your problem, please book a slot here.

Asking questions on running Zingg

If you have a question or issue while using Zingg, kindly log a question and we will reply very fast :-) Or you can use Slack

License

Zingg is licensed under AGPL v3.0 - which means you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.

AGPL allows unrestricted use of Zingg by end users, solution builders and partners. We strongly encourage solution builders to create custom solutions for their clients using Zingg.

Need a different license? Write to us.

People behind Zingg

Zingg is being developed by the Zingg.AI team.

Acknowledgements

Zingg would not have been possible without the excellent work below:
