grouparoo / sync-engine-example

Licence: other
Synchronization Algorithm Exploration: Techniques to synchronize a SQL database with external destinations.

Sync Engine Example

This repo implements a few algorithms for synchronizing changes in a SQL database table to an external destination, as described in this blog post.

This is useful when you want to monitor a table such as users for changes and react as they happen, for example by updating the corresponding records in your data warehouse or Mailchimp.

If you would rather not worry about these details and want these use cases handled in a more fully featured way, check out Grouparoo.

Run it

You can run all of the tests:

$ npm install
$ npm run all

Test Suites: 2 failed, 3 passed, 5 total
Tests:       4 failed, 36 passed, 40 total
Snapshots:   0 total
Time:        2.238 s

Or run just one algorithm's tests:

$ npm install
$ npm run dbtime

Test Suites: 1 passed, 1 total
Tests:       8 passed, 8 total
Snapshots:   0 total
Time:        0.818 s, estimated 1 s

Some failures are expected, because some of the algorithms are not complete enough to pass every test.

Algorithms

All of the current approaches perform delta-based synchronization keyed on the updatedAt timestamp in the table.

  • simple: A naive current-time-based approach with a few failures.
  • dbtime: Upgrades simple to use the database times, removing race conditions. Might use too much memory.
  • batch: Adds batching to save on memory, but introduces failures because of race conditions with offsets.
  • steps: A hybrid of batch (most of the time) and dbtime (when there are many rows with the same timestamp).
  • secondary: Adds knowledge of an auto-incrementing ascending column (id) to batch, avoiding both the offset and memory issues.
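
To make the last entry more concrete, here is a minimal sketch of delta sync with a secondary column. It is not the repo's code: the in-memory rows array stands in for the SQL table, and fetchBatch, syncOnce, and the cursor shape are hypothetical names chosen for illustration. The idea is to order by (updatedAt, id) and resume strictly after the last pair processed, so no OFFSET is needed and only one batch is held at a time.

```javascript
// In-memory stand-in for a SQL table. A real version might run a query like:
//   SELECT * FROM users
//   WHERE updatedAt > :t OR (updatedAt = :t AND id > :id)
//   ORDER BY updatedAt, id LIMIT :batchSize
const rows = [
  { id: 1, updatedAt: 10 },
  { id: 2, updatedAt: 10 },
  { id: 3, updatedAt: 10 },
  { id: 4, updatedAt: 20 },
];

// Hypothetical helper: return up to batchSize rows strictly after the cursor,
// ordered by (updatedAt, id).
function fetchBatch(table, cursor, batchSize) {
  return table
    .filter(
      (r) =>
        r.updatedAt > cursor.updatedAt ||
        (r.updatedAt === cursor.updatedAt && r.id > cursor.id)
    )
    .sort((a, b) => a.updatedAt - b.updatedAt || a.id - b.id)
    .slice(0, batchSize);
}

// Hypothetical driver: process batches until none are left, then return the
// processed ids and the cursor to persist for the next run.
function syncOnce(table, cursor, batchSize) {
  const processed = [];
  let batch = fetchBatch(table, cursor, batchSize);
  while (batch.length > 0) {
    for (const row of batch) processed.push(row.id);
    const last = batch[batch.length - 1];
    cursor = { updatedAt: last.updatedAt, id: last.id };
    batch = fetchBatch(table, cursor, batchSize);
  }
  return { processed, cursor };
}

const first = syncOnce(rows, { updatedAt: 0, id: 0 }, 2);
console.log(first.processed); // [ 1, 2, 3, 4 ]
```

Three rows share the timestamp 10 while the batch size is only 2, yet nothing is skipped because the id breaks the tie. The sketch ignores other concerns, such as a row being updated again within a timestamp the cursor has already passed.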

Contributing

Can you think of a test that should pass but makes some of these algorithms fail? That would be great!

The same tests are shared between all the algorithms. Feel free to add a new one.

The current suite is a pretty good set of examples. You can use these methods:

  • create: Makes a new row given its id (the primary key). The id has to be ascending within the current suite.
  • update: Updates a row given the id value.
  • stepTime: There is a global clock and this moves it forward. You can't go backwards!
  • expectSync: Runs the algorithm. Fails the test if the given array of rows is not processed as expected.
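
A scenario written with these four methods might look roughly like the sketch below. The harness here is a self-contained, hypothetical stand-in and not the repo's shared test code: makeHarness, naiveDelta, and the argument shapes (ids passed in, an array of ids expected back) are all assumptions made so the example can run on its own.

```javascript
// Hypothetical stand-in for the shared test helpers.
// The real helpers in the repo may have different signatures.
function makeHarness(algorithm) {
  let now = 1; // global clock; it only moves forward
  let lastId = 0;
  const table = new Map(); // id -> { id, updatedAt }
  const state = {}; // scratch space the algorithm persists between runs

  return {
    create(id) {
      if (id <= lastId) throw new Error("ids must be ascending");
      lastId = id;
      table.set(id, { id, updatedAt: now });
    },
    update(id) {
      if (!table.has(id)) throw new Error(`no row with id ${id}`);
      table.get(id).updatedAt = now;
    },
    stepTime(amount = 1) {
      if (amount <= 0) throw new Error("time cannot go backwards");
      now += amount;
    },
    expectSync(expectedIds) {
      const actual = algorithm([...table.values()], state, now);
      if (JSON.stringify(actual) !== JSON.stringify(expectedIds)) {
        throw new Error(
          `expected ${JSON.stringify(expectedIds)}, got ${JSON.stringify(actual)}`
        );
      }
    },
  };
}

// Toy algorithm for the demo: return ids of rows changed since the last run.
function naiveDelta(rows, state, now) {
  const since = state.since ?? 0;
  state.since = now;
  return rows.filter((r) => r.updatedAt > since).map((r) => r.id);
}

// Scenario: two rows are created, synced, then one of them is updated.
const h = makeHarness(naiveDelta);
h.create(1);
h.create(2);
h.stepTime();
h.expectSync([1, 2]); // both new rows are picked up
h.stepTime();
h.update(1);
h.stepTime();
h.expectSync([1]); // only the updated row is picked up
h.expectSync([]); // nothing changed since the last sync
console.log("scenario passed");
```

A failing test for a new edge case would follow the same pattern: arrange rows and clock steps so that the algorithm under test misses or repeats a row, then state the expected result in expectSync.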

Feel free to write a new algorithm, too. In general, I wrote a failing test for the current algorithm and then a new algorithm that would fix it.

Other things that are useful to know for edge cases:

  • There is a batchSize that the algorithms use, set to 5 here. Use this in your algorithm.
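
One edge case that batchSize creates is a group of rows sharing the same updatedAt that is larger than a single batch. The sketch below is a deliberately flawed illustration, not one of the repo's algorithms: fetchNewerThan and syncByTimestampOnly are hypothetical names, and the table is an in-memory array. It shows how a pager that tracks only a timestamp high-water mark can drop rows in that situation.

```javascript
const batchSize = 5; // same value the README mentions

// Seven rows that all share one timestamp: more than one batch's worth.
const table = Array.from({ length: 7 }, (_, i) => ({
  id: i + 1,
  updatedAt: 100,
}));

// Hypothetical timestamp-only pager: rows strictly newer than `since`,
// oldest first, capped at `limit`.
function fetchNewerThan(rows, since, limit) {
  return rows
    .filter((r) => r.updatedAt > since)
    .sort((a, b) => a.updatedAt - b.updatedAt || a.id - b.id)
    .slice(0, limit);
}

// Keeps paging until a batch comes back empty, advancing the high-water
// mark to the newest timestamp seen in each batch.
function syncByTimestampOnly(rows, limit) {
  const seen = [];
  let since = 0;
  let batch = fetchNewerThan(rows, since, limit);
  while (batch.length > 0) {
    seen.push(...batch.map((r) => r.id));
    since = batch[batch.length - 1].updatedAt;
    batch = fetchNewerThan(rows, since, limit);
  }
  return seen;
}

const seen = syncByTimestampOnly(table, batchSize);
console.log(seen); // [ 1, 2, 3, 4, 5 ]
console.log(table.length - seen.length); // 2 rows were never processed
```

Switching the comparison to "greater than or equal" would not help on its own, since the same first five rows would be returned again on every pass. A test with more than batchSize rows at one clock value is a quick way to check whether a new algorithm handles this.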

Pictures

I started making some pictures for the blog post.

Simple

[Image: Simple algorithm]

Batch

[Image: Batch algorithm]
