toluaina / Pgsync

Licence: lgpl-3.0
Postgres to elasticsearch sync

PGSync

PostgreSQL to Elasticsearch sync

PGSync is a middleware for syncing data from Postgres to Elasticsearch effortlessly. It allows you to keep Postgres as your source of truth and expose structured denormalized documents in Elasticsearch.

Changes to nested entities are propagated to Elasticsearch. PGSync's advanced query builder then generates optimized SQL queries on the fly based on your schema. PGSync's advisory model allows you to move and transform large volumes of data quickly whilst maintaining relational integrity.

Simply describe your document structure or schema in JSON and PGSync will continuously capture changes in your data and load it into Elasticsearch without writing any code. PGSync transforms your relational data into a structured document format.

It allows you to take advantage of the expressive power and scalability of Elasticsearch directly from Postgres. You don't have to write complex queries and transformation pipelines. PGSync is lightweight, flexible and fast.

Elasticsearch is better suited as a secondary denormalized search engine that accompanies a more traditional normalized datastore. Moreover, you shouldn't store your primary data in Elasticsearch.

So how do you then get your data into Elasticsearch in the first place? Tools like Logstash and Kafka can aid this task but they still require a bit of engineering and development.

Extract, Transform, Load (ETL) and Change Data Capture (CDC) tools can be complex and require expensive engineering effort.

Other benefits of PGSync include:

  • Real-time analytics
  • Reliable primary datastore/source of truth
  • Scale on-demand
  • Easily join multiple nested tables

PGSync Architecture:

[Figures: PGSync architecture diagrams]

Why?

At a high level, you have data in a Postgres database and you want to mirror it in Elasticsearch.
This means every change to your data (Insert, Update, Delete and Truncate statements) needs to be replicated to Elasticsearch. At first this seems easy: simply add some code to copy the data to Elasticsearch after updating the database (so-called dual writes). In practice it is not. SQL queries that span multiple tables and relationships are hard to write, and detecting changes within a nested document can also be quite hard. Of course, if your data never changed, you could just take a snapshot in time and load it into Elasticsearch as a one-off operation.

PGSync is appropriate for you if:

  • Postgres is your read/write source of truth whilst Elasticsearch is your read-only search layer.
  • You need to denormalize relational data into a NoSQL data source.
  • Your data is constantly changing.
  • You have existing data in a relational database such as Postgres and you need a secondary NoSQL database like Elasticsearch for text-based queries or autocomplete queries to mirror the existing data without having your application perform dual writes.
  • You want to keep your existing data untouched whilst taking advantage of the search capabilities of Elasticsearch by exposing a view of your data without compromising the security of your relational data.
  • Or you simply want to expose a view of your relational data for search purposes.

How it works

PGSync is written in Python (supporting version 3.6 onwards) and the stack is composed of: Redis, Elasticsearch, Postgres, and SQLAlchemy.

PGSync leverages the logical decoding feature of Postgres (introduced in PostgreSQL 9.4) to capture a continuous stream of change events. This feature needs to be enabled by setting the following in your postgresql.conf file:

> wal_level = logical

You can select any pivot table to be the root of your document.

PGSync's query builder builds advanced queries dynamically against your schema.

PGSync operates in an event-driven model by creating triggers for tables in your database to handle notification events.

This is the only time PGSync will ever make any changes to your database.

NOTE: If you change the structure of your PGSync schema config, you need to rebuild your Elasticsearch indices. There are plans to support zero-downtime migrations to streamline this process.

Quickstart

There are several ways of installing and trying PGSync.

Running in Docker

To start up all services with Docker, run:

$ docker-compose up

Show the content in Elasticsearch:

$ curl -X GET "http://[elasticsearch host]:9201/reservations/_search?pretty=true"

Manual configuration

  • Setup

    • Ensure the database user is a superuser

    • Enable logical decoding. You also need to set at least two parameters in postgresql.conf:

      wal_level = logical

      max_replication_slots = 1

  • Installation

    • $ pip install pgsync
    • Create a schema.json for your document representation
    • Bootstrap the database (one time only): $ bootstrap --config schema.json
    • Run the program with $ pgsync --config schema.json, or as a daemon with $ pgsync --config schema.json -d
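
The two postgresql.conf parameters listed under Setup can be checked before bootstrapping. The following is a minimal sketch, not part of PGSync: the function names are hypothetical, and it only inspects what the file text states explicitly, so server defaults and settings applied by other means are not taken into account.

```python
import re
from typing import Dict, List


def parse_postgresql_conf(text: str) -> Dict[str, str]:
    """Parse 'name = value' lines, ignoring blank lines and comments.

    Simplification: everything after '#' is treated as a comment, so a
    quoted value that itself contains '#' would be cut short.
    """
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        # The '=' between name and value is optional in postgresql.conf.
        match = re.match(r"^([A-Za-z_][A-Za-z0-9_.]*)\s*=?\s*(.*)$", line)
        if match:
            name, value = match.groups()
            settings[name.lower()] = value.strip().strip("'")
    return settings


def logical_decoding_problems(text: str) -> List[str]:
    """Return a list of problems with the two settings named in Setup."""
    settings = parse_postgresql_conf(text)
    problems = []
    if settings.get("wal_level", "").lower() != "logical":
        problems.append("wal_level must be 'logical'")
    slots = settings.get("max_replication_slots", "")
    if not slots.isdigit() or int(slots) < 1:
        problems.append("max_replication_slots must be at least 1")
    return problems


example_conf = """
# replication settings
wal_level = logical   # required for logical decoding
max_replication_slots = 1
"""
print(logical_decoding_problems(example_conf) or "settings look fine")
```

An empty list means both settings are present with acceptable values in the text that was checked.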

Features

Key features of PGSync are:

  • Easily denormalize relational data.
  • Works with any PostgreSQL database (version 9.4 or later).
  • Negligible impact on database performance.
  • Transactionally consistent output in Elasticsearch. This means writes appear only when they are committed to the database, and insert, update and delete operations appear in the same order as they were committed (as opposed to eventual consistency).
  • Fault-tolerant: does not lose data, even if processes crash or a network interruption occurs, etc. The process can be recovered from the last checkpoint.
  • Returns the data directly as Postgres JSON from the database for speed.
  • Supports composite primary and foreign keys.
  • Supports an arbitrary depth of nested entities, i.e. tables with long chains of relationship dependencies.
  • Supports Postgres JSON data fields. This means JSON fields in a database table can be extracted as separate fields in the resulting document.
  • Customizable document structure.

Requirements

Example

Consider this example of a Book library database.

Book

isbn (PK)     | title                             | description
9785811243570 | Charlie and the chocolate factory | Willy Wonka’s famous chocolate factory is opening at last!
9788374950978 | Kafka on the Shore                | Kafka on the Shore is a 2002 novel by Japanese author Haruki Murakami.
9781471331435 | 1984                              | 1984 was George Orwell’s chilling prophecy about the dystopian future.

Author

id (PK) | name
1       | Roald Dahl
2       | Haruki Murakami
3       | Philip Gabriel
4       | George Orwell

BookAuthor

id (PK) | book_isbn     | author_id
1       | 9785811243570 | 1
2       | 9788374950978 | 2
3       | 9788374950978 | 3
4       | 9781471331435 | 4

With PGSync, we can simply define this JSON schema where the book table is the pivot. A pivot table indicates the root of your document.

{
    "table": "book",
    "columns": [
        "isbn",
        "title",
        "description"
    ],
    "children": [
        {
            "table": "author",
            "columns": [
                "name"
            ]
        }
    ]
}

This produces the following document structure in Elasticsearch:

[
  {
      "isbn": "9785811243570",
      "title": "Charlie and the chocolate factory",
      "description": "Willy Wonka’s famous chocolate factory is opening at last!",
      "authors": ["Roald Dahl"]
  },
  {
      "isbn": "9788374950978",
      "title": "Kafka on the Shore",
      "description": "Kafka on the Shore is a 2002 novel by Japanese author Haruki Murakami",
      "authors": ["Haruki Murakami", "Philip Gabriel"]
  },
  {
      "isbn": "9781471331435",
      "title": "1984",
      "description": "1984 was George Orwell’s chilling prophecy about the dystopian future",
      "authors": ["George Orwell"]
  }
]
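
To make the mapping from the three tables to these documents concrete, here is a small in-memory sketch in Python. It is only an illustration of the denormalization step, not how PGSync does it (PGSync generates SQL instead), and the function name build_documents is hypothetical.

```python
from collections import defaultdict

BOOKS = [
    {"isbn": "9785811243570", "title": "Charlie and the chocolate factory",
     "description": "Willy Wonka’s famous chocolate factory is opening at last!"},
    {"isbn": "9788374950978", "title": "Kafka on the Shore",
     "description": "Kafka on the Shore is a 2002 novel by Japanese author Haruki Murakami."},
    {"isbn": "9781471331435", "title": "1984",
     "description": "1984 was George Orwell’s chilling prophecy about the dystopian future."},
]
AUTHORS = [
    {"id": 1, "name": "Roald Dahl"},
    {"id": 2, "name": "Haruki Murakami"},
    {"id": 3, "name": "Philip Gabriel"},
    {"id": 4, "name": "George Orwell"},
]
BOOK_AUTHORS = [
    {"id": 1, "book_isbn": "9785811243570", "author_id": 1},
    {"id": 2, "book_isbn": "9788374950978", "author_id": 2},
    {"id": 3, "book_isbn": "9788374950978", "author_id": 3},
    {"id": 4, "book_isbn": "9781471331435", "author_id": 4},
]


def build_documents(books, authors, book_authors):
    """Nest author names under each book, with book as the pivot (root)."""
    name_by_id = {author["id"]: author["name"] for author in authors}
    # Group author names by the ISBN they are linked to, in link-id order.
    names_by_isbn = defaultdict(list)
    for link in sorted(book_authors, key=lambda row: row["id"]):
        names_by_isbn[link["book_isbn"]].append(name_by_id[link["author_id"]])
    return [
        {
            "isbn": book["isbn"],
            "title": book["title"],
            "description": book["description"],
            # A book without any linked author gets an empty list here.
            "authors": names_by_isbn.get(book["isbn"], []),
        }
        for book in books
    ]


documents = build_documents(BOOKS, AUTHORS, BOOK_AUTHORS)
for doc in documents:
    print(doc["title"], doc["authors"])
```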

Behind the scenes, PGSync generates advanced queries for you, such as:

SELECT 
       JSON_BUILD_OBJECT(
          'isbn', book_1.isbn, 
          'title', book_1.title, 
          'description', book_1.description,
          'authors', anon_1.authors
       ) AS "JSON_BUILD_OBJECT_1",
       book_1.isbn
FROM book AS book_1
LEFT OUTER JOIN
  (SELECT 
          JSON_AGG(anon_2.anon) AS authors,
          book_author_1.book_isbn AS book_isbn
   FROM book_author AS book_author_1
   LEFT OUTER JOIN
     (SELECT 
             author_1.name AS anon,
             author_1.id AS id
      FROM author AS author_1) AS anon_2 ON anon_2.id = book_author_1.author_id
   GROUP BY book_author_1.book_isbn) AS anon_1 ON anon_1.book_isbn = book_1.isbn

You can also configure PGSync to rename attributes via the schema config, e.g.:

  {
      "isbn": "9781471331435",
      "this_is_a_custom_title": "1984",
      "desc": "1984 was George Orwell’s chilling prophecy about the dystopian future",
      "contributors": ["George Orwell"]
  }
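
The effect of such a rename can be sketched in a few lines of Python. This only illustrates the resulting transformation on a single document; the function name and the mapping format are hypothetical and are not PGSync's schema syntax.

```python
def rename_fields(document, renames):
    """Return a copy of document with keys replaced according to renames.

    Keys without an entry in renames are kept unchanged.
    """
    return {renames.get(key, key): value for key, value in document.items()}


original = {
    "isbn": "9781471331435",
    "title": "1984",
    "description": "1984 was George Orwell’s chilling prophecy about the dystopian future",
    "authors": ["George Orwell"],
}
renamed = rename_fields(
    original,
    {
        "title": "this_is_a_custom_title",
        "description": "desc",
        "authors": "contributors",
    },
)
print(list(renamed))
```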

PGSync addresses the following challenges:

  • What if we update the author's name in the database?
  • What if we wanted to add another author for an existing book?
  • What if lots of existing documents share the same author and we want to change that author's name?
  • What if we delete or update an author?
  • What if we truncate an entire table?
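
The common thread in these questions is working out which root documents a change to a nested row affects. The sketch below shows that lookup for the book library example. It is an assumption-laden illustration rather than PGSync's implementation: the event shape (a dict with "table" and "row" keys) and the function name are hypothetical, and truncation is not handled.

```python
BOOK_AUTHORS = [
    {"id": 1, "book_isbn": "9785811243570", "author_id": 1},
    {"id": 2, "book_isbn": "9788374950978", "author_id": 2},
    {"id": 3, "book_isbn": "9788374950978", "author_id": 3},
    {"id": 4, "book_isbn": "9781471331435", "author_id": 4},
]


def affected_isbns(event, book_authors):
    """Return the ISBNs of book documents to rebuild for one change event."""
    table, row = event["table"], event["row"]
    if table == "book":
        # The pivot row itself changed.
        return {row["isbn"]}
    if table == "book_author":
        # A link was added, changed or removed: rebuild the linked book.
        return {row["book_isbn"]}
    if table == "author":
        # An author changed: rebuild every book linked to that author.
        return {
            link["book_isbn"]
            for link in book_authors
            if link["author_id"] == row["id"]
        }
    # Tables outside the schema do not affect any document.
    return set()


# Renaming author 2 (Haruki Murakami) touches one book document.
print(affected_isbns({"table": "author", "row": {"id": 2}}, BOOK_AUTHORS))
```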

Benefits

  • PGSync is a simple-to-use, out-of-the-box solution for change data capture.
  • PGSync handles data deletions.
  • PGSync requires little development effort. You simply define a schema config describing your data.
  • PGSync generates advanced queries matching your schema directly.
  • PGSync allows you to easily rebuild your indexes in case of a schema change.
  • You can expose only the data you require in Elasticsearch.
  • Supports multiple Postgres schemas for multi-tenant applications.

Contributing

Contributions are very welcome! Check out the Contribution Guidelines for instructions.

Credits

  • This package was created with Cookiecutter
  • Elasticsearch is a trademark of Elasticsearch BV, registered in the U.S. and in other countries.

License

This code is released under the GNU Lesser General Public License, version 3.0 (LGPL-3.0).
Please see LICENSE for more details.

You should have received a copy of the GNU Lesser General Public License along with PGSync.
If not, see https://www.gnu.org/licenses/.
