HashDataInc / Bireme

Licence: apache-2.0
Bireme is an incremental synchronization tool for the Greenplum / HashData data warehouse

Programming Languages

java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to Bireme

Devops Bash Tools
550+ DevOps Bash Scripts - AWS, GCP, Kubernetes, Kafka, Docker, APIs, Hadoop, SQL, PostgreSQL, MySQL, Hive, Impala, Travis CI, Jenkins, Concourse, GitHub, GitLab, BitBucket, Azure DevOps, TeamCity, Spotify, MP3, LDAP, Code/Build Linting, pkg mgmt for Linux, Mac, Python, Perl, Ruby, NodeJS, Golang, Advanced dotfiles: .bashrc, .vimrc, .gitconfig, .screenrc, .tmux.conf, .psqlrc ...
Stars: ✭ 226 (+105.45%)
Mutual labels:  kafka, mysql, postgresql
Back End Interview
A collection of back-end interview questions (Python, Redis, MySQL, PostgreSQL, Kafka, data structures, algorithms, programming, networking)
Stars: ✭ 188 (+70.91%)
Mutual labels:  kafka, mysql, postgresql
Synch
Sync data from other databases to ClickHouse (cluster)
Stars: ✭ 200 (+81.82%)
Mutual labels:  kafka, mysql, postgresql
Storagetapper
StorageTapper is a scalable realtime MySQL change data streaming, logical backup and logical replication service
Stars: ✭ 232 (+110.91%)
Mutual labels:  kafka, mysql, postgresql
Datafaker
Datafaker is a large-scale test data and flow test data generation tool. It fakes data and inserts it into a variety of data sources.
Stars: ✭ 327 (+197.27%)
Mutual labels:  kafka, mysql, postgresql
Spring Boot 2.x Examples
Spring Boot 2.x code examples
Stars: ✭ 104 (-5.45%)
Mutual labels:  kafka, mysql, postgresql
Symmetric Ds
SymmetricDS is a database and file synchronization solution that is platform-independent, web-enabled, and database agnostic. SymmetricDS was built to make data replication across two to tens of thousands of databases and file systems fast, easy and resilient. We specialize in near real time, bi-directional data replication across large node networks over the WAN or LAN.
Stars: ✭ 450 (+309.09%)
Mutual labels:  mysql, postgresql, synchronization
Pmacct
pmacct is a small set of multi-purpose passive network monitoring tools [NetFlow IPFIX sFlow libpcap BGP BMP RPKI IGP Streaming Telemetry].
Stars: ✭ 677 (+515.45%)
Mutual labels:  kafka, mysql, postgresql
Xeus Sql
xeus-sql is a Jupyter kernel for general SQL implementations.
Stars: ✭ 85 (-22.73%)
Mutual labels:  mysql, postgresql
Graphjin
GraphJin - Build APIs in 5 minutes with GraphQL. An instant GraphQL to SQL compiler.
Stars: ✭ 1,264 (+1049.09%)
Mutual labels:  mysql, postgresql
Xgenecloud
XgeneCloud is now https://github.com/nocodb/nocodb
Stars: ✭ 1,629 (+1380.91%)
Mutual labels:  mysql, postgresql
Gopherus
This tool generates gopher link for exploiting SSRF and gaining RCE in various servers
Stars: ✭ 1,258 (+1043.64%)
Mutual labels:  mysql, postgresql
Chloe
A lightweight and high-performance Object/Relational Mapping (ORM) library for .NET (C#)
Stars: ✭ 1,248 (+1034.55%)
Mutual labels:  mysql, postgresql
Haproxy Configs
80+ HAProxy Configs for Hadoop, Big Data, NoSQL, Docker, Elasticsearch, SolrCloud, HBase, MySQL, PostgreSQL, Apache Drill, Hive, Presto, Impala, Hue, ZooKeeper, SSH, RabbitMQ, Redis, Riak, Cloudera, OpenTSDB, InfluxDB, Prometheus, Kibana, Graphite, Rancher etc.
Stars: ✭ 106 (-3.64%)
Mutual labels:  mysql, postgresql
Open Bank Mark
A bank simulation application using mainly Clojure, which can be used to end-to-end test and show some graphs.
Stars: ✭ 81 (-26.36%)
Mutual labels:  kafka, postgresql
Clitools
🔧 CliTools for Docker, PHP / MySQL development, debugging and synchronization
Stars: ✭ 86 (-21.82%)
Mutual labels:  mysql, synchronization
Prisma
Next-generation ORM for Node.js & TypeScript | PostgreSQL, MySQL, MariaDB, SQL Server, SQLite & MongoDB (Preview)
Stars: ✭ 18,168 (+16416.36%)
Mutual labels:  mysql, postgresql
Sql
MySQL & PostgreSQL pipe
Stars: ✭ 81 (-26.36%)
Mutual labels:  mysql, postgresql
Electrocrud
Database CRUD Application Built on Electron | MySQL, Postgres, SQLite
Stars: ✭ 1,267 (+1051.82%)
Mutual labels:  mysql, postgresql
Qtl
A friendly and lightweight C++ database library for MySQL, PostgreSQL, SQLite and ODBC.
Stars: ✭ 92 (-16.36%)
Mutual labels:  mysql, postgresql

bireme

Build Status

Chinese Documentation (中文文档)

Getting Started Guide

Bireme is an incremental synchronization tool for the Greenplum / HashData data warehouse. It currently supports MySQL, PostgreSQL and MongoDB data sources.

Greenplum is an advanced, fully featured open source data warehouse that provides powerful and fast analytics on petabyte-scale data volumes. Uniquely geared toward big data analytics, it is powered by the world's most advanced cost-based query optimizer and delivers high query performance over large amounts of data.

HashData is a flexible cloud data warehouse built on Greenplum.

Bireme uses DELETE + COPY to synchronize modification records from the data source to Greenplum / HashData. This approach is faster and more efficient than INSERT + UPDATE + DELETE.
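
As a minimal sketch of this pattern (not bireme's actual code; the table, rows, and connection settings below are hypothetical), a batch could be applied through the PostgreSQL JDBC driver like this:

import java.io.StringReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class DeleteCopySketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection settings; in bireme these come from config.properties.
    Connection conn = DriverManager.getConnection(
        "jdbc:postgresql://127.0.0.1:5432/postgres", "gpadmin", "changeme");
    conn.setAutoCommit(false);

    // Step 1: delete every row touched by this batch, keyed on the primary key.
    try (Statement st = conn.createStatement()) {
      st.executeUpdate("DELETE FROM public.orders WHERE id IN (1, 2, 3)");
    }

    // Step 2: bulk-load the latest version of those rows with COPY.
    CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
    String rows = "1,alice,42.50\n2,bob,17.00\n3,carol,99.90\n";
    copy.copyIn("COPY public.orders (id, customer, amount) FROM STDIN WITH CSV",
        new StringReader(rows));

    conn.commit();
    conn.close();
  }
}

Running both steps in a single transaction keeps readers of the target table from ever observing a half-applied batch.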

Features and Constraints:

  • Uses small-batch loading to improve data synchronization performance. The default load delay is 10 seconds.
  • All tables in the target database must have primary keys.

1.1 Data Flow

[Figure: data flow]

Bireme supports synchronizing multiple data sources. It reads records from multiple data sources in parallel and loads them into the target database.

1.2 Data Source

1.2.1 Maxwell + Kafka

Maxwell + Kafka is a data source type that bireme currently supports. The structure is as follows:

[Figure: Maxwell + Kafka data source]

  • Maxwell is an application that reads MySQL binlogs and writes row updates to Kafka as JSON.

1.2.2 Debezium + Kafka

Debezium + Kafka is another data source type that bireme currently supports. The structure is as follows:

[Figure: Debezium + Kafka data source]

  • Debezium is a distributed platform that turns your existing databases into event streams, so that applications can see and respond immediately to each row-level change in the databases.

1.3 How bireme works

Bireme reads records from the data source and delivers them into separate pipelines. In each pipeline, bireme converts the records into an internal format and caches them. When the cached records reach a certain amount, they are merged into a task. Each task contains two collections, a delete collection and an insert collection, which are finally applied to the target database.
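
As an illustration only (the class and method names below are assumptions, not bireme's real implementation), a merged task for one target table might be modeled roughly like this:

import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: a merged batch of changes destined for one target table. */
public class MergeTask {
  private final String targetTable;
  // Primary keys of all rows touched by this batch; they are deleted first.
  private final Set<String> deleteKeys = new LinkedHashSet<>();
  // Latest row image per primary key, kept as a CSV line ready for COPY.
  private final Map<String, String> insertRows = new LinkedHashMap<>();

  public MergeTask(String targetTable) {
    this.targetTable = targetTable;
  }

  /** Fold one change record into the task. */
  public void addChange(String primaryKey, String csvRow, boolean isDelete) {
    deleteKeys.add(primaryKey);           // the old version is always removed
    if (isDelete) {
      insertRows.remove(primaryKey);      // DELETE: nothing to re-insert
    } else {
      insertRows.put(primaryKey, csvRow); // INSERT/UPDATE: keep the newest image
    }
  }

  public String table() { return targetTable; }
  public Set<String> deleteCollection() { return deleteKeys; }
  public Map<String, String> insertCollection() { return insertRows; }
}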

Each data source may have several pipelines. For maxwell, each Kafka partition corresponds to a pipeline; for debezium, each Kafka topic corresponds to a pipeline.

[Figure: bireme architecture]

The following picture depicts how change data is processed in a pipeline.

[Figure: change data processing in a pipeline]

1.4 Introduction to configuration files

The configuration files consist of two parts:

  • Basic configuration file: The default is config.properties, which contains the basic configuration of bireme.
  • Table mapping file: <source_name>.properties. Each data source corresponds to one file, which specifies the tables to be synchronized and the corresponding tables in the target database. <source_name> is specified in the config.properties file.

1.4.1 config.properties

Required parameters

  • target.url: Address of the target database. Format: jdbc:postgresql://<ip>:<port>/<database>
  • target.user: User name used to connect to the target database
  • target.passwd: Password used to connect to the target database
  • data.source: The data sources (<source_name>); multiple data sources are separated by commas, and whitespace is ignored
  • <source_name>.type: Type of the data source, for example maxwell

Note: The data source name is just a symbol for convenience. It can be modified as needed.

Parameters for Maxwell data source

  • <source_name>.kafka.server: Kafka address. Format: <ip>:<port>
  • <source_name>.kafka.topic: Kafka topic of the data source
  • <source_name>.kafka.groupid: Kafka consumer group id. Default: bireme

Parameters for Debezium data source

  • <source_name>.kafka.server: Kafka address. Format: <ip>:<port>
  • <source_name>.kafka.groupid: Kafka consumer group id. Default: bireme
  • <source_name>.kafka.namespace: Name (namespace) of the Debezium connector

Other parameters

  • pipeline.thread_pool.size: Thread pool size for Pipeline. Default: 5
  • transform.thread_pool.size: Thread pool size for Transform. Default: 10
  • merge.thread_pool.size: Thread pool size for Merge. Default: 10
  • merge.interval: Maximum interval between Merges, in milliseconds. Default: 10000
  • merge.batch.size: Maximum number of rows in one Merge. Default: 50000
  • loader.conn_pool.size: Number of connections to the target database, less than or equal to the number of Change Loaders. Default: 10
  • loader.task_queue.size: Length of the task queue in each Change Loader. Default: 2
  • metrics.reporter: Monitoring mode, console or jmx; set to none if monitoring is not needed. Default: jmx
  • metrics.reporter.console.interval: Interval between metrics output, in seconds; effective only when metrics.reporter is console. Default: 10
  • state.server.port: Port of the state server. Default: 8080
  • state.server.addr: IP address the state server binds to. Default: 0.0.0.0
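
Putting these together, a minimal config.properties for a single Maxwell data source might look like the following (the addresses, credentials, and the source name mysql1 are hypothetical):

target.url = jdbc:postgresql://127.0.0.1:5432/postgres
target.user = gpadmin
target.passwd = changeme

data.source = mysql1

mysql1.type = maxwell
mysql1.kafka.server = 127.0.0.1:9092
mysql1.kafka.topic = maxwell
mysql1.kafka.groupid = bireme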

1.4.2 <source_name>.properties

In the configuration file for each data source, specify the tables that the data source includes and the corresponding tables in the target database.

<OriginTable_1> = <MappedTable_1>
<OriginTable_2> = <MappedTable_2>
...
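
For example, a hypothetical mysql1.properties that maps two tables from a MySQL database named demo into the public schema of the target database could read:

demo.users = public.users
demo.orders = public.orders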

1.5 Monitoring

HTTP Server

Bireme starts a light HTTP server for acquiring current Load State.

When the HTTP server is started, the following endpoints are exposed:

  • / : Get the load state for all data sources.
  • /<data source> : Get the load state for the given data source.

The result is returned in JSON format. Adding the pretty parameter prints a human-friendly result.
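
For instance, with the default state server settings and a hypothetical data source named mysql1, the pretty-printed state could be fetched with:

curl 'http://127.0.0.1:8080/mysql1?pretty'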

Example

The following is an example of Load State:

{
  "source_name": "XXX",
  "type": "XXX"
  "pipelines": [
    {
      "name": "XXXXXX",
      "latest": "yyyy-MM-ddTHH:mm:ss.SSSZ",
      "delay": XX.XXX,
      "state": "XXXXX"
    },
    {
      "name": "XXXXXX",
      "latest": "yyyy-MM-ddTHH:mm:ss.SSSZ",
      "delay": XX.XXX,
      "state": "XXXXX"
    }
  ]
}
  • source_name is the name of the queried data source, as specified in the configuration file.
  • type is the type of the data source.
  • pipelines is an array in which each element corresponds to a pipeline (a data source may have several separate pipelines).
  • name is the pipeline's name.
  • latest is the produce time of the latest change data that has been successfully loaded into the target database.
  • delay is the time from when change data enters bireme until it is committed to the target database.
  • state is the pipeline's state.

1.6 Reference

Maxwell Reference
Debezium Reference
Kafka Reference
