
marceloboeira / Voik

License: MIT
♒︎ [WIP] An experimental ~distributed~ commit-log

Programming Languages

rust

Projects that are alternatives to or similar to Voik

Dafka
Dafka is a decentralized distributed streaming platform
Stars: ✭ 83 (-58.5%)
Mutual labels:  streaming, distributed
Flink Learning
Flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink fundamentals, concepts, principles, hands-on practice, performance tuning, source-code analysis, and more. Includes study examples on Flink Connectors, Metrics, Libraries, the DataStream API, Table API & SQL, as well as large production case studies (PV/UV, log storage, real-time deduplication of tens of billions of records, monitoring and alerting). Feel free to support my column "The Big Data Real-Time Computing Engine Flink in Practice and Performance Optimization".
Stars: ✭ 11,378 (+5589%)
Mutual labels:  kafka, streaming
Filodb
Distributed Prometheus time series database
Stars: ✭ 1,286 (+543%)
Mutual labels:  kafka, distributed
Eventql
Distributed "massively parallel" SQL query engine
Stars: ✭ 1,121 (+460.5%)
Mutual labels:  streaming, distributed
Kafka Streams In Action
Source code for the Kafka Streams in Action Book
Stars: ✭ 167 (-16.5%)
Mutual labels:  kafka, streaming
Hydra
A real-time data replication platform that "unbundles" the receiving, transforming, and transport of data streams.
Stars: ✭ 68 (-66%)
Mutual labels:  kafka, streaming
Bigdata Notebook
Stars: ✭ 100 (-50%)
Mutual labels:  kafka, streaming
Js Ipfs
IPFS implementation in JavaScript
Stars: ✭ 6,129 (+2964.5%)
Mutual labels:  immutable, distributed
Redpanda
Redpanda is the real-time engine for modern apps. Kafka API Compatible; 10x faster 🚀 See more at vectorized.io/redpanda
Stars: ✭ 3,114 (+1457%)
Mutual labels:  kafka, streaming
Streamline
StreamLine - Streaming Analytics
Stars: ✭ 151 (-24.5%)
Mutual labels:  kafka, streaming
Scrapy Cluster
This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
Stars: ✭ 921 (+360.5%)
Mutual labels:  kafka, distributed
Liftbridge
Lightweight, fault-tolerant message streams.
Stars: ✭ 2,175 (+987.5%)
Mutual labels:  nats, streaming
Stream Reactor
Streaming reference architecture for ETL with Kafka and Kafka-Connect. You can find more on http://lenses.io on how we provide a unified solution to manage your connectors, most advanced SQL engine for Kafka and Kafka Streams, cluster monitoring and alerting, and more.
Stars: ✭ 753 (+276.5%)
Mutual labels:  kafka, streaming
Fs2 Kafka
Kafka client for functional streams for scala (fs2)
Stars: ✭ 75 (-62.5%)
Mutual labels:  kafka, streaming
Kafka Connect Jdbc
Kafka Connect connector for JDBC-compatible databases
Stars: ✭ 698 (+249%)
Mutual labels:  kafka, streaming
Streamx
kafka-connect-s3 : Ingest data from Kafka to Object Stores(s3)
Stars: ✭ 96 (-52%)
Mutual labels:  kafka, streaming
Sparta
Real Time Analytics and Data Pipelines based on Spark Streaming
Stars: ✭ 513 (+156.5%)
Mutual labels:  kafka, streaming
Jeesuite Libs
Distributed architecture development toolkit. Includes caching (two-level caching with automatic cache management), queues, distributed scheduled tasks, file services (Qiniu, Aliyun OSS, FastDFS), logging, search, distributed locks, distributed transactions, Dubbo integration, Spring Boot support, and common utilities.
Stars: ✭ 584 (+192%)
Mutual labels:  kafka, distributed
Azure Event Hubs Spark
Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs
Stars: ✭ 140 (-30%)
Mutual labels:  kafka, streaming
Onyx
Distributed, masterless, high performance, fault tolerant data processing
Stars: ✭ 2,019 (+909.5%)
Mutual labels:  streaming, distributed

An experimental distributed streaming platform

Status

Currently working on the foundation of the storage layer.

Found an issue? Feel like contributing? Make sure to check out our contributing guide first.

To learn more about component internals, performance, and references, please check the architecture internals documentation.

Project Goals

  • Learn
  • Implement a Kinesis-like streaming-service
  • Single binary
  • Easy to Host, Run & Operate

Commands

Available make commands

  • make build - Builds the application with cargo
  • make build_release - Builds the application with cargo, with release optimizations
  • make docker_test_watcher - Runs funzzy on Linux over docker-compose
  • make docs - Generates the GitHub Markdown docs (at the moment, only Mermaid)
  • make format - Formats the code according to cargo
  • make help - Lists the available commands
  • make install - Builds a release version and installs it to your cargo bin path
  • make run - Runs the newly built binary
  • make test - Tests all features

Architecture

At this point, only the foundation of the Storage layer is implemented; the other parts of the architecture diagram illustrate future components.

Storage

The storage layer is where the data is persisted for long-term reading.

CommitLog

The main component of the whole system is the commit-log: an abstraction that manages reads and writes to the log by implementing an immutable, append-only, file-backed sequence of "records", chunks of data/events that are transmitted from producers to consumers.

Records are written to the log by always appending after the last record.

e.g.:

                          current cursor
 segment 0                       ^
 |-------------------------------|
 | record 0  |  record 1  |  ... |  --> time
 |-------------------------------|

In order to manage and scale reads and writes, the commit-log splits groups of records into Segments, writing to a single segment until it reaches a specified size.

Each time a record is written, the segment is trusted to have enough space for the given buffer; the record is then written to the current segment, and the pointer is updated.

More info in the commit_log/src/lib.rs file.
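
To make the write/rotate flow concrete, here is a minimal, self-contained sketch of an append-only commit log with size-based segment rotation. Segments are modeled as plain in-memory buffers and every name is illustrative; this is not Voik's actual API (the real one lives in commit_log/src/lib.rs, linked above).

// Illustrative sketch only -- not Voik's actual implementation.
struct Segment {
    data: Vec<u8>,
    capacity: usize,
}

impl Segment {
    fn new(capacity: usize) -> Self {
        Segment { data: Vec::with_capacity(capacity), capacity }
    }

    fn fits(&self, len: usize) -> bool {
        self.data.len() + len <= self.capacity
    }

    // Append-only: the record always lands after the last one written.
    fn write(&mut self, record: &[u8]) -> usize {
        let offset = self.data.len();
        self.data.extend_from_slice(record);
        offset
    }
}

struct CommitLog {
    segments: Vec<Segment>,
    segment_size: usize, // rotation threshold, e.g. 20 MB in the benchmarks below
}

impl CommitLog {
    fn new(segment_size: usize) -> Self {
        CommitLog { segments: vec![Segment::new(segment_size)], segment_size }
    }

    // Write a record, rotating to a fresh segment when the active one
    // has no room left for the given buffer.
    fn write(&mut self, record: &[u8]) -> (usize, usize) {
        if !self.segments.last().unwrap().fits(record.len()) {
            self.segments.push(Segment::new(self.segment_size)); // rotate
        }
        let segment = self.segments.len() - 1;
        let offset = self.segments.last_mut().unwrap().write(record);
        (segment, offset)
    }
}

fn main() {
    let mut log = CommitLog::new(16); // tiny segments, to force a rotation
    log.write(b"record 0");
    log.write(b"record 1");
    let (segment, offset) = log.write(b"record 2"); // triggers a rotation
    println!("record 2 -> segment {segment}, offset {offset}"); // segment 1, offset 0
}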

Segment

A Segment is a tuple abstraction to manage the Index and Log files.

Every Segment is composed of a log-file and an index, e.g.:

00000000000011812312.log
00000000000011812312.idx

The role of the segment is to manage writes to the logfile and ensure the entries can be read later on by doing lookups in the index.

On every write, the segment writes an entry to the index with the record's position and size in the log-file, for later use.

The segment also manages the size of the log file, preventing it from being written once it reaches the specified size.

When a segment is full, the commit log makes sure to rotate to a new one, closing the old one.

This is how it looks on disk (at a high level):

                                                       current cursor
segment 0                                                     ^
|-------------------------------|                             |
| record 0  |  record 1  |  ... | segment 1 (current)         |
|-------------------------------|-----------------------------| --> time
                                |  record 2  | record 3 | ... |
                                |-----------------------------|

Under the hood things are a bit more complex: writing the file to disk is the Segment's responsibility, as is managing the Index file.

More info in the commit_log/src/segment.rs, commit_log/src/segment/index.rs, and commit_log/src/segment/log.rs files.
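
As a rough sketch of the (log, index) pairing described above; all names, signatures, and the exact file handling here are assumptions, not the real code in commit_log/src/segment.rs:

// Illustrative sketch -- names and signatures are assumptions.
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::path::Path;

struct Segment {
    log: File,       // e.g. 00000000000011812312.log -- the records themselves
    index: File,     // e.g. 00000000000011812312.idx -- (offset, size) entries
    offset: usize,   // current write position in the log file
    max_size: usize, // the segment stops accepting writes at this size
}

impl Segment {
    fn open(dir: &Path, base_offset: u64, max_size: usize) -> std::io::Result<Self> {
        // Both files share the same zero-padded, 20-digit base name.
        let open = |ext: &str| {
            OpenOptions::new()
                .create(true)
                .append(true)
                .open(dir.join(format!("{:020}.{}", base_offset, ext)))
        };
        Ok(Segment { log: open("log")?, index: open("idx")?, offset: 0, max_size })
    }

    fn is_full(&self, incoming: usize) -> bool {
        self.offset + incoming > self.max_size
    }

    // Write the record to the log file and register its (offset, size)
    // in the index, so the entry can be looked up later.
    fn write(&mut self, record: &[u8]) -> std::io::Result<usize> {
        let offset = self.offset;
        self.log.write_all(record)?;
        write!(self.index, "{:010}{:010}", offset, record.len())?; // 20-byte entry
        self.offset += record.len();
        Ok(offset)
    }
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir();
    let mut segment = Segment::open(&dir, 11812312, 20 * 1024 * 1024)?;
    let offset = segment.write(b"hello")?;
    println!("wrote at offset {offset}, full: {}", segment.is_full(1));
    Ok(())
}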

Log file

The log file is a variable-size sequence of bytes storing the content of the records coming from producers. However, the log itself doesn't have any mechanism for recovering those records; that is the responsibility of the index.

Once initialized, the log-file is truncated (pre-sized) to the desired size, reserving both memory and disk space up front; the same goes for the index.

                         current cursor
                                ^
|-------------------------------|
| record 0  |  record 1  |  ... |----> time
|-------------------------------|

Neither reads nor writes to the log-file directly trigger disk-level actions.

Both operations are intermediated by memory-mapped buffers, managed by the OS.

More info in the commit_log/src/segment/log.rs file.
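
As an illustration of that idea, a pre-sized, memory-mapped log file could be written through roughly like this. The sketch assumes the memmap2 crate (memmap2 = "0.9"); whether Voik uses that exact crate is an assumption, the real code is in commit_log/src/segment/log.rs:

// Sketch: writing through a memory-mapped, pre-sized log file.
// Assumes the memmap2 crate; the actual project may use a different one.
use memmap2::MmapMut;
use std::fs::OpenOptions;

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("00000000000000000000.log");
    let file = OpenOptions::new().read(true).write(true).create(true).open(&path)?;

    // Pre-size (truncate) the file up front, reserving the disk space
    // for the whole segment, as described above.
    file.set_len(20 * 1024 * 1024)?; // 20 MB, as in the benchmarks below

    // Writes land in the mapped pages; the OS decides when they hit disk.
    let mut mmap = unsafe { MmapMut::map_mut(&file)? };
    let record = b"record 0";
    mmap[..record.len()].copy_from_slice(record);

    // An explicit flush forces the dirty pages out; without it, the OS
    // writes them back on its own schedule.
    mmap.flush()?;
    Ok(())
}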

Index file

The role of the index is to provide pointers to records in the log file. Each entry of the index is 20 bytes long: 10 bytes for the offset address of the record in the log file, and the other 10 bytes for the size of the record.

e.g.:

                          current cursor
                                 ^
 |-------------------------------|
 | offset-size | offset-size |...|----> time
 |-------------------------------|

There is no separator; entries are position-based.

e.g.:

00000001000000000020
----------+---------
  offset  |  size

* 0000000100 -> offset (the record starts at byte 100 of the log file)
* 0000000020 -> size (the record is 20 bytes long)
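
A tiny sketch of this fixed-width encoding (illustrative only; the real code is in commit_log/src/segment/index.rs):

// Each index entry is 20 ASCII digits: 10 for the offset, 10 for the size.
const ENTRY_SIZE: usize = 20;

fn encode(offset: usize, size: usize) -> String {
    format!("{:010}{:010}", offset, size)
}

fn decode(entry: &str) -> (usize, usize) {
    let offset = entry[..10].parse().expect("10 ASCII digits");
    let size = entry[10..].parse().expect("10 ASCII digits");
    (offset, size)
}

fn main() {
    let entry = encode(100, 20);
    assert_eq!(entry, "00000001000000000020"); // the example above
    assert_eq!(decode(&entry), (100, 20));
    // No separator needed: the n-th entry always starts at byte n * ENTRY_SIZE.
    println!("entry #3 starts at byte {}", 3 * ENTRY_SIZE);
}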

Neither reads nor writes to the index directly trigger disk-level actions.

Both operations are intermediated by memory-mapped buffers, managed by the OS.

More info in the commit_log/src/segment/index.rs file.

Performance

These are preliminary, informally collected results, yet they look interesting:

Storage (Tests are completely offline, no network¹ ...)

  • Setup 1:
OS: macOS Mojave 10.14.4 (18E226)
CPU: 2.5 GHz Intel Core i7
RAM: 16 GB 2133 MHz LPDDR3
HD: 256 GB SSD Storage
---------------
Segment size: 20 MB
Index size: 10 MB
5 GB worth of records written in 37.667706s
5 GB worth of cold records read in 1.384433s
5 GB worth of warm records read in 1.266053s

Per-segment²:

  • ~130 MB/s on write

  • ~3.7 GB/s on cold read (while loading into memory pages)

  • ~4.2 GB/s on warm read (already loaded into memory pages)

  • Setup 2:

OS: macOS Mojave 10.14.5 (18F203)
CPU: 2.9 GHz Intel Core i9
RAM: 32 GB 2400 MHz DDR4
HD: 500 GB SSD Storage
---------------
Segment size: 20 MB
Index size: 10 MB
5 GB worth of records written in 26.851791s
5 GB worth of cold records read in 141.969ms
5 GB worth of warm records read in 124.623ms

Per-segment²:

  • ~187 MB/s on write

  • ~35 GB/s on cold read (while loading into memory pages)

  • ~41 GB/s on warm read (already loaded into memory pages)

  • Setup 3:

OS: macOS Mojave 10.14.5 (18F203)
CPU: 2.9 GHz Intel Core i9
RAM: 32 GB 2400 MHz DDR4
HD: 500 GB SSD Storage
---------------
Segment size: 50 MB
Index size: 20 MB
10 GB worth of records written in 54.96796s
10 GB worth of cold records read in 437.933ms
10 GB worth of warm records read in 310.853ms

Per-segment²:

  • ~181 MB/s on write
  • ~22 GB/s on cold read (while loading into memory pages)
  • ~21 GB/s on warm read (already loaded into memory pages)

Notes:

  • ¹ - Offline: no network overhead is taken into account; the network will be a big factor in the overall overhead. However, the focus for now is storage.
  • ² - Per-segment performance; in a comparison with Kinesis/Kafka, this would be the per-shard value. If you were to have 10 shards, you could achieve roughly 10x that, limited by external factors (HD/CPU/...). See the back-of-envelope check below.
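
For reference, the per-segment figures above are consistent with simply dividing total size by elapsed time. A back-of-envelope check for Setup 1 (assuming 1 GB = 1024 MB):

// Back-of-envelope: throughput = size / elapsed time.
fn mb_per_s(gigabytes: f64, seconds: f64) -> f64 {
    gigabytes * 1024.0 / seconds
}

fn main() {
    // 5 GB written in 37.667706 s -> ~136 MB/s (reported as ~130 MB/s)
    println!("write:     {:.0} MB/s", mb_per_s(5.0, 37.667706));
    // 5 GB cold-read in 1.384433 s -> ~3.6 GB/s (reported as ~3.7 GB/s)
    println!("cold read: {:.1} GB/s", mb_per_s(5.0, 1.384433) / 1024.0);
    // 5 GB warm-read in 1.266053 s -> ~4.0 GB/s (reported as ~4.2 GB/s)
    println!("warm read: {:.1} GB/s", mb_per_s(5.0, 1.266053) / 1024.0);
}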
