
nvanbenschoten / epaxos

License: Apache-2.0
A pluggable implementation of the Egalitarian Paxos Consensus Protocol

Programming Languages

go
31211 projects - #10 most used programming language
Makefile
30231 projects

Projects that are alternatives of or similar to epaxos

Copycat
A novel implementation of the Raft consensus algorithm
Stars: ✭ 551 (+1312.82%)
Mutual labels:  distributed-systems, replication, consensus
Dragonboat
Dragonboat is a high performance multi-group Raft consensus library in pure Go.
Stars: ✭ 3,983 (+10112.82%)
Mutual labels:  distributed-systems, consensus, paxos
Translations
🐼 Chinese translations for classic IT resources
Stars: ✭ 6,074 (+15474.36%)
Mutual labels:  distributed-systems, consensus, paxos
Nuraft
C++ implementation of Raft core logic as a replication library
Stars: ✭ 428 (+997.44%)
Mutual labels:  distributed-systems, replication, consensus
Awesome Distributed Systems
A curated list to learn about distributed systems
Stars: ✭ 7,263 (+18523.08%)
Mutual labels:  distributed-systems, consensus, paxos
Bifrost
Pure rust building block for distributed systems
Stars: ✭ 118 (+202.56%)
Mutual labels:  distributed-systems, consensus
Zatt
Python implementation of the Raft algorithm for distributed consensus
Stars: ✭ 119 (+205.13%)
Mutual labels:  distributed-systems, consensus
Verdi Raft
An implementation of the Raft distributed consensus protocol, verified in Coq using the Verdi framework
Stars: ✭ 143 (+266.67%)
Mutual labels:  distributed-systems, consensus
Awesome Substrate
A curated list of awesome projects and resources related to the Substrate blockchain development framework.
Stars: ✭ 228 (+484.62%)
Mutual labels:  distributed-systems, consensus
Go Craq
CRAQ (Chain Replication with Apportioned Queries) in Go
Stars: ✭ 75 (+92.31%)
Mutual labels:  distributed-systems, replication
Seaweedfs
SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek, cloud tiering. Filer supports Cloud Drive, cross-DC active-active replication, Kubernetes, POSIX FUSE mount, S3 API, S3 Gateway, Hadoop, WebDAV, encryption, Erasure Coding.
Stars: ✭ 13,380 (+34207.69%)
Mutual labels:  distributed-systems, replication
vrrm
rough code for running consensus
Stars: ✭ 18 (-53.85%)
Mutual labels:  replication, consensus
Etcd
Distributed reliable key-value store for the most critical data of a distributed system
Stars: ✭ 38,238 (+97946.15%)
Mutual labels:  distributed-systems, consensus
Hermes
Hermes: a fault-tolerant replication protocol, implemented over RDMA, guaranteeing linearizability and achieving low latency and high throughput.
Stars: ✭ 105 (+169.23%)
Mutual labels:  distributed-systems, replication
Swim Js
JavaScript implementation of SWIM membership protocol
Stars: ✭ 135 (+246.15%)
Mutual labels:  distributed-systems, consensus
Library
Collection of papers in the field of distributed systems, game theory, cryptography, cryptoeconomics, zero knowledge
Stars: ✭ 100 (+156.41%)
Mutual labels:  distributed-systems, consensus
Atomix
A reactive Java framework for building fault-tolerant distributed systems
Stars: ✭ 2,182 (+5494.87%)
Mutual labels:  distributed-systems, consensus
Distributed-Algorithms
Implementations of various distributed algorithms in Go
Stars: ✭ 53 (+35.9%)
Mutual labels:  distributed-systems, paxos
paxakos
Rust implementation of Paxos consensus algorithm
Stars: ✭ 79 (+102.56%)
Mutual labels:  consensus, paxos
Raft-Paxos-Sample
MIT 6.824 implementations of distributed consensus algorithms: Raft & Paxos
Stars: ✭ 37 (-5.13%)
Mutual labels:  consensus, paxos

Egalitarian Paxos

A pluggable implementation of the Egalitarian Paxos Consensus Protocol

Paxos is a protocol for solving consensus through state machine replication in an asynchronous environment with unreliable processes. It can tolerate up to F concurrent replica failures with 2F+1 total replicas. This consensus protocol is then extended with a stable-leader optimization into a replication protocol (commonly referred to as Multi-Paxos) that assigns a global, persistent, total order to a sequence of client updates. The protocol works by having multiple replicas work in parallel to maintain the same state. This state is updated on each client request by every replica, allowing it to be automatically replicated and preserved even in the case of failures. The basic algorithm was famously described by Leslie Lamport in his 1998 paper, The Part-Time Parliament. It was later clarified in his follow-up paper from 2001, Paxos Made Simple.
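
To make the F / 2F+1 relationship concrete, the following self-contained Go sketch (not part of the library) computes the majority quorum size and the number of tolerable failures for a few cluster sizes:

package main

import "fmt"

// majority returns the number of replicas that must agree for a cluster of
// n replicas to reach consensus (a simple majority quorum).
func majority(n int) int { return n/2 + 1 }

// tolerableFailures returns how many concurrent replica failures a cluster
// of n replicas can survive while still being able to form a majority.
func tolerableFailures(n int) int { return (n - 1) / 2 }

func main() {
    for _, n := range []int{3, 5, 7} {
        fmt.Printf("replicas=%d quorum=%d tolerated failures=%d\n",
            n, majority(n), tolerableFailures(n))
    }
}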

Egalitarian Paxos is an efficient, leaderless variation of this protocol, proposed by Iulian Moraru in his 2013 paper, There Is More Consensus in Egalitarian Parliaments. It provides strong consistency with optimal wide-area latency, perfect load-balancing across replicas (both in the local and the wide area), and constant availability for up to F failures. Concretely, it provides the following properties:

  • High throughput, low latency
  • Constant availability
  • Load distributed evenly across all replicas (no leader)
  • Limited by fastest replicas, not slowest
  • Can always use closest replicas (low latency)
  • 1 round-trip fast path

It does so by breaking the global command slot space into subspaces, each owned by a single replica. Replicas then attach ordering constraints to each command while voting on them to allow for proper ordering during command execution. For more intuition on how this works, check out the presentation given at SOSP '13, and for a full technical report and proof of correctness of the protocol, check out A Proof of Correctness for Egalitarian Paxos.
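
As a rough illustration of that structure, the following conceptual Go sketch (the type names here are illustrative, not the library's actual types) shows how each command slot is named by the replica that owns its subspace plus a per-replica sequence number, and how an instance carries the ordering constraints (dependencies) recorded while replicas vote on it:

package main

import "fmt"

// InstanceID names one slot in the global command space. Each replica owns
// the subspace of IDs with its own Replica value, so replicas can propose
// commands concurrently without coordinating on slot allocation.
type InstanceID struct {
    Replica  int
    Sequence int
}

// Instance holds a proposed command together with the ordering constraints
// (dependencies on other instances) attached while replicas vote on it.
type Instance struct {
    ID      InstanceID
    Command []byte
    Deps    []InstanceID // interfering instances that must be ordered before execution
}

func main() {
    // Replica 2 proposes its 7th command; during voting the quorum records
    // that it interferes with an earlier instance owned by replica 0.
    inst := Instance{
        ID:      InstanceID{Replica: 2, Sequence: 7},
        Command: []byte("SET x=1"),
        Deps:    []InstanceID{{Replica: 0, Sequence: 4}},
    }
    fmt.Printf("%+v\n", inst)
}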

This library is implemented with a minimalistic philosophy. The main epaxos package implements only the core EPaxos algorithm, with storage handling, network transport, and physical clocks left to the clients of the library. This minimalism buys flexibility, determinism, and performance. The design was heavily inspired by CoreOS's raft library.

Features

The epaxos package is a full implementation of the Egalitarian Paxos replication protocol. Features include:

  • Command replication
  • Command compaction
  • Persistence
  • Failure Recovery

Features not yet implemented:

  • Explicit Prepare Phase
  • Optimized Egalitarian Paxos (smaller fast path quorum)
  • Membership changes
  • Batched commands
  • Thrifty operation (see paper)
  • Quorum leases
  • Snapshots

Building

Run make or make test to run all tests against the library

Run make clean to clean all build artifacts

Run make check to perform linting and static analysis

Testing

The project comes with an automated test suite containing both unit tests, which exercise individual pieces of functionality within the EPaxos state machine, and larger network tests, which run a network of EPaxos nodes. The unit tests are scattered throughout the epaxos/*_test.go files, while the network tests are located in the epaxos/epaxos_test.go file.

To run all tests, run the command make test

Library Interface

The library is designed around the epaxos type, which is a single-threaded state machine implementing the Egalitarian Paxos consensus protocol. The state machine can be interacted with only through a Node instance, which is a thread-safe handle to an epaxos state machine.

Because the library pushes tasks like storage handling and network transport up to its users, these users have a few responsibilities. In a loop, the user should read from the Node.Ready channel and process the updates it contains. Each Ready struct will contain any updates to the persistent state of the node that should be synced to disk, messages that need to be delivered to other nodes, and any commands that have been successfully committed and are ready to be executed. The user should also call Node.Tick at a regular interval (probably via a time.Ticker).

Together, the state machine handling loop will look something like:

for {
    select {
    case <-ticker.C:
        // Advance the state machine's internal logical clock.
        node.Tick()
    case rd := <-node.Ready():
        // Deliver outbound protocol messages to the other replicas using
        // the user-provided network transport.
        for _, msg := range rd.Messages {
            send(msg)
        }
        // Apply commands that have been committed and are now safe to
        // execute in the order determined by EPaxos.
        for _, cmd := range rd.ExecutableCommands {
            execute(cmd)
        }
    case <-ctx.Done():
        return
    }
}

To propose a change to the state machine, first construct a pb.Command message. The pb.Command message contains both an arbitrary byte slice to hold client updates and additional metadata fields related to command interference. These metadata fields are the mechanism by which clients of the library express application-specific command interference semantics. pb.Commands operate in a virtual keyspace, and each command operates over a subset of this keyspace, which is expressed in the Span field. pb.Commands can also be specified as reads or writes using the Writing field. Two commands interfere when their Spans overlap and at least one of the commands is Writing.
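
A hedged sketch of what constructing an interfering pair of commands might look like follows; the Span and Writing fields are named above, but the payload field (Data) and the Span field names (Start, End) used here are illustrative assumptions and may differ from the generated pb package:

write := pb.Command{
    Data:    []byte("SET a=1"),                              // opaque client update
    Span:    pb.Span{Start: []byte("a"), End: []byte("b")},  // keys in [a, b)
    Writing: true,                                           // mutates its span
}
read := pb.Command{
    Data:    []byte("GET a"),
    Span:    pb.Span{Start: []byte("a"), End: []byte("b")},
    Writing: false,
}
// write and read interfere: their Spans overlap and at least one is Writing,
// so EPaxos will agree on a relative order between them before execution.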

After a pb.Command is constructed with the desired update and the necessary interference constraints, call:

node.Propose(ctx, command)

Once executable, the pb.Command will appear in the rd.ExecutableCommands slice.
