All Projects → josephg → Text Crdt Rust

josephg / Text Crdt Rust

Programming Languages

rust
11053 projects

Text CRDT Prototype

This is a prototype of a simple high performance CRDT for text. Its loosely based off automerge.

I wrote this because I want to have a simple data model to benchmark, to see how it performs. I suspect it'll be very fast once its working correctly - which would be a bit of a game changer

Each client / device has a unique ID, and each character typed on each device is assigned an incrementing sequence number (starting at 0). Each character in the document can thus be uniquely identified by the tuple of (client ID, sequence number). This allows any location in the document to be uniquely named, so an operation can then be defined by (preceeding character's (client id, sequence number), inserted string, client id, client seq).

Internals

We want the system to go fast at two main operations:

  • Text edit to CRDT operation (Eg, "user A inserts at position 100" -> "user A seq 1000 inserts at (B, 50)")
  • CRDT operation to text edit ("user A seq 1000 inserts at (B, 50)" -> "insert at document position 100")

Actually inserting text into a rope (or something) is reasonably easy, and ropey and xi-rope both seem very high performance. So in this library I'm worrying about the P2P operation boundary.

Internally, each client ID is mapped to a local integer. These integers are never sent over the wire, so it doesn't matter if they aren't common between peers.

Then we have two main data structures internally:

  • A modified B-Tree of the entire document in sequence order. Entries are run encoded - which is to say, sequences of (client: 1, seq: 100), (client: 1, seq: 101), (client: 1, seq: 102), ... are collapsed into entries of (client: 1, seq: 100, len: 5). Deleted characters are marked with a negative length. The tree's internal nodes store subtree sizes, so we can map from an entry in the tree to a document position (and back again) in O(log n) time.
  • Per client operation lists. These are simply arrays of pointers mapping each sequential sequence number to the leaf node in the tree containing that item. These lists could also be run-length encoded - which would save memory, but then we'd need to binary search to find any given location instead of doing a simple array lookup.

Optimizations to perform

  • [ ] Cache the last

Progress

  • [x] Basic btree structure implemented
  • [x] Check function to verify structure integrity
  • [x] Doc Location -> CRDT location name
  • [x] CRDT location -> Doc location
  • [x] Insert text
  • [ ] Remove text
  • [ ] Cleanup
  • [ ] Cache last location for each client
  • [ ] Tests
  • [ ] Benchmarks
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].