All Projects → greglook → vault

greglook / vault

Licence: Unlicense license
Content-addressable data storage system

Programming Languages

clojure
4091 projects

NOTE: I've stopped working on Vault in favor of splitting the pieces out into smaller, composable libraries. This will also improve compatibility with other similar systems, especially IPFS. The current pieces are:

The remainder of the README is preserved here for posterity.

Vault

Build Status Coverage Status Dependency Status

Vault is a content-addressable, version-controlled, hypermedia datastore which provides strong assertions about the integrity and provenance of stored data. This is heavily inspired by the following projects:

Vault does not aim to be (directly) compatible with any of these, though many of the ideas are similar. Why use a new data storage system? See some comparisons to other systems.

System Layers

This section provides a quick tour of the concepts in Vault. The system is broken into several layers with tightly-scoped domains.

Blob Storage

At the lowest level, Vault is built on content-addressable storage. Data is stored in blobs, which are byte sequences identified by a cryptographic hash of their contents. The combination of a hash algorithm and the corresponding digest is enough information to securely and uniquely identify a blob. These hash-ids are formatted like a URN:

sha256:2f72cc11a6fcd0271ecef8c61056ee1eb1243be3805bf9a9df98f92f7636b05c

A blob store is a system which saves and retrieves blob data. Blob stores support a very simple interface; they must store, retrieve, and enumerate the contained blobs. The simplest type of blob storage is a hash map in memory. Another simple example is a store backed by a local file system, where blobs are stored as files.

Structured Data

Blob content is parsed and classified in the data layer. There are three general classes of blobs: data blobs, key blobs, and raw blobs.

Vault represents structured data using EDN values. Data blobs are recognized by the header tag #vault/data as the first line of text in the blob. An example data blob representing a file might look like this:

#vault/data
{:name "foo.clj"
 :content #bytes/raw #vault/blob "sha256:461566632203729fe8e1c6f373e53b5618069817f00f916cceb451853e0b9f75"
 ...}

Blob references through hash-ids provide a consistent way to link to immutable data, so it is simple to build data structures which automatically deduplicate shared data. These are similar to Clojure's persistent collections; see the schema for hierarchical byte sequences for an example.

Link Paths

Structured data in Vault can be linked to other data by providing a vector of path keys and their corresponding hash identifier links in the :vault/links attribute. This provides a generic way to address tree-like data structures.

If blob A links to blob B with the "foo" key, then the uri sha256:<hash-of-A>/foo will resolve to blob B. Similarly, if blob B links to C as "bar", and C links to D as "baz", then the following URIs all resolve to the same blob:

sha256:<hash-of-A>/foo/bar/baz
sha256:<hash-of-B>/bar/baz
sha256:<hash-of-C>/baz
sha256:<hash-of-D>

See the path traversal doc for more details on how this is accomplished.

Identity and Mutable State

PGP public keys establish identity in Vault. The hash-id of these key blobs provides a secure identifier for a mutable reference. Each identity may be bound to a value by transaction blobs which are signed by the corresponding private key. This allows Vault to represent mutable data as a history of immutable values, similar to a Clojure reference type.

Signatures are provided as secondary values in a transaction blob, following the primary value:

{:key #vault/blob "sha256:461566632203729fe8e1c6f373e53b5618069817f00f916cceb451853e0b9f75"
 :signature #pgp/signature #bytes/bin "iQIcBAABAgAGBQJSeHKNAAoJEAadbp3eATs56ckP/2W5QsCPH5SMrV61su7iGPQsdXvZqBb2LKUhGku6ZQxqBYOvDdXaTmYIZJBY0CtAOlTe3NXn0kvnTuaPoA6fe6Ji1mndYUudKPpWWld9vzxIYpqnxL/ZtjgjWqkDf02q7M8ogSZ7dp09D1+P5mNnS4UOBTgpQuBNPWzoQ84QP/N0TaDMYYCyMuZaSsjZsSjZ0CcCm3GMIfTCkrkaBXOIMsHk4eddb3V7cswMGUjLY72k/NKhRQzmt5N/4jw/kI5gl1sN9+RSdp9caYkAumc1see44fJ1m+nOPfF8G79bpCQTKklnMhgdTOMJsCLZPdOuLxyxDJ2yte1lHKN/nlAOZiHFX4WXr0eYXV7NqjH4adA5LN0tkC5yMg86IRIY9B3QpkDPr5oQhlzfQZ+iAHX1MyfmhQCp8kmWiVsX8x/mZBLS0kHq6dJs//C1DoWEmvwyP7iIEPwEYFwMNQinOedu6ys0hQE0AN68WH9RgTfubKqRxeDi4+peNmg2jX/ws39C5YyaeJW7tO+1TslKhgoQFa61Ke9lMkcakHZeldZMaKu4Vg19OLAMFSiVBvmijZKuANJgmddpw0qr+hwAhVJBflB/txq8DylHvJJdyoezHTpRnPzkCSbNyalOxEtFZ8k6KX3i+JTYgpc2FLrn1Fa0zLGac7dIb88MMV8+Wt4H2d1c"
 :vault/type :vault/signature}

Search Indexing

Another important component of the system is a set of indexes of the data stored in Vault. Indexes can be thought of as a sorted list of tuples. Different indexes will store different subsets of the blob data.

Groups of indexes are collected together into a catalog. The two main catalogs in Vault are the blob graph and the database.

Applications

At the top level, applications are built on top of the data layer. An application defines semantics for a set of data types. Some example usages:

  • Snapshot filesystems for backup, taking advantage of deduplicated blobs to store only incremental changes.
  • Archive messages such as email, chat, and social media.
  • Store and flexibly organize media such as music and photos.
  • Maintain personal time-series data for Quantified Self tracking.

One significant advantage of building on a common data layer is the ability to draw relations between many different kinds of data. Information from a variety of systems can be correlated into more meaningful, higher-level aggregates.

Usage

To get started working with Vault, the command-line tool is the simplest interface. After initializing some basic configuration, you can use the tool to explore the contents of the blob store. Use -h --help or help to show usage information for any command. General usage is similar to git, with nested subcommands for various types of actions.

See the usage docs for more information. Please keep in mind that this software is still experimental and unstable!

License

This is free and unencumbered software released into the public domain. See the UNLICENSE file for more information.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].