Application Development Using BoltDB

Abstract

We've been taught for decades that we need a complex database server to run our applications. However, these servers incur a huge performance hit for most queries and they are frequently misconfigured because of their operational complexity which can cause slowness or downtime. In this talk, I'll show you how to use a local, pure Go key/value store called BoltDB to build applications that are both simple and fast. We will see how development and deployment become a breeze once you ditch your complex database server.

Introduction

Software is too complex and too slow. We've seen the speed of CPUs increase by orders of magnitude in the past few decades yet our applications seem to require more hardware than ever. In the past several years I've used embedded key/value databases for my applications because of their simplicity and speed. Today I'm going to walk you through building a simple application using a pure Go key/value store I wrote called BoltDB.

What is a embedded key/value database?

Before we dive in, let's talk about what an embedded key/value store even is! "Embedded" refers to the database actually being compiled into your application instead of a database server which you connect to over a socket. A good example of an embedded database is SQLite.

However, SQLite is a relational database so let's talk about what "key/value" means next. Key/value databases are extremely simple. They map a "key", which is just a unique set of bytes, to a "value", which is an arbitrary set of bytes. It helps to think of this in terms of relational databases. Your "key" would be your primary key and your value would be the encoded row. In fact, most database servers utilize a key/value database internally to store rows. Essentially, the key/value database is just a persisted map.

Some key/value databases allow you to have multiple key/value mappings. In BoltDB, these are called "buckets". Every key is unique in a bucket and points to a value. Many times you can think of buckets like tables in a relational database. You may have a "users" bucket or a "products" bucket.

Just as there are many database servers to choose from, there are also many types of embedded key/value databases with different trade offs. Sometimes you'll trade write performance for read performance or you'll trade transactional safety for performance. For example, BoltDB is read optimized and supports fully serializable ACID transactions. This makes it good for many read- heavy applications that require strong guarantees.

Getting started with BoltDB

One of the best things about using BoltDB is that installation process is so simple. You don't need to install a server or even configure it. You just use "go get" like any other Go package and it'll work on Windows, Mac, Linux, Raspberry Pi, and even iOS and Android.

$ go get github.com/boltdb/bolt

Object encoding

One feature of relational databases that many of us take for granted is that they handle encoding rows into bytes on disk. Since Bolt only works with byte slices we'll need to handle that manually. Lucky for us, there are a LOT of options for object serialization.

In Go, one of the most popular serialization libraries is Protocol Buffers (also called "protobufs"). One implementation, called gogoprotobuf, is also one of the fastest. With protobufs, we declare our serialization format and then generate Go files for doing this quickly.

Let's take a look at an example application to see how we'd do this. This application shows how to do CRUD for a simple "User" data store but I've also built many other less traditional applications on Bolt such as message queues and analytics.

Domain types

In our app, we have a single domain type called "User" with two fields: ID & username. You can expand out to more types and nest objects but we'll stick with a single object to keep things simple.

I like to separate out my domain types from my encoding types by placing my encoding types in a subpackage called "internal". I do this for two reasons. First, it keeps the generated protobufs code separate. And second, the "internal" package is inaccessible from other packages outside our app and it's hidden from godoc.

Inside our protobuf definition we can see that it matches our domain type with a few exceptions. Since it's our binary representation we have to specify a size for our integer type. Also, you'll notice numbers on the right. These are essentially field IDs when it's encoded. When you add or remove fields you don't need to do a migration like with a relational database. You simply add a field with a higher number or delete a field.

Back in our store.go we can add code to generate our protobufs. This line calls the protobuf compiler, "protoc", and will generate to internal/internal.pb.go. If we look in there we can see it's a bunch of ugly generated code.

Our domain type will convert to and from this protobuf type but we can hide it all behind the encoding.BinaryMarshaler & encoding.BinaryUnmarshaler interfaces. Our MarshalBinary() simply copies our fields in and marshals them and the UnmarshalBinary() unmarshals the data and copies the fields out. This is a bit more work than in relational databases but it's easy to write, test, and migrate.

Initializing the store

Our Store type will be our application wrapper around our Bolt database. To open a bolt.DB, we simply need to pass in the file path and the file permissions to set if the file doesn't exist. This will use the umask so I typically set my permissions to 0666 and let users set the umask to filter that at runtime.

Once the database is open, we'll start a writable transaction. The Begin() function is what starts a transaction. The true argument means that it's a "writable" transaction. Bolt can have as many read transactions as you want but only one write transaction gets processed at a time. This means it's important to keep updates small and break up really large updates into smaller chunks. All transactions operate under serializable isolation which means that all data will be a snapshot of exactly how it was when the transaction started -- even if other write transactions commit while the read transaction is in process.

The deferred rollback can look odd since we want to commit the transaction at end. It's important in case you return an error early or your application panics. All transactions need rollback or commit when they're done or else they can block other operations later.

Within our transaction we call CreateBucketIfNotExists() to create our bucket for our users. This is similar to a "CREATE TABLE IF NOT EXISTS" in SQL. If the bucket doesn't exist then it's created. Otherwise it's ignored. Calling this during initialization means that we won't have to check for it whenever we use the "users" bucket. It's guaranteed to be there.

Finally, we commit our transaction and return the error, if any occurred while saving. Bolt does not allow partial transactions so if a disk error occurs then your entire transaction will be rolled back. The deferred rollback that we called earlier will be ignored for this transaction since we have successfully committed.

Closing the store is a simple task. Simply call Close() on the Bolt database and it will release it's exclusive file lock and close the file descriptor.

Creating a user

Now that we have our database ready, let's create a user. In our CreateUser() method, we'll start by creating a writable transaction just like we did before. Then we'll grab the "Users" bucket from the transaction. We don't need to check if it exists because we created it during initialization.

Next, we'll create a new ID for the user. Bolt has a nice feature called sequences which are transactionally safe autoincrementing integers for each bucket. Whenever we call NextSequence(), we'll get the bucket's next integer. Once we grab it, we assign it to our user's ID.

Now our user is ready to be marshaled. We call MarshalBinary() and we get a set of bytes which represents our encoded user. Easy peasy!

Since Bolt only works with bytes, we'll need to convert our ID to bytes. I recommend using the binary.BigEndian.PutUint64() function for this. I use big endian because it will sort our IDs in the right order.

[show big endian vs little endian on slide]

We'll use the bucket's Put() method to associate our encoded user with the encoded ID. Then we'll commit and our data is saved.

Retrieving the user

Creating a user is just a matter of converting objects to bytes so retrieving a user is simply converting bytes to objects. In our User() method we'll start a transaction but this time we'll pass in false to specify a read-only transaction. Again, read-only transactions can run completely in parallel so this scales really well across multiple CPU cores.

Once we have our transaction, we call Get() on our bucket with an encoded user ID and we get back the encoded user bytes. We can call UnmarshalBinary() on a new User and decode the data. If the encoded bytes comes back as nil then we know that the user doesn't exist and we can simply return a nil user.

Retrieving multiple users

Reading one user is good but many times we want to return a list of all users. For this we'll need to use a Bolt cursor. A cursor is simply an object for iterating over a bucket in order. It has a handful of methods we can use to move forward, back or even jump around.

In our Users() method we'll grab a read-only transaction and a cursor from our bucket. Then we'll iterate over every key/value in our bucket. We can collapse it all into a simple "for" loop where we call First() at the beginning and then Next() until we receive a nil key. For each value, we'll unmarshal the user and add it to our slice.

If you need reverse sorting, you can call Last() and then Prev() on the cursor. You can also use the Seek() method to jump to a specific spot. For example, if we wanted to do pagination we could pass in an "options" object into the method and have an offset and limit.

Updating a user

Now that we've created a user, let's update it. Let's look at the SetUsername() method. This time we'll mix it up and use the Update() method instead of Begin(). This method works just like Begin(true) except that it executes a function in the context of the transaction. If the function returns nil then the transaction commits. Otherwise if it returns an error or panics then it will rollback the transaction.

First we'll retrieve our user by ID and unmarshal. In this case we're combining the Get() and UnmarshalBinary() into a compound if block. I find it easier to read if I group these related types of calls together. Next we simply update the username on our user we just unmarshaled.

Now that we have our updated user object, we can simply remarshal it and overwrite the previous value by calling Put() again.

Deleting a user

Finally, the last part of our CRUD store is the deletion. Delete is incredibly simply. Simply call the Delete() method on the bucket. That's it!

Bolt in Practice

That was the basics of doing CRUD operations with Bolt and we can talk about more advanced use cases in a minute but first let's look at what running Bolt in production looks like.

Internally, Bolt structures itself as a B+tree of pages which requires a lot of random access at the file system so it's recommended that you run Bolt on an SSD. Other embedded databases such as LevelDB are optimized for spinning disks.

Bolt also maps the database to memory using a read-only mmap() so byte slices returned from buckets cannot be updated (or else it will SEGFAULT) and the byte slices are only valid for the life of the transaction. The memory map provides two amazing benefits. First, it means that data is never copied from the database. You're accessing it directly from the operating system's page cache. Second, since it's in the OS page cache, your hot data will persist in memory across application restarts.

Backup & restore

From an operations standpoint, Bolt just uses a single file on disk so it's simple to manage. However, as a library, there's not a standard CLI command to backup your database but Bolt does provide a great option.

Transactions in Bolt implement the io.WriterTo interface which means they can copy an entire snapshot of the database to an io.Writer with one line of code. Depending on your application you may wish to provide an HTTP endpoint so you can curl your backups or you can build an hourly snapshot to Amazon S3.

Another option in the works is a streaming transaction log so that you can attach a process over the network to be an async replica. This is similar to how Postgres replication works. This is still early in development though and is not currently available.

Performance

While performance is not a primary goal of Bolt, it is an important feature to talk about. Bolt is read optimized so if your workload regularly consists of tens of thousands of writes per second then you may want to look at write optimized databases such as LevelDB.

Benchmarks

Benchmarks are not typically very useful but it's good to know a ballpark of what a database can handle. I typically tell people the following on a machine with an SSD. Expect to get up to 2,000 random writes/sec without optimization. If you're bulk loading and sorting data then you can get up to 400K+ writes/sec. Typically you're throttled by your hard drive speed.

On the read side, it depends on if your data is in the page cache. Typical CRUD applications have maybe 10-20% of their data hot at any given time. That means that if you have a 20GB database then 2 - 4GB is hot and that will be resident in memory assuming you have that much RAM. For hot data, you can expect a read transaction and a single bucket read to take a 1-2µs. If you're iterating over hot data then you can see speeds of 20M keys/second. Again, all this data is in the page cache and there's no data copy so it's really fast.

If your data is not hot then you'll be limited by the speed of your hard drive. Expect reads of cold data on an SSD to take hundreds of microseconds or a few milliseconds depending on the size of your database.

Scaling with Bolt

One of the biggest criticisms of embedded databases is that they are embedded and don't have a means of scaling or clustering. This is true, however, it's not necessarily a reason to avoid embedded databases. There are several strategies that can be used to obtain the safety required for your application while still using an embedded database.

Vertical scaling

The easiest way to scale is still, by far, by scaling vertically. If your bottleneck is with read queries then simply adding more CPU cores will scale your Bolt application near linearly. This answer may sound too simplistic but that's the beauty of it. It's really simple.

Horizontal scaling via sharding

Many applications -- especially SaaS applications -- can be partitioned by users or accounts. This means that you can either assign each account to its own database or use strategies like consistent hashing to group accounts into a number of partitions. This will allow you to add additional machines and rebalance the load of your application.

Data integrity in the face of catastrophic failure

Until streaming replication is ready for Bolt, windows of data loss are still an issue that needs consideration when building with an embedded database. Many cloud providers provide highly safe, redundant storage but even those can fail every great once in a while.

For some applications, a daily backup may be reliable enough. For others, a standby machine taking snapshots every 10 minutes may be required. Many applications store critical data such as finances in a third party service like Stripe so a 10 minute window may be acceptable.

Again, a Bolt database is simply a file and can be copied extremely quickly. Expect a copy to take 3-5 seconds per gigabyte on an SSD.

Conclusion

I believe that local key/value databases meet the requirements for many applications while providing a simple, fast development experience. I have shown you how to get up and running with Bolt and build a simple CRUD data store. The code is not just straightforward but also incredibly fast and can vertically scale for many workloads. Please consider using a local key/value database in your next project!

Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

benbjohnson / application-development-using-boltdb

Programming Languages