We've been running Vault for several years, and use it extensively for secret management and credential issuance. We've always configured Vault with the Cassandra storage backend. Cassandra is what we use for persistence across the Monzo stack, so when we first adopted Vault we decided to write a storage backend for Cassandra and contribute it to the Vault project. You can read more about how we use Vault at Monzo here.
For a long time this worked without issue. But we've recently had scaling issues as we've begun to use Vault more intensively. So we decided to move to the S3 backend. But it was critical we did this without Vault being unavailable.
The Cassandra backend wasn't scaling well
We designed the data model of the Cassandra backend to avoid indexes, as these are a feature of Cassandra that we generally avoid for performance reasons.
Instead, it stored a Vault storage entry many times, with one row in Cassandra for every segment of the path you're storing. For example, the key sys/expire/id/foo would have rows with a bucket field set to each parent folder of the path: sys, sys/expire, and sys/expire/id. This lets you list all the paths that exist in a folder by providing a bucket like sys/expire:
First, look up all keys in the bucket sys/expire.
You'll get a list of entries like sys/expire/id/foo.
Trim this down to sys/expire/id (representing a subfolder).
The problem with this approach is that in some cases, Vault needs to list the contents of very large folders. In fact, it'll occasionally do a recursive search: find subfolders of sys, then find subfolders of those folders, and so on. In the Cassandra backend, this essentially means listing every entry under sys several times over. In practice, this is fairly likely to time out for large folders.
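To make that cost concrete, here's a toy Go sketch of the data model described above. The types and names are ours for illustration, not the real backend's schema: listing a folder scans every row in its bucket, and a recursive walk re-scans entries once per level of nesting.

```go
package main

import (
	"fmt"
	"strings"
)

// store is a toy model of the "one row per path segment" layout:
// bucket -> full keys stored under that bucket.
type store struct {
	rows map[string][]string
}

func newStore() *store { return &store{rows: map[string][]string{}} }

// put writes one row for each parent folder of the key.
func (s *store) put(key string) {
	parts := strings.Split(key, "/")
	bucket := ""
	for _, p := range parts[:len(parts)-1] {
		if bucket != "" {
			bucket += "/"
		}
		bucket += p
		s.rows[bucket] = append(s.rows[bucket], key)
	}
}

// list returns the immediate children of a folder: look up every key
// in the bucket, then trim each one down to the next path segment.
// Subfolders keep a trailing slash.
func (s *store) list(bucket string) []string {
	seen := map[string]bool{}
	var out []string
	for _, key := range s.rows[bucket] {
		rest := strings.TrimPrefix(key, bucket+"/")
		if i := strings.Index(rest, "/"); i >= 0 {
			rest = rest[:i+1] // a subfolder, e.g. "id/"
		}
		if !seen[rest] {
			seen[rest] = true
			out = append(out, rest)
		}
	}
	return out
}

// walk recursively lists every folder, the way a recursive directory
// search does, and counts the rows scanned. Each entry under the top
// bucket is scanned once per level of nesting below it.
func (s *store) walk(bucket string) int {
	scanned := len(s.rows[bucket])
	for _, child := range s.list(bucket) {
		if strings.HasSuffix(child, "/") {
			scanned += s.walk(bucket + "/" + strings.TrimSuffix(child, "/"))
		}
	}
	return scanned
}

func main() {
	s := newStore()
	s.put("sys/expire/id/foo")
	s.put("sys/expire/id/bar")
	fmt.Println(s.list("sys/expire"))
	fmt.Println(s.walk("sys")) // 2 entries, but 6 rows scanned
}
```

With just two entries three folders deep, a recursive walk from sys scans six rows; the repeated work grows with both the number of entries and the depth of the tree.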
We contributed an improvement which makes this more efficient, but it would only buy us time. Eventually we'd scale to the point of problems again.
Timing out caused other problems
Sadly, Vault does a recursive directory search at a critical time, when a node becomes leader. An unsealed node, when it obtains leadership, needs to list all the 'lease' objects from storage so it can build in-memory expiration timers for tokens and other leases.
These objects are stored under sys/expire/id, and when you have a lot of clients (as we do), you can have thousands of storage entries to list. This folder also has a ton of churn, which makes the problem even worse: we're removing entries constantly, which leads to a lot of tombstones that Cassandra still has to scan over.
Vault is only interested in listing the filenames in that folder, and in some storage backends this would be very cheap to retrieve. But for Cassandra under this data model, this is an incredibly expensive operation: a large set of rows distributed over many nodes is listed repeatedly. If the operation times out, then the Vault node seals. This is because Vault can't function without these leases being read. Tokens wouldn't expire, which would be a major security problem.
This means on a perfectly normal leadership election, a node becoming leader might suddenly seal. And then another node will become leader, and that node might seal too. We've seen this continue until the entire cluster is down.
As long as this bug was possible, we essentially didn't have a failover Vault node. We tried to mitigate the issue with aggressive Cassandra compaction, but eventually realised that the query was always going to be dangerously expensive.
We were thinking about migrating anyway
As Vault was becoming a more critical component, and in particular as we started to use it to generate Cassandra credentials, we grew more concerned about the tight coupling between Vault and Cassandra. When Vault's down, Cassandra clients will eventually stop working. When Cassandra's down, Vault will stop working. It's not quite a circular dependency, but it's a little too close for comfort.
So we started to think about what a low-maintenance, highly available storage backend would look like. Given the issues we were seeing, we decided to pause the Cassandra work until we could be more confident in Vault's availability.
We looked closely at Consul, etcd, RDS Postgres and S3. We already use etcd to help Vault elect a leader, so this seemed like a sensible choice. But S3 was also really attractive, as it'd make it near impossible to lose data and would need zero management. Consul is Vault's only officially supported backend, but it's not something we had a lot of experience running. Postgres is something we felt confident running on AWS RDS, but we had concerns about unavoidable (brief) downtime in certain upgrade scenarios. In practice, given Vault's aggressive caching, that might not have been an issue.
One of the only concerns we had about S3 was consistency. Technically S3 doesn't give you read-your-own-write consistency for update and delete operations. But because of Vault's aggressive caching, reading your own write is really only necessary around a failover. Updates and deletes are also fairly rare, and generally only happen on lease renewal or expiry. Our usage is mostly just a lot of reads, with a few first-time writes (which are consistent in S3). As a result, we decided consistency wasn't a big concern for our use, and that the simplicity of S3 – especially its data model, which is perfect for Vault – made it a great choice.
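For reference, pointing Vault at S3 only takes a small storage stanza in its configuration, with leader election still handled separately. This is an illustrative sketch, not our real config: the bucket name, region and etcd address are placeholders.

```hcl
# Illustrative only: names and addresses are placeholders.
storage "s3" {
  bucket = "our-vault-data"
  region = "eu-west-1"
}

# S3 can't do leader election, so HA coordination can stay in etcd
# via Vault's separate ha_storage stanza.
ha_storage "etcd" {
  address    = "https://etcd.internal:2379"
  ha_enabled = "true"
}
```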
The official migration path involved downtime
Vault ships with a migration tool, but it's meant to be run while Vault is down. Then you're supposed to start Vault back up pointing at the new storage backend. This guarantees data consistency, which makes a lot of sense, but we weren't super happy about Vault downtime. This doesn't cause many issues that'd affect Monzo customers. But there are a few minor ones, and in general if we can figure out a way to roll something out online, we'll try to do that.
One approach we considered was to 'clone' the Vault cluster: populate S3 with all the storage entries that actually configure Vault, but ignore entries pertaining to tokens or anything else that's ephemeral. Then we just needed to slowly move clients across to the new backend. It went like this:
Shut down our service that lets engineers write new secrets to Vault. From this moment on, we can consider Vault's human-supplied configuration to be static - although token creations and renewals continue.
Copy over the Vault storage to S3, excluding a few folders that we didn't need (the ones that relate to login sessions).
Unseal a new Vault cluster that points to S3, with a new DNS name.
Update our configuration service so that newly started applications see Vault as being at the new DNS name.
Slowly evict all Kubernetes pods that use Vault, so their replacements log in to the new cluster.
Write a Kibana alert to ensure that no requests come into the old Vault cluster.
The whole process took about a day. We needed to modify Vault's migration tool to not write a lock into the old backend, as we didn't want the existing Vault cluster to be affected in any way. We also modified it to let us exclude certain paths. You can see the changes we made here.
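For context, the stock tool is driven by a small config file naming a source and a destination backend, and run with vault operator migrate. A sketch of what that looks like (all the parameter values here are placeholders, not our real settings):

```hcl
# migrate.hcl -- illustrative placeholders only
storage_source "cassandra" {
  hosts    = "cassandra.internal"
  keyspace = "vault"
}

storage_destination "s3" {
  bucket = "our-vault-data"
  region = "eu-west-1"
}
```

You'd then run `vault operator migrate -config=migrate.hcl`. Our changes sat on top of this: skipping the lock and filtering out the paths we didn't want copied.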
Overall, we managed to migrate Vault without much work and without any downtime. And we're a lot more confident in our availability now that a leadership election can't cause a cascading outage. Given that, we now feel confident enough to continue migrating Cassandra clients to use Vault credentials.