Argo Rollouts at scale: Bringing Automated Rollbacks to 2,100+ services at Monzo


🗣️ This blog post is a complement to a talk we gave at ArgoCon 2022. You can watch the recording of the talk for more information.

We’ve massively invested in the microservices philosophy and ship small and often. We have over 2,100 services and deploy over 100 times per day. In fact, we deployed to production more than 27,000 times in 2021 🚀

We have great tooling and a fast deployment pipeline, yet our deployment strategy still trailed modern best practice. We were too reliant on testing, alerting, and human discipline, and we wanted to raise the bar.

We introduced automated rollbacks to our platform using Argo Rollouts and made it the default deployment strategy for all our services. This helps us roll out changes safely and catch a wide class of problems related to deploying new code before they have much impact. This means customers will see fewer issues and our engineers will spend less time worrying about deployments and more time delivering value.

In this post, we’ll focus on how we implemented and rolled it out (pun intended!) to all of our services!

We didn’t want engineers to babysit deployments

At Monzo, engineers trigger deployments by running a CLI command. We realised that although we do have a lot of mechanisms to catch bad changes at multiple levels, we relied on alerts being routed to engineers to react to potentially bad deployments. This was less than optimal because:

  • Humans notoriously miss things; discipline is not a good substitute for automation

  • It was time-consuming and added cognitive load to the act of deploying, which was the very thing we wanted to make as frictionless as possible

  • Rollouts were "all or nothing": provided the pods started, a bad release could reach every pod before an engineer was alerted

Longer term we would like to eliminate manually-triggered deployments and instead trigger all deployments from merges to our main git branch (also known as “GitOps”). We believe automated rollbacks are a requirement for moving to a GitOps deployment model because the deployment flow is more asynchronous, which makes it even less appropriate for engineers to babysit deployments.

What we needed

Ultimately we needed a system to detect bad deployments and automatically roll back. Alongside this, we had some important additional requirements.

We wanted the new system to be as transparent as possible to engineers. At Monzo we pride ourselves on our simple and opinionated tooling. We provide tools for managing deployments that are specialised to our services and operate at a higher level of abstraction than the backing systems, such as Kubernetes. We wanted to preserve this with automated rollbacks, so whichever technology we picked had to be flexible enough to integrate deeply with our existing tooling (e.g. our command line tools for managing deployments, and Slack).

We wanted the automated rollback system to be based on Prometheus metrics. Our engineers are already familiar with Prometheus metrics and alerts, so we wanted automated rollbacks to be configured in a similar way. We have a number of generic alerts that apply to all services, so we could get significant coverage with low effort by configuring rollback rules to trigger on the same thresholds as our generic alerts. We also wanted to enable engineers to configure their own service-specific rollback rules in the future.

We wanted a system that allowed us to easily add new deployment strategies later. Our initial release would just deliver automated rollbacks, but we wanted something that could be extended to support progressive rollouts in the future, i.e. where we could automatically roll out a change to increasing subsets of traffic, all protected by automated rollback if errors were detected.

How we implemented automated rollback using Argo Rollouts

We weren’t fundamentally opposed to building support for this into our existing custom deployment system, but we decided to pick Argo Rollouts because it satisfied all of our requirements and is becoming increasingly widely adopted within the Kubernetes community.

Argo Rollouts is a Kubernetes controller that supports multiple, highly configurable deployment strategies, including full progressive rollouts. Despite this flexibility, we initially just wanted a basic canary deployment strategy that quickly rolled out all pods, followed by an analysis of metrics and a potential automated rollback. This approach allowed us to deliver the user-facing changes incrementally.
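To make that concrete, here is a rough sketch of what such a Rollout could look like. It's illustrative rather than our real configuration: the service name, image, replica count and analysis template name are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: service.example
spec:
  replicas: 6
  selector:
    matchLabels:
      app: service.example
  template:
    metadata:
      labels:
        app: service.example
    spec:
      containers:
        - name: service
          image: registry.example.com/service:abc123
  strategy:
    canary:
      steps:
        # Move all pods over to the new version straight away...
        - setWeight: 100
        # ...then run the analysis. If it fails, the rollout is aborted
        # and the previous (stable) version is scaled back up.
        - analysis:
            templates:
              - templateName: generic-rollback-rules
```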

The new deployment flow with Argo Rollouts looks like this:

This flowchart describes the deployment flow for Argo Rollouts

Argo Rollouts has built-in support for Prometheus metrics when it performs an analysis on a deployment. We configured a global set of rules that trigger rollbacks based on the same metrics and thresholds we use for our generic alerts (an example rule is sketched after this list). Currently these include:

  • The total RPC error rate is >20%

  • Any panics / crashes

  • Large spikes in database errors
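As an illustration, a rule like the RPC error rate check could be expressed as an AnalysisTemplate with a Prometheus provider along these lines. The metric names, Prometheus address and the phase label (added by the Ephemeral Metadata feature described below) are assumptions for the sketch, not our production rules.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: generic-rollback-rules
spec:
  metrics:
    - name: rpc-error-rate
      interval: 30s      # re-evaluate every 30 seconds...
      count: 10          # ...for ten measurements (roughly five minutes)
      failureLimit: 1    # more than one failed measurement aborts the rollout
      # Pass if there is no traffic yet, or the error rate is below 20%.
      successCondition: len(result) == 0 || result[0] < 0.20
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(rpc_errors_total{phase="canary"}[2m]))
            /
            sum(rate(rpc_requests_total{phase="canary"}[2m]))
```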

We needed to be able to differentiate between new and old pods in a deployment in our Prometheus queries. We achieved this by tagging the pods and their metrics with a phase label using Argo Rollouts’ Ephemeral Metadata feature.
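The ephemeral labels are declared on the canary strategy itself. Assuming we call the label phase, the fragment that slots into the strategy shown earlier would look roughly like this:

```yaml
strategy:
  canary:
    # New pods are labelled phase=canary for the duration of the rollout
    # and re-labelled phase=stable once they are promoted, so analysis
    # queries can isolate the metrics from the new version.
    canaryMetadata:
      labels:
        phase: canary
    stableMetadata:
      labels:
        phase: stable
```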

Engineers now trigger deployments from the command line in much the same way as before. The biggest change to the user experience is the shift away from a synchronous deployment process. Whereas previously the CLI tooling would block until a deployment was complete, it can now take more than five minutes for all of the analyses to run. We needed to rethink how we notified engineers of results.

We chose to use Slack to surface these notifications and we used the Argo Rollouts Notifications feature for this. Specifically, we use the generic webhook integration to forward notifications to a custom service, which enriches them and forwards them to Slack.
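As a rough illustration (not our exact setup), the generic webhook integration is configured through the Argo Rollouts notifications ConfigMap; the notifier service name and URL below are hypothetical.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argo-rollouts-notification-configmap
data:
  # A generic webhook "service" pointing at a hypothetical internal
  # notifier, which enriches the event and forwards it to Slack.
  service.webhook.deploy-notifier: |
    url: http://deploy-notifier.platform/argo-rollouts
  # What to POST when a rollout is aborted.
  template.rollout-aborted-webhook: |
    webhook:
      deploy-notifier:
        method: POST
        body: |
          {
            "rollout": "{{.rollout.metadata.name}}",
            "namespace": "{{.rollout.metadata.namespace}}",
            "event": "aborted"
          }
  # Fire the webhook when a rollout ends up aborted (i.e. rolled back).
  trigger.on-rollout-aborted: |
    - send: [rollout-aborted-webhook]
      when: rollout.status.phase == 'Degraded'
```

Rollouts then subscribe to the trigger with an annotation along the lines of notifications.argoproj.io/subscribe.on-rollout-aborted.deploy-notifier: "".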

The notifications looked like this:

Diagram showing the deployment started Slack message and two kinds of followup message: a successful case and a failure case.

Installing and configuring the system was relatively straightforward. The big challenge was the migration!

Getting to a proof of concept stage was quick; migrating all 2,100+ microservices was not. Each service required a new Kubernetes resource (a Rollout), and this wasn’t something we felt comfortable doing as a single one-off process. What if we ran into scaling limits of Argo Rollouts once we reached a certain number of Rollout objects? We needed to be more cautious. We needed:

  • a process to gradually and automatically migrate services

  • a process to revert a migration for a given service if required

  • support in our existing tooling for both migrated and un-migrated services

So we built a new migrator service to migrate services in batches every day during the migration period, first in our staging environment and then at a slower cadence in production. All of our services have a “tier” which defines their criticality; we used this information to migrate the least critical services first.

Engineers enqueued services to migrate, and a regular cron job then triggered the actual migration. By using a queue to drive this, we could take advantage of retries, and could also reuse the same pipeline to revert migrations if we needed to.

A description of the Migrator Pattern we use at Monzo. A cron triggers a Migrator Service and we use a queue to gradually roll it out

We’ve successfully used this migrator pattern for a number of migrations at Monzo.

Normally Argo Rollouts requires Rollout objects to be used instead of vanilla Kubernetes Deployments. This makes migration (and reverting a migration) tricky because it’s destructive: you have to replace one resource with another. Luckily we could avoid this problem by using an Argo Rollouts feature built specifically for migrations, which lets you create a Rollout resource that references an already existing Deployment.
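In Argo Rollouts terms this is the workloadRef field: the Rollout adopts the pod template of an existing Deployment instead of defining its own, so nothing has to be deleted up front. A sketch with placeholder names:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: service.example
spec:
  # Reuse the pod template from the existing Deployment rather than
  # copying it into the Rollout.
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: service.example
  replicas: 6
  selector:
    matchLabels:
      app: service.example
  strategy:
    canary:
      steps:
        - setWeight: 100
        - analysis:
            templates:
              - templateName: generic-rollback-rules
```

Once the Rollout is scaled up and healthy, the original Deployment can be scaled down to zero and eventually removed.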

The actual migration process for a single service looks like this:

Migration process flowchart which outlines the steps required to create a rollout, scale it out, make sure we don't have any K8s deployments left, point the HPA to target the rollout object

We used this process to migrate all services, first in our staging environment and then in production, without any downtime.
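The HPA step in that flow is essentially a one-line change to the scale target: HorizontalPodAutoscalers can target any resource that exposes the scale subresource, including Rollouts. A sketch with placeholder values:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: service.example
spec:
  scaleTargetRef:
    # Previously this pointed at the apps/v1 Deployment.
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: service.example
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```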

Lessons learned

Don’t underestimate migrating a large number of services and integrating with existing tooling. It’s easy to introduce a proof of concept automated rollback system, but integrating with existing tooling is really tricky and there are lots of edge cases. For us, migration and tooling work was >90% of the effort.

Heavily automate the migration process, and migrate as quickly as possible. We initially wrote a script for migrating services, but quickly realised that for 2k+ services, the throughput, resiliency and visibility were not sufficient. When you’re performing the same lengthy process >1000 times, something like the migrator service we described above makes it a lot less painful. Another option is to decentralise the migration: e.g. provide service owners with the script and let them do the migration. We advise avoiding this if possible. We’ve tried this for other migrations and they have stalled indefinitely.

Decouple UX changes from the introduction of a new internal system. We were tempted to support full progressive rollouts from the outset because Argo Rollouts supports them. But as we dug into the project we realised this would be a relatively disruptive change to developers’ workflows. In hindsight we could have reduced risk by initially introducing Argo Rollouts with no change to deployment behaviour, simulating the vanilla Kubernetes deployment strategy. We could then have introduced automated rollback as a subsequent project.

Expect issues with less mature third party systems. Argo Rollouts is less mature than Kubernetes, for example. As a clue, the resources have “alpha” in the name! We found a few bugs and limitations because of this. You need to be aware that by selecting a less mature system you’re taking on more responsibility and will likely have to contribute back to the project. This isn’t necessarily a bad thing with open source systems.

What’s next

Since we landed this project, it has saved us from a number of bad deployments to production, and we've had good feedback from engineers that it's a transparent but useful safety net.

Now that we have the foundations in place, we can start exploring more advanced rollout strategies: we’re particularly excited by full progressive rollouts, where a change is gradually rolled out to pods while analysis is run. We’d also like to provide a way for teams to configure their own rollback rules.

Ultimately our vision is to move to a GitOps model for deployments, and automated rollbacks are a step in this direction.