
How we run migrations across 2,800 microservices

We’ve gained a lot of value from our microservices architecture (2,800 services and counting!), but it isn’t without its challenges. One of those challenges is how to make sweeping library changes across all of those services. Typically you must pick two of:

  • Up to date libraries

  • Consistent library versions

  • Low effort upgrades

Venn diagram with three overlapping properties: up-to-date libraries, consistent library versions, and low effort upgrades. There are question marks in the middle, questioning whether we can achieve all three of these.

We believe our approach of centrally driven migrations gives us a high degree of all three of these.

Rather than passing the responsibility of migrations onto service owners, we prefer to have a single team drive migrations. This avoids the usual problems of high coordination overhead (leading to slow migrations) and high risk that the project will stall (leading to inconsistency).

To make it possible for a single team to complete the migration within a reasonable timeframe, we lean heavily on:

  • Our core technological choices (e.g. a high degree of consistency, use of a monorepo)

  • Heavy use of automation (e.g. tooling for mass deploying services, automated rollback checks)

To showcase our strategy in action, we will look at a recent project to switch from a deprecated OpenTracing SDK to the OpenTelemetry SDK across all our services.

Setting the scene

All of our services emit trace data to Jaeger. They previously did this via the OpenTracing and Jaeger Go SDKs. Since we introduced these libraries, they have now been deprecated, and the community has consolidated around OpenTelemetry instead. To lay the foundations for improvements to our tracing system, we first wanted to replace the deprecated libraries with the OpenTelemetry SDK.

We’ll focus on this example for this blog post, but we have used the same principles and techniques for other migrations - not just library changes, but significant infrastructure changes too. For example, we recently did a centrally driven migration of our core database system used by the majority of our services. These migrations involve heavy use of automated testing / coherence checking, which is out of scope for this post. For more on this theme you can watch my colleague Suhail’s talk: Running Large Scale Migrations Continuously.

Migration principles

  • Centrally driven migrations that are transparent to service owners. To minimise the coordination overhead and reduce the risk that the migration will stall, we bias towards strategies that can be centrally driven by a single team.

  • No downtime. The majority of our migrations touch services that are critical to the core functionality of the bank, so we can’t accept any downtime.

  • Gradual roll forward, quick roll back. For sweeping changes, we want to be able to roll forward gradually to reduce the blast radius if things go wrong. But we also need to be able to quickly roll everything back if necessary.

  • 80/20 rule when it comes to automation. With a larger migration there is typically a high proportion of changes that fit a common template; these can be easily automated. For the more unusual use cases, automation provides diminishing returns and it’s often more efficient to tackle these on a case-by-case basis. It’s worth classifying the required changes into these two categories up front to avoid nasty surprises and make it easier to track progress.

The migration strategy

Our migration strategy consists of a series of steps that we apply systematically:

1. Planning and alignment

These migrations carry a substantial degree of risk: not only do they impact a large number of services, but the team running the migration does not necessarily have much (or any) context on the services they are migrating. While we strive for consistency in our services, there are always edge cases and we want to catch these surprises as early as possible!

For this reason, it’s super important that the planning process is transparent and all engineers have an opportunity to contribute. We have two processes for doing this:

  • Proposals: we write a lot of these at Monzo (even small changes might have a “mini proposal” in Slack). Anything substantial ultimately gets shared in a single Slack channel, for anyone in the company to comment on.

  • Architecture reviews: for our biggest changes, we’ll have a synchronous architecture review meeting where we go deeper into specific areas that are more controversial or risky. The goal is to accelerate the project by meaningfully progressing the state of the design, rather than to get approval or sign-off.

As well as increasing the chance that the project will run smoothly, we find these processes are a great way of drumming up excitement across engineering!

2. Wrap the old library

Diagram showing how a Monzo service uses the new Monzo tracing SDK, which in turn wraps the deprecated OpenTracing SDK.

Rather than installing the new library and updating service code to call it, we decided to wrap the old library first, for two reasons:

  1. It allows us to hook into calls to the underlying library and decide which implementation to use based on dynamic config. This means we can easily roll forward and back without needing to redeploy all services.

  2. There were some types/functions that were significantly different in the new library - it would have required a lot of effort to update all call sites, and in some cases the benefit of the new API was minimal. Wrapping the old library meant we could keep the interface similar to the old one in these cases, making it easier to update call sites.

There are other benefits we’ve found to wrapping libraries, like:

  • Being able to instrument them with our own telemetry libraries

  • Being able to provide more opinionated interfaces

We weigh these benefits against the cost of having an in-house API that needs to be learnt and supported.
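
To make this concrete, here’s a minimal sketch of what such a wrapper might look like in Go. The package name and interface are hypothetical illustrations rather than Monzo’s actual wrapper, but they show the key property: call sites only ever import the wrapper, never the underlying SDK.

```go
// Package tracing is a hypothetical sketch of a wrapper library that hides
// the deprecated OpenTracing SDK behind a small, opinionated interface.
package tracing

import (
	"context"

	opentracing "github.com/opentracing/opentracing-go"
)

// Span is the wrapper's own span type, so call sites never depend on the
// underlying SDK's types directly.
type Span struct {
	otSpan opentracing.Span
}

// Start begins a span using the underlying OpenTracing implementation.
// Because callers only depend on this package, the implementation can later
// be swapped out without touching call sites.
func Start(ctx context.Context, name string) (context.Context, *Span) {
	otSpan, ctx := opentracing.StartSpanFromContext(ctx, name)
	return ctx, &Span{otSpan: otSpan}
}

// Finish ends the span.
func (s *Span) Finish() {
	s.otSpan.Finish()
}
```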

3. Update call-sites

The usage of this library fit a common pattern: there were a small number of functions/types that were referenced a large number of times across our codebase, and then a long tail of functions/types that were only referenced in a handful of places.

A bar chart showing a small number of library functions/types with a large number of references, and a large number of library functions/types with only a small number of references.

We tackled each of these cases differently.

For the small number of functions/types that were referenced in many places, we automated as much as possible. In the case of this library we mostly relied on gopls and gorename to do automated refactoring. You might want to check out go-patch or rf for more advanced use cases.

We took a manual case-by-case approach to handle the long tail of functions/types that were referenced in only a few places. In some cases we manually migrated these. In other cases we realised that it was possible to achieve the same thing using more conventional APIs, so we switched them over. This meant we no longer had to special case them, and had the side benefit of keeping the API of our wrapper library small and opinionated.

In addition to wrapping the old library, we also blocked new dependencies on it from creeping in. We did this by adding CI checks using semgrep (incidentally, we use this tool to globally enforce all sorts of conventions across our monorepo).

4. Wrap the new library

Once the old library is wrapped, we can start adding the new library behind our wrapper library.

Initially the new implementation was disabled via config. This means that we can continue to incrementally merge changes to the master branch with no expected change in behaviour.

A diagram showing how an example Monzo service uses the new Monzo tracing SDK, which in turn wraps both the deprecated OpenTracing SDK and the new OpenTelemetry SDK.
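
Continuing the hypothetical sketch from earlier, the wrapper’s entry point might branch between the two SDKs along these lines. The useOpenTelemetry check stands in for a lookup against a dynamic config system, and it defaults to the old implementation so that merging the change has no effect on behaviour.

```go
// Extending the earlier sketch: the wrapper can now hold either an
// OpenTracing span or an OpenTelemetry span, chosen per call.
package tracing

import (
	"context"

	opentracing "github.com/opentracing/opentracing-go"
	"go.opentelemetry.io/otel"
	oteltrace "go.opentelemetry.io/otel/trace"
)

type Span struct {
	otSpan   opentracing.Span // old implementation
	otelSpan oteltrace.Span   // new implementation
}

// useOpenTelemetry is a stand-in for a dynamic config lookup. It returns
// false for now, so the new code path is merged but never exercised.
func useOpenTelemetry(ctx context.Context) bool {
	return false
}

// Start chooses an implementation at call time, so rolling forward or back
// only needs a config change rather than a redeploy of every service.
func Start(ctx context.Context, name string) (context.Context, *Span) {
	if useOpenTelemetry(ctx) {
		ctx, otelSpan := otel.Tracer("tracing").Start(ctx, name)
		return ctx, &Span{otelSpan: otelSpan}
	}
	otSpan, ctx := opentracing.StartSpanFromContext(ctx, name)
	return ctx, &Span{otSpan: otSpan}
}

// Finish ends whichever span was started.
func (s *Span) Finish() {
	if s.otelSpan != nil {
		s.otelSpan.End()
		return
	}
	s.otSpan.Finish()
}
```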

5. Mass deploy services

Before we could start enabling the new implementation, we needed to ensure that all running services were able to support it.

For other kinds of library changes, it might be possible to deploy the new functionality to just a subset of services at a time. But with a tracing library, if a service has been migrated to use the new library, then all services it might (transitively) call also need to support the new functionality.

To manage the deployment of large numbers of services, we’ve built mass deployment tooling which allows us to push out library changes across all services as an asynchronous batch job.

To mitigate the impact of possible bad deploys we:

  • Use automated rollback checks. You can read more about these here.

  • Deploy the least critical services first. We have tagged all of our services with a “tier”, and our mass deployment tooling uses this to prioritise the least risky deployments (see the sketch below).
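
Our mass deployment tooling is internal, but as a rough sketch of the tier-based ordering mentioned above (the Service type and its fields are hypothetical):

```go
package deploy

import "sort"

// Service is a hypothetical representation of a deployable service, tagged
// with a criticality tier (here, a higher number means less critical).
type Service struct {
	Name string
	Tier int
}

// orderForDeployment returns the services sorted so that the least critical
// ones are deployed first, limiting the blast radius of a bad change.
func orderForDeployment(services []Service) []Service {
	ordered := append([]Service(nil), services...)
	sort.Slice(ordered, func(i, j int) bool {
		return ordered[i].Tier > ordered[j].Tier // least critical first
	})
	return ordered
}
```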

6. Control rollout via config

The problem with our mass deployment tooling is that it’s relatively slow. What we really want to avoid is deploying all services, only to find out there’s a problem with the new library and we’re unable to roll back quickly.

So instead of deploying with the new implementation enabled, we deploy with the ability to enable the new implementation via our config system.

The advantage of using our config system here - compared to regular deployments - is that it’s quick. All our services refresh their config every 60 seconds, which means we can quickly roll back if we need to.

It also gives us far more control around when the new implementation is used - for example it could be enabled only for particular sets of users, or a random percentage of requests.

In this case we chose to roll it out only for API endpoints owned by our team, and then enabled it based on a gradually increasing probability.
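
As an illustration of that last point, the config check from the earlier sketch could be replaced with a probability-based one along these lines. The rolloutPercent value is a hypothetical stand-in for a number read from the config system, which every service refreshes roughly every 60 seconds.

```go
package tracing

import (
	"context"
	"math/rand"
)

// rolloutPercent is a hypothetical stand-in for a value fetched from the
// dynamic config system; dialling it up or down takes effect across the
// fleet within about a minute.
func rolloutPercent(ctx context.Context) float64 {
	return 5.0 // e.g. start by enabling the new SDK for 5% of requests
}

// useOpenTelemetry replaces the stub from the earlier sketch: it enables the
// new implementation for a random percentage of requests, so the rollout can
// ramp up gradually and be pulled back instantly by setting the value to 0.
func useOpenTelemetry(ctx context.Context) bool {
	return rand.Float64()*100 < rolloutPercent(ctx)
}
```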

7. Cleanup

Once we had fully switched to the new implementation we had the satisfying job of ripping out the old implementation from our wrapper library.

Migration superpowers

This kind of centrally driven migration was only feasible due to a set of foundational technological choices we’ve made as well as some tools we’ve built and continue to invest in.

Consistent technologies: All of our services are written in Go and use the same version of the old library. This makes it much easier to automate changes. For example, we only needed a single refactoring tool (rather than one per language).

A monorepo: All our service code is in a single monorepo, which makes it much easier to do mass refactoring in a single commit. It also lets us enforce the use of specific libraries globally in CI checks, helping us maintain consistency.

Mass deployments: With a large number of deployable components we need a hands-off automated deployment process for pushing out library changes.

Lightweight and flexible config service: Our deployment process is safe but slow (a couple of minutes per deployment). We need a more lightweight and flexible way to enable or disable new functionality almost instantly across a large number of services.

Conclusion

In the past we’ve tried decentralising migrations, but this has inevitably led to unfinished migrations and a lot of coordination effort.

This is why we have a strong bias towards centrally driven migrations; while one team must pay a relatively high price, we spend less effort overall and it significantly increases the chances that we retain consistency.

And this approach creates a virtuous cycle: the team running migrations has a strong incentive to invest in tooling to automate migrations, and to retain technological consistency (which makes it easier to build that tooling). However, we are still pragmatic about the degree to which we automate things - the 80/20 rule.

In addition to the tooling we continue to invest in, this approach is only feasible due to some of the core technological choices that we made at the start - primarily the fact that we use an opinionated and limited set of technologies.


Does running migrations at scale and without downtime interest you? We are hiring for Backend Engineers, Staff Engineers and Engineering Managers!