Platform Engineering means very different things at different companies. Your work might span integrating external vendors, deploying open-source projects, writing infrastructure as code and maintaining scripts to automate operations.
At Monzo, Platform Engineering looks a little different. We design APIs and write services that provide platform functionality for other engineers and systems to use. We think about reliability, ownership and test coverage in the same way that product teams do - except our users are Monzo engineers and our product is the platform.
In this post we’ll walk through how and why we approach platform tooling this way and share some concrete examples of the systems we’ve built.
Building on top of the platform
We are big advocates of open source tooling and use lots of open source technologies in our platform. But tools designed to work for many organisations come with constraints, and we sometimes find that we need to write our own service to solve a problem specific to Monzo. We build our own APIs and higher level abstractions around those technologies, which pushes the operational logic into a service owned by the platform team with a clear (and often opinionated) interface. Moving operational logic into code reduces toil for the team and reduces the chance of mistakes.
One of the things that makes our approach possible is the level of consistency in our backend. All of our backend services live in a monorepo and follow a set of common patterns for how they’re built, deployed and operated. It’s really easy to build a new backend service at Monzo, so it makes sense for us to use the same tooling to build platform services. We also benefit by becoming users of the platform ourselves - our services run on the platform that we operate, which keeps us close to the experience of other engineers and helps us understand how to improve it. We routinely build common libraries for interacting with our platform, and we use these libraries in our platform services as well.
Writing backend services for the platform also helps with knowledge sharing and makes it easy for engineers to move between teams. Engineers become familiar with a common stack for building services which applies universally across Monzo.
One of our most common use cases for a backend service is to automate an operation on our platform. Our backend RPCs already have protections over who can call which endpoints, plus the ability to require a second user to approve a request (“multi party auth” or MPA). We can use these to build powerful handlers that only run once approved. Instead of keeping a bunch of scripts for common operator functions, we can make a quick RPC call, someone else can approve that action, and our service will execute the steps itself.
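The request-then-approve flow described above can be sketched roughly as follows. This is a minimal, in-memory illustration only: `ApprovalGate`, `ApprovalError` and the method names are hypothetical and are not Monzo's actual MPA implementation, which this post does not describe in detail.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Dict


class ApprovalError(Exception):
    """Raised when a request is unknown or the approval is not allowed."""


@dataclass
class PendingRequest:
    requester: str
    action: Callable[[], str]  # the operation to run once approved


class ApprovalGate:
    """Hypothetical gate that holds operations until a second user approves."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._pending: Dict[int, PendingRequest] = {}

    def request(self, requester: str, action: Callable[[], str]) -> int:
        """Record the operation without running it and return its ID."""
        request_id = next(self._ids)
        self._pending[request_id] = PendingRequest(requester, action)
        return request_id

    def approve(self, request_id: int, approver: str) -> str:
        """Run the operation, but only if someone else approves it."""
        pending = self._pending.get(request_id)
        if pending is None:
            raise ApprovalError(f"unknown request {request_id}")
        if approver == pending.requester:
            raise ApprovalError("requester cannot approve their own request")
        # Remove the request first so it cannot be executed twice.
        del self._pending[request_id]
        return pending.action()
```

In this sketch the operation never runs at request time, only inside `approve`, and a rejected self-approval leaves the request pending so a second engineer can still approve it.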
One example of this is service.karpenter, a service for interacting with our Kubernetes cluster autoscaler, Karpenter. It has an endpoint which allows us to quickly toggle Karpenter’s disruption functionality per nodepool. If we need to reduce our level of pod churn or pause a rollout during an incident or routine maintenance, any engineer in our squad can make a simple RPC request and the service does the rest.
Having an RPC with MPA, instead of just access via kubectl, makes it much easier for engineers who are less familiar with Karpenter to perform tried and tested actions. It also allows us to automate large portions of our runbooks.
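As a rough illustration of what such a toggle might do under the hood, the sketch below builds the patch body a service could apply to a Karpenter NodePool. It assumes the toggle is implemented with the NodePool `spec.disruption.budgets` field, where a budget of `nodes: "0"` blocks voluntary disruption. The post does not say how service.karpenter works internally, so the function name and the resume value are hypothetical.

```python
from typing import Any, Dict

# Assumption: a budget of "0" nodes pauses voluntary disruption.
PAUSED_BUDGET = "0"
# Assumption: "10%" is used as the budget to restore when resuming.
DEFAULT_RESUME_BUDGET = "10%"


def build_disruption_patch(
    paused: bool, resume_budget: str = DEFAULT_RESUME_BUDGET
) -> Dict[str, Any]:
    """Build a merge-patch body for a NodePool's disruption budgets.

    Hypothetical helper: a real service would also need to authenticate,
    send the patch to the cluster, and record who asked for the change.
    """
    nodes = PAUSED_BUDGET if paused else resume_budget
    return {"spec": {"disruption": {"budgets": [{"nodes": nodes}]}}}
```

Because a merge patch replaces lists wholesale, this sketch overwrites any existing budgets on the nodepool. A real service would probably need to remember the original budgets and restore them when disruption is re-enabled.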
Writing services to run through infrastructure migrations
We have also built services that migrate other services. When we were moving all of our service workloads from self-hosted Kubernetes to EKS, we built a backend service to run the various steps needed to migrate a service automatically. It could be given a list of services at the start of the day and would migrate them throughout the day. It had stages to check various health metrics and could even roll back a migration if it got stuck. This migrator service was only possible because a lot of the platform actions it needed to take (shifting traffic between clusters, scaling a service up or down, and viewing metrics and logs) had already been built as services, so it could just call different RPC endpoints to perform them. This service migrated more than 3,000 of our microservices to our new EKS cluster, and it even migrated itself.
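The staged approach described above, with a health check after each step and a rollback on failure, might look something like the following sketch. The `Stage` type and the function names are hypothetical. In the real migrator each step would presumably be an RPC call to another platform service, not a local callable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Stage:
    """One hypothetical migration step for a single service."""
    name: str
    apply: Callable[[str], None]     # e.g. scale up in the new cluster
    rollback: Callable[[str], None]  # undo what apply did
    healthy: Callable[[str], bool]   # health check to run after apply


def migrate(service: str, stages: List[Stage]) -> bool:
    """Run stages in order; on an unhealthy result, undo in reverse order."""
    completed: List[Stage] = []
    for stage in stages:
        stage.apply(service)
        completed.append(stage)
        if not stage.healthy(service):
            # Roll back every applied stage, most recent first.
            for done in reversed(completed):
                done.rollback(service)
            return False
    return True


def migrate_all(services: List[str], stages: List[Stage]) -> Dict[str, bool]:
    """Migrate each service in turn and report which ones succeeded."""
    return {service: migrate(service, stages) for service in services}
```

Keeping each stage's rollback next to its apply step means a failed migration leaves only that one service on the old cluster, and the remaining services in the list can still proceed.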
At Monzo, platform engineering is not just about deploying tools or writing scripts. It is about building a robust, reliable, and consistent platform experience for all Monzo engineers. By treating platform tooling as a product, and by shipping services with clear APIs, we get fine-grained control over platform operations without turning them into bespoke, error-prone runbooks.
The foundations of our platform provide benefits such as centralised metrics and logging, automation and durability, and these sit alongside our change management process and testing. Our tooling powers all of these platform primitives and helps us both to extend the platform with new ones and to improve existing ones at pace. For example, when we hold incident debriefs we often think about how we can refine these control loops and make them more robust.
If the idea of building, owning, and enhancing the foundational services that power a major bank, while operating at scale and solving unique engineering challenges, sounds like something you’d thrive on, then we encourage you to explore an Engineer role!