Securing admin access to Monzo’s platform


Monzo’s microservices are deployed on a shared “platform” of infrastructure, libraries and tooling. This allows our product teams to focus on building features for customers while our platform and security teams ensure everything is running efficiently and securely.

We regularly deploy improvements to our entire fleet of 2500+ services, often without the teams that own those services even noticing the change. Taking examples just from Security, we’ve written before about how we introduced RPC permissions and egress controls for all of our services. A shared platform is a fantastic force-multiplier but it has corresponding risks: the admin permissions needed to make these platform-wide changes are sensitive and must be guarded carefully.

This post describes how we combined confidential computing and reproducible builds into a system for brokering access to temporary infrastructure credentials. We’ll also discuss how we made this system resistant to the most determined attackers – including engineers in the team that maintain it!

Multi-party authorisation for infrastructure changes

Monzo’s core infrastructure consists of a small set of Kubernetes clusters hosted by AWS. Making changes to this infrastructure is a sensitive process: it requires access to highly-privileged credentials and the potential blast-radius of an incorrect change is often high.

A simplified view of Monzo’s backend, showing example microservices (service.ledger and service.pot) as built on top of a shared “common platform” of Kubernetes clusters and AWS services. (Source: https://whimsical.com/monzo-s-platform-BWi7MFVycewC5Mfw8EfSvT)

One good way to mitigate this risk is to delegate as much of it to machines as possible. Fortunately, we want to do that anyway! For example, we run a Terraform-based continuous deployment pipeline for a lot of our infrastructure. All changes made by this pipeline must first be approved by other engineers in code reviews.

Unfortunately this can only take us so far: sometimes a change must be made for which no automated process yet exists. This is often the case when we need to make unplanned changes quickly in order to resolve an incident. In these cases, we want a pair of engineers to be able to temporarily escalate their privileges together in order to be able to make changes to our infrastructure.

The process of allowing a group of engineers to perform some action together is called multi-party authorisation (MPA). This is a great way to empower engineers to move quickly and securely. We’ve written about several of our existing MPA systems on this blog before, for example when enforcing an approval process for RPCs to sensitive endpoints.

Applying an MPA workflow to infrastructure permissions is straightforward: we can write yet another microservice for it! Engineers make requests to this broker service to gain some short-lived credentials that they can use to authenticate with some other system (e.g. an AWS account or a Kubernetes cluster). These requests are then approved by a different engineer, allowing the requesting engineer to fetch and use their temporary credentials.

Sequence diagram showing two engineers gaining temporary AWS admin credentials by interacting with a broker service. (Source: https://whimsical.com/mpa-interaction-D55grdX3XTAqYFRAntpNUf)
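To make that flow concrete, here's a minimal sketch of how such a broker might model the request/approve/fetch lifecycle. The type and state names are purely illustrative rather than Monzo's actual service definitions; the key detail is that a request can never be approved by the engineer who raised it.

```go
// Hypothetical sketch of a broker's request lifecycle. Names like
// CredentialRequest and the state constants are illustrative only.
package broker

import (
	"errors"
	"time"
)

type State string

const (
	StatePending  State = "PENDING"
	StateApproved State = "APPROVED"
	StateExpired  State = "EXPIRED"
)

type CredentialRequest struct {
	ID         string
	Requester  string // engineer asking for credentials
	Approver   string // engineer who reviewed the request
	Target     string // e.g. an AWS account or Kubernetes cluster
	Reason     string
	State      State
	ApprovedAt time.Time
}

// Approve enforces the multi-party rule: the approver must be a
// different engineer from the requester.
func (r *CredentialRequest) Approve(approver string) error {
	if approver == r.Requester {
		return errors.New("requests cannot be self-approved")
	}
	if r.State != StatePending {
		return errors.New("request is not pending")
	}
	r.Approver = approver
	r.State = StateApproved
	r.ApprovedAt = time.Now()
	return nil
}
```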

The broker service just needs some way to generate short-lived credentials for the system it safeguards. Both AWS and Kubernetes have built-in support for this: AWS has the Secure Token Service and Kubernetes supports OIDC tokens. In each case, the service is granted some long-lived seed credentials (e.g. an OIDC private key) that it can use to mint short-lived credentials for use by engineers.
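As an illustration of the AWS side, a broker holding suitable seed credentials could mint short-lived credentials with STS's AssumeRole call, roughly as in the sketch below. The role ARN, session name and duration are placeholders, not Monzo's actual configuration.

```go
// Minimal sketch of minting short-lived AWS credentials with STS AssumeRole.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sts"
)

func main() {
	ctx := context.Background()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}

	client := sts.NewFromConfig(cfg)

	// Mint credentials that expire after 15 minutes (the STS minimum),
	// named after the approved MPA request so the session is attributable
	// in CloudTrail. Role and session name are placeholders.
	out, err := client.AssumeRole(ctx, &sts.AssumeRoleInput{
		RoleArn:         aws.String("arn:aws:iam::123456789012:role/admin-breakglass"),
		RoleSessionName: aws.String("mpa-req-alice-example"),
		DurationSeconds: aws.Int32(900),
	})
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("temporary access key:", aws.ToString(out.Credentials.AccessKeyId))
	fmt.Println("expires at:", aws.ToTime(out.Credentials.Expiration))
}
```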

Engineers interact with the broker using an mpa command-line tool:

Screenshots demonstrating the command-line interface of the MPA system. The flow is split into three phases: (1) Alice requests credentials with mpa request, (2) Bob approves this request with mpa review, and (3) Alice fetches her credentials with mpa fetch and uses them.

These requests are all audited via our security event pipeline and surfaced in Slack for visibility over how our privileged credentials are being used:

A screenshot of a Slack thread showing an engineer requesting Kubernetes credentials as part of remediating an incident.

Bootstrapping the security of our platform

The purpose of the MPA service is to protect its own seed credentials and restrict engineers to using derived short-lived credentials (and only after getting approval from another engineer).

There’s a snag though: how do we defend against privilege escalation attacks from maintainers of the system itself? If we deployed this service directly to our platform, an attacker with temporary platform-level admin permissions might be able to escalate their privileges in one of two ways:

  1. exfiltrating secrets from the actively-deployed system, for example by reading the memory of the active deployment via the host machine;

  2. changing the system in some way to weaken its security properties, for example by deploying a malicious version of the service that sends secrets directly to the attacker.

We required some way to bootstrap the security of the system, creating a service that is truly more trusted than any of the engineers that maintain it. This meant we couldn’t rely on the security of our existing platform abstractions (Kubernetes worker nodes, secret storage etc.). Despite this, we still wanted the system to look and behave like a regular Monzo microservice: it needs to be understandable and maintainable by engineers that are used to working with our platform.

Confidential microservices

We defend against exfiltration of secrets from the running service by deploying it in a very isolated environment: an AWS Nitro enclave. These are hardened EC2 virtual machines with a reduced attack surface. In particular, they have no direct access to persistent storage or external networks.

The only way to communicate with a deployed enclave is via a virtual socket on the EC2 instance designated as the enclave’s “parent”. The parent instance is able to interact with the enclave via this socket but can’t otherwise tamper with it (e.g. by reading its memory). Once a trusted workload has been deployed into an enclave, its memory is confidential to everyone else – even admins in the host AWS account!

We run an Envoy proxy on the parent instance that tunnels traffic to and from the enclave without terminating TLS, allowing the enclave to interact with engineers and AWS services without being vulnerable to meddler-in-the-middle attacks in the host AWS environment. The enclave itself runs a standard Typhon RPC server – written like any other Monzo microservice – supplied in a custom “enclave image” format when the enclave is first started.

Diagram showing the flow of requests to and from the Nitro enclave via the parent EC2 instance. Requests are tunnelled through an Envoy proxy on the parent instance. Traffic forwarders on the parent instance and enclave bridge TCP traffic through either side of the virtual socket. (Source: https://whimsical.com/mpa-network-diagram-4CKDQH4NYLK6CB6ou96HwL)
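A parent-side traffic forwarder of the kind described above could look roughly like this sketch. It uses the open-source github.com/mdlayher/vsock package with placeholder CID and port values, and isn't necessarily how Monzo's forwarder is written.

```go
// Sketch of a parent-side forwarder that bridges a local TCP listener to the
// enclave's vsock without inspecting the traffic, so TLS terminates inside
// the enclave rather than on the parent instance.
package main

import (
	"io"
	"log"
	"net"

	"github.com/mdlayher/vsock"
)

const (
	enclaveCID  = 16   // placeholder: assigned when the enclave is started
	enclavePort = 8443 // placeholder: port the enclave-side forwarder listens on
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:8443")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go forward(conn)
	}
}

// forward copies bytes in both directions between the TCP connection and the
// enclave's virtual socket.
func forward(tcpConn net.Conn) {
	defer tcpConn.Close()

	vsockConn, err := vsock.Dial(enclaveCID, enclavePort, nil)
	if err != nil {
		log.Printf("dial enclave: %v", err)
		return
	}
	defer vsockConn.Close()

	go io.Copy(vsockConn, tcpConn)
	io.Copy(tcpConn, vsockConn)
}
```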

Authenticating as a trusted workload

It’s not enough to ensure that the deployed service is tamper-resistant: we also need to verify that the service we deployed into the enclave is trusted in the first place before we grant it access to any secrets – in particular, the seed credentials that it uses to mint short-lived tokens for engineers. It’s crucial that these are only ever accessible by known-good enclave images, otherwise an attacker could “upgrade” the service to a malicious version to gain access to the secrets. This is made possible by another feature of Nitro enclaves called cryptographic attestation.

When an enclave first boots, the AWS Nitro system takes various measurements of the running workload, including a hash of the entire kernel and user-space binaries. The enclave is able to request these measurements from an attached device (the “Nitro Security Module”) in the form of an “attestation document”. This document is associated with an ephemeral key owned by the enclave and is signed by AWS’ public key infrastructure, allowing any third party to verify the workload running in an enclave and return encrypted secrets that are readable only by that enclave.

By itself, this doesn’t achieve anything: we still need some other system to exchange this authentication factor for a set of secrets that the enclave can use. We can’t rely on our standard platform secret storage solution for this as it would introduce a cyclic dependency that might allow a platform administrator to subvert the MPA process.

Fortunately, the AWS Key Management Service (KMS) supports Nitro attestation documents natively for its Decrypt operation. This allows an enclave running a trusted workload to decrypt its own secrets by sending them to KMS along with its attestation document. An untrusted enclave will never be able to access the service’s secrets because it can’t generate an attestation document with the correct measurements.

Diagram showing how an enclave is able to authenticate as a trusted workload. The enclave-deployed service fetches a signed attestation document from its attached Nitro Security Module, then presents this to AWS KMS in exchange for decrypting its secrets. (Source: https://whimsical.com/mpa-trusted-compute-8fe6fuZQLpmwA89BYoNSeW)
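A rough sketch of that decryption step is shown below. It assumes the attestation document has already been fetched from the Nitro Security Module, and the field and constant names follow the aws-sdk-go-v2 KMS package's Nitro attestation support; treat the details as illustrative rather than Monzo's exact implementation.

```go
// Sketch of the enclave asking KMS to decrypt its seed credentials, presenting
// a Nitro attestation document so KMS can check the workload's measurements.
package enclave

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/service/kms"
	"github.com/aws/aws-sdk-go-v2/service/kms/types"
)

// decryptSeed exchanges the encrypted seed credentials for a ciphertext that
// only this enclave's ephemeral key can open. attestationDoc is the signed
// document fetched from the Nitro Security Module (not shown here).
func decryptSeed(ctx context.Context, client *kms.Client, encryptedSeed, attestationDoc []byte) ([]byte, error) {
	out, err := client.Decrypt(ctx, &kms.DecryptInput{
		CiphertextBlob: encryptedSeed,
		Recipient: &types.RecipientInfo{
			AttestationDocument:    attestationDoc,
			KeyEncryptionAlgorithm: types.KeyEncryptionMechanismRsaesOaepSha256,
		},
	})
	if err != nil {
		return nil, err
	}
	// When a Recipient is supplied, KMS returns the plaintext re-encrypted
	// under the enclave's ephemeral public key rather than in Plaintext.
	return out.CiphertextForRecipient, nil
}
```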

Reproducible builds of enclave images

The final piece of the puzzle is working out which workload hashes should be trusted to begin with. We needed a process that would be resilient to any one engineer attempting to inject bad code into the system.

Our enclaves are pure Go binaries, meaning it’s straightforward to reproducibly compile them. Achieving this required re-implementing certain AWS-provided Nitro enclave SDKs in Go to avoid depending on external build toolchains that are hard to make reproducible. With a reproducible build process, we can have two Security engineers start with a Git commit that they both trust and independently reproduce the same workload hash. Once they’ve done this, they use the MPA system itself to deploy a new version of the service with the agreed hash.
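As a simplified illustration of that comparison step, each engineer could hash the enclave image they built and compare digests out-of-band. In practice the trust decision is based on the Nitro PCR measurements rather than a plain file digest, and the path below is a placeholder.

```go
// Sketch of the digest comparison two engineers might do after independently
// building the enclave image from the same trusted Git commit.
package main

import (
	"crypto/sha512"
	"fmt"
	"log"
	"os"
)

func main() {
	// Placeholder path to the reproducibly-built enclave image.
	img, err := os.ReadFile("build/mpa-enclave.eif")
	if err != nil {
		log.Fatal(err)
	}

	// Nitro measurements are SHA-384 based, so print a SHA-384 digest for
	// both engineers to compare before deploying the new version.
	digest := sha512.Sum384(img)
	fmt.Printf("%x\n", digest)
}
```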

Putting it all together

We’ve had this system running in production for over a year now, using it every day for business-as-usual changes and incident remediation. Thanks to MPA, no engineer can gain privileged access to our infrastructure without approval from another engineer – this is a huge achievement 🚀

We’ve also benefitted a lot from increased observability into how often we’re reaching for privileged credentials and why. This helps us prioritise building systems to automate error-prone manual processes and nudge engineers away from relying on their own credentials to make changes. Now that we have a framework for deploying very trusted workloads inside enclaves, we’re excited to see what other processes we can automate – stay tuned!

Are you interested in joining Monzo?

We have some great opportunities at the moment across the whole business. We currently have open roles for Backend Engineers and Senior Backend Engineers, and plenty more roles you can check out on our careers page.