Since the early days of Monzo, we have built our banking platform around a microservices-based architecture, using mostly containerised workloads that are distributed and redundant. We run our infrastructure on AWS, and as an early adopter of Kubernetes, we have been able to efficiently scale a banking platform consisting of more than 20,000 containerised workloads across more than 2,000 microservices to date.
In this blog post, we’ll talk about the principles our Security Infrastructure team follows to build security into Monzo’s fast-moving engineering environment, how we robustly apply these principles in practice, and how we work with other engineering teams to keep our platform and customers safe.
When engineering teams introduce new infrastructure components into Monzo’s banking platform, our security engineers work with them to perform threat modelling as a collaborative exercise. It’s important to understand new risks, and agree on necessary security controls which both keep us safe and continue to allow us to move fast. We use the STRIDE model paired with OWASP Threat Dragon to draw diagrams and document our models.
Once new infrastructure components are introduced into the platform, security engineers will conduct regular threat modelling sessions for larger parts of our platform as a whole. This helps us keep threat models updated as time passes.
We condense the threat models for different parts of our platform into higher-level risks and necessary security controls, and derive our security design principles from them. Monzo security engineers then follow these principles when advising engineering teams or making key security design decisions.
Don’t trust workloads by default
When we deploy a new microservice, it cannot talk to other services or access the internet by default. It has no access to a database, secrets or other resources on the network. Access must be explicitly granted. We have Kubernetes Network Policies for traffic between all our microservices, we apply network and/or application access rules on all shared resources in the private network, and we allowlist egress to the internet on a DNS-name basis at the microservice level.
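To illustrate the default-deny idea, here is a minimal Python sketch. It is not the real Kubernetes Network Policy or egress configuration; the service names, the host name and the `POLICIES` table are hypothetical. The point is only that a service with no entry, or a target that was never granted, is refused.

```python
from dataclasses import dataclass, field


@dataclass
class ServicePolicy:
    """Explicit grants for one service; anything not listed is denied."""
    allowed_services: set = field(default_factory=set)
    allowed_egress_hosts: set = field(default_factory=set)


# Hypothetical policy table keyed by the calling service's name.
POLICIES = {
    "service.payments": ServicePolicy(
        allowed_services={"service.ledger"},
        allowed_egress_hosts={"api.example-bank.test"},
    ),
}


def may_call_service(caller: str, target: str) -> bool:
    """Allow service-to-service traffic only if it was explicitly granted."""
    policy = POLICIES.get(caller)
    return policy is not None and target in policy.allowed_services


def may_reach_host(caller: str, dns_name: str) -> bool:
    """Allow internet egress only to DNS names allowlisted for this service."""
    policy = POLICIES.get(caller)
    return policy is not None and dns_name.lower() in policy.allowed_egress_hosts
```

A newly deployed service that has no policy entry yet gets `False` from both checks, which mirrors the "no access by default" behaviour described above.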
We make sure these controls scale, both in terms of platform performance and developer productivity. For example, we now have more than 2,000 microservices running on our platform. Because our deployment pipeline automatically analyses service code on the main branch to work out valid network paths to other services and the internet, these controls impose no more security overhead now than they did in 2019.
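One way such an analysis could work is sketched below in Python. This is an assumption for illustration, not a description of the actual pipeline: it supposes that a service calls another by importing a client package whose path contains the target's name, and the import path format and the `derive_allowed_targets` helper are hypothetical.

```python
import re

# Assumed convention: calling another service means importing its client
# package, e.g. "example.com/platform/service.ledger/proto".
CLIENT_IMPORT = re.compile(r'"[^"\s]*/(service\.[a-z0-9-]+)/proto"')


def derive_allowed_targets(service_name: str, source_files: dict) -> set:
    """Return the other services that `service_name` references in its code.

    `source_files` maps a file name to that file's text content.
    """
    targets = set()
    for text in source_files.values():
        targets.update(CLIENT_IMPORT.findall(text))
    # A service importing its own package is not a network path to allow.
    targets.discard(service_name)
    return targets
```

The resulting set could then be turned into allow rules at deploy time, so engineers never have to write those rules by hand.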
For data and secret access, we have templated policies where each service can access resources namespaced to its name. For example, a Vault policy generated from such a template grants a service access only to paths that contain its own name.
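As a rough sketch of what a templated policy could look like, the Python below renders a Vault policy in HCL for a given service. The `secret/data/` mount, the read-only capability and the naming rule are assumptions made for the example, not the real layout.

```python
import re
from string import Template

# Hypothetical layout: each service may only read secrets stored under a
# prefix that matches its own name.
POLICY_TEMPLATE = Template(
    'path "secret/data/$service/*" {\n'
    '  capabilities = ["read"]\n'
    '}\n'
)

# Restrict names so a crafted value cannot widen the path (no "/", "*" or quotes).
VALID_NAME = re.compile(r"[a-z][a-z0-9.-]*")


def render_policy(service: str) -> str:
    """Render the Vault policy for one service from the shared template."""
    if not VALID_NAME.fullmatch(service):
        raise ValueError(f"invalid service name: {service!r}")
    return POLICY_TEMPLATE.substitute(service=service)
```

Calling `render_policy("service.ledger")` produces a policy whose only path is `secret/data/service.ledger/*`, so every service gets the same shape of policy scoped to its own namespace.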
The same approach is also used to assign our services unique cryptographic identities via a private public-key infrastructure (PKI), supported by strong hardware security measures.
For internet-facing traffic we receive, only load balancers terminating traffic for the backend endpoints they serve have public IP addresses, and firewall rules for subnets and resources ensure that they are the only components in our private network that can receive internet traffic. We are then able to focus on securing just a small set of network ingress paths.
We authenticate connections from Monzo staff to all internal systems and infrastructure using their linked staff profiles before allowing access to any private network interfaces. All access to our internal systems requires multiple authentication factors, which are less likely to be compromised at the same time. We are also careful about the toil that security measures generate. An example of this is how we improved our staff VPN system using a combination of client credentials secured on laptops and a dynamic factor using mobile devices.
For all network access controls we implement, we also ensure we have good visibility of these controls in action. We export metrics and set up alerts to flag situations where the controls may be misbehaving or there may be indications of an attack, and these feed into our on-call system to ensure that security engineers respond quickly to potential issues.
Automation, automation, automation
We strongly prefer automating infrastructure changes over having engineers make manual changes. Our change management policy requires at least one other person to review a change, and that policy is much harder to enforce if we allow manual changes from a laptop.
For example, any change to AWS has to go through peer review, after which Concourse automatically applies it to our staging environments. Once the change is approved, merged and verified, engineers can trigger the same change to apply to production.
We strive to provide self-service security as much as possible at Monzo, by giving engineering teams tools and guard rails so that they can implement security without us being a blocker. For example, if an engineering team wants to create a new pipeline in Concourse delivering changes for another part of our infrastructure, there are well-defined guidance and templates for them to follow, so the new pipeline then only requires review and approval.
When human intervention is required, often due to an incident, we have “break glass” tools which allow us to respond more quickly. We’ll talk about these later in this post.
Technical controls to back security policies
We prefer technically enforceable policies over paper-driven processes. Paper policies inform us what our technical controls should be, but they can’t be relied upon to prevent a security incident. When we mention that an action requires multi-party authorisation, we mean technical controls are in place to ensure the action can’t be performed unless it has been approved. This follows the “policy-as-code” principle.
Our change management policy requires at least one engineer to have approved changes to production, so we’ve made our deployment pipeline require review from the owning teams and the PR to be merged before allowing deployments.
We’ve scaled this beyond engineering. Sensitive actions by our customer operations team and other parts of the company also require multi-party authorisation.
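The deployment rule described above can be sketched as a small gate in Python. This is an illustration under assumptions, not the real pipeline: the `PullRequest` fields and the `can_deploy` helper are hypothetical, and the sketch treats an approval as valid only when it comes from a member of the owning team other than the author.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequest:
    author: str
    merged: bool
    approvers: frozenset = frozenset()


def can_deploy(pr: PullRequest, owning_team: set) -> bool:
    """Allow deployment only for merged changes approved by another owner."""
    # Approvals count only if they come from the owning team, and the
    # author cannot approve their own change.
    independent_approvals = (pr.approvers & owning_team) - {pr.author}
    return pr.merged and len(independent_approvals) >= 1
```

Because the check runs in the pipeline itself, the policy cannot be skipped by someone who has simply forgotten or ignored the written process.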
Log everything and make use of the logs
Each layer of our infrastructure produces audit logs, for example CloudTrail events from AWS and Kubernetes audit events from our Kubernetes clusters. Various network components of our infrastructure also generate their own access logs or traffic logs. Across all audit sources, we log all actions by both human users and service accounts, with the highest level of detail recorded on actions that potentially impact the security of our customers’ money and data. Components handling data in our audit pipelines implement append-only permissions to protect the integrity of audit logs and logging systems.
We have pipelines delivering audit events into a centralised system for detecting potentially suspicious patterns, which sends alerts to our security teams for further investigation. Because we build and manage most of our own event pipelines, we can enrich the stateless audit events, which reference specific resources or users, with additional contextual information from the infrastructure about their subjects. This information allows our alerting system to make refined decisions and improves the signal-to-noise ratio of our alerts.
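The enrichment step can be pictured with the Python sketch below. The lookup tables, field names and the alert rule are all hypothetical and chosen only to show how added context lets a rule distinguish routine activity from something worth a page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    actor: str
    action: str
    resource: str


# Hypothetical context that a raw, stateless event does not carry.
RESOURCE_OWNERS = {"db/ledger": "team-ledger"}
ACTOR_TEAMS = {"alice": "team-ledger", "bob": "team-growth"}


def enrich(event: AuditEvent) -> dict:
    """Attach the resource's owning team and the actor's team to the event."""
    return {
        "actor": event.actor,
        "action": event.action,
        "resource": event.resource,
        "resource_owner": RESOURCE_OWNERS.get(event.resource),
        "actor_team": ACTOR_TEAMS.get(event.actor),
    }


def should_alert(enriched: dict) -> bool:
    """Example rule: alert unless the actor belongs to the owning team."""
    owner = enriched["resource_owner"]
    return owner is None or enriched["actor_team"] != owner
```

Without the extra fields, every access to `db/ledger` would look the same; with them, the rule stays quiet for the owning team and fires for everyone else.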
The advantage of having a centralised system for analysing events is that we can easily integrate events which are not directly linked to actions by humans, and process them in the same way as we handle audit events generated from sources such as AWS CloudTrail and Kubernetes.
Finally, when it comes to auditing shell sessions in our production compute instances and Kubernetes workloads, we require more than the typical metadata-level logs on who accessed which server at what time.
Engineers at Monzo make use of extensive automation and internal tooling to deliver backend changes safely and more efficiently without infrastructure access. Direct infrastructure access is limited to exceptional situations only, and when we do allow engineers access, they have to go through Teleport, which acts as an auditing proxy, to preserve full visibility over their actions.
Breaking the glass safely
Our banking platform works reliably the vast majority of the time, and so do the infrastructure access control systems running on the platform. But when there’s a problem, our engineers will need to raise an incident and fix it as soon as possible. Our security controls work in a “fail-closed” manner, so if the infrastructure access systems go down with the platform, they will not expose any more access than when they were up.
However, we do need to let our platform engineers go in and fix the platform when this happens, and to do this we have implemented a set of backup access systems we call “break glass” systems. These systems provide us with the ability to access infrastructure components and internal tooling in an emergency, and are tested regularly by our engineers to ensure we can count on them in the event of a major incident.
Break glass systems need to have minimum dependency on other platform components so that they can work reliably in an incident, but this sometimes means that we can’t implement as many controls and logging systems as we would like for the regular access systems. However, we always ensure that when someone uses a break-glass access system, it will:
trigger very loud alerts and immediately page security on-call engineers
require multi-party authorisation if necessary
provide the same level of audit scope and ability to identify the user as using the regular system
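The three guarantees above can be sketched in a few lines of Python. This is a toy model under assumptions, not the real break-glass tooling: the `BreakGlassGate` class is hypothetical, and two in-memory lists stand in for the paging system and the audit sink.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class BreakGlassGate:
    """Hypothetical emergency access gate with very few dependencies."""
    pages: list = field(default_factory=list)      # stands in for paging
    audit_log: list = field(default_factory=list)  # stands in for the audit sink

    def request_access(self, user: str, reason: str,
                       approver: Optional[str] = None,
                       needs_approval: bool = True) -> bool:
        # Page security on-call for every attempt, whether granted or not.
        self.pages.append(f"BREAK GLASS: {user} requested access ({reason})")
        # Multi-party authorisation: the approver must be a different person.
        granted = (not needs_approval) or (
            approver is not None and approver != user
        )
        # Record who asked, who approved, and the outcome.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "approver": approver,
            "reason": reason,
            "granted": granted,
        })
        return granted
```

Keeping the alert and the audit record inside the same call as the access decision means there is no path to emergency access that skips either of them.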
Permissions are least privileged
Internal applications use role-based access controls where each role is scoped to a particular responsibility at the company. As a customer support specialist, you may have dozens of roles based on your area of expertise and training. The same applies to engineers.
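A minimal Python sketch of narrowly scoped roles follows. The role names, permissions and users are hypothetical; the idea is that each role maps to one responsibility and a person's access is simply the union of the roles they hold.

```python
# Hypothetical roles, each scoped to a single responsibility.
ROLE_PERMISSIONS = {
    "support.card-replacement": {"card:freeze", "card:reorder"},
    "support.disputes": {"dispute:view", "dispute:update"},
}

# Hypothetical assignments; a person can hold many small roles.
USER_ROLES = {
    "sam": {"support.card-replacement"},
    "kim": {"support.card-replacement", "support.disputes"},
}


def is_allowed(user: str, permission: str) -> bool:
    """Grant a permission only if one of the user's roles includes it."""
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )
```

Many small roles make it easy to remove exactly one responsibility from a person without touching the rest of their access.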
For SaaS products we aim to achieve the same where possible. AWS is one of our most critical SaaS products, so we write IAM policies for our AWS accounts that are as tightly scoped as practical. However, not all SaaS products have a policy language. For those products, we still try to keep the permissions assigned within them as close to least privilege as possible.
We hold access review sessions at a regular cadence, during which each team responsible for a SaaS product or an internal product reviews the users of that system, their roles and associated policies. The process is highly structured and audited at least once per year.
No shared credentials
We require a personal account for each member of staff so we can attribute actions to individuals. Where available, we enforce hardware tokens due to their many security benefits, including the ability to enforce our internal policy of no credential sharing.
This applies to internal applications such as our customer support portal, but also to all of our SaaS products that support a Single Sign-On (SSO) protocol. We use a centralised Identity Provider (IdP) which we configure to always require multiple factors, including a possession factor such as a hardware token device.
This blog post provides an overview of the different principles we follow to protect the lowest layers of our banking platform. The Security Infrastructure team is just one of the many teams at Monzo who work on different parts of our banking platform to keep our customers safe.
If you’re interested in building infrastructure as well as security, we are always looking for people who would be a great addition to our team, including backend engineers.