At Monzo, we constantly strive to provide industry-leading reliability for our customers. But we let our customers down on the 30th May last year, when we experienced problems making and receiving bank transfers, leaving people feeling rightly frustrated.
At the time of the incident, we relied on a third party provider to operate the component which connects Monzo to the UK bank transfer network. This component is known as the gateway. As an outcome of the incident, we promised you that we’d build our own gateway within our internal systems. We recognise that operating our own infrastructure and systems is the key to reliability, and we needed to build more specialised expertise within Monzo.
Our in-house gateway has been running since November 2019, and we wrote a blog post earlier announcing the successful outcome of the project. In this post, we want to give you some insight into how we approached this as an engineering challenge.
How does Faster Payments work?
Fig 1. A high level view of Faster Payments infrastructure
To help you follow along with this blog post, there are several key components within the Faster Payments infrastructure that are useful to know about.
Faster Payments Scheme (FPS)
The Faster Payments Scheme (FPS) is a domestic bank transfer network within the United Kingdom that facilitates real-time payments between customers across a range of banks and financial institutions. We call these parties participants.
All participants of FPS are connected to an external, centralised system that we call the “hub”. The hub keeps track of payment states and delivers payment messages between different participants. Participants connect to the FPS hub directly, and not to other participants.
The gateway refers to software that sits between a participant’s payment processor, and the FPS hub. It’s generally responsible for connecting a participant to the hub. A gateway establishes and maintains network connections to the hub, and it might also perform some processing of messages, like converting the structure of the message between what the hub provides and an internal format that a participant uses for payment processing.
The payment processor is the system that will either debit or credit an account, or reject a payment, based on a set of messages that we send and receive, and the state of an account. When you receive a bank transfer, our processor will perform a series of checks before crediting the amount to your account. Similarly, when you request an outbound bank transfer, the processor makes sure you have enough money before proceeding with your request.
“Store-and-forward” is a term describing a system that keeps messages it receives, and then sends them at a later time to their destination. Stand-in refers to how a gateway may provide store-and-forward functionality. Not all gateways have stand-in functionality. But if they do, this means the gateway has the ability to respond to payment messages on the participant’s behalf. This allows asynchronous forwarding of payment messages to the participant’s processor at a later date.
Stand-in is useful for periods when participants might need to perform maintenance that makes their payment processor unavailable. It’s also used as a mitigating mechanism for unexpected issues, such as a network disconnection between a gateway and processor. The gateway can provide stand-in responses telling the sending participant that the payment will be credited at a later time, and then when the payment processor is available again, the payment request can be applied to a customer’s account.
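To make the store-and-forward idea concrete, here is a minimal in-memory sketch in Go. It is an illustration only, not the gateway's actual code: the names `Message`, `StandInQueue`, `Accept`, `Flush` and `ErrProcessorUnavailable` are all hypothetical, and a real gateway would persist messages to durable storage rather than keep them in a slice.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrProcessorUnavailable is a hypothetical error for an unreachable processor.
var ErrProcessorUnavailable = errors.New("processor unavailable")

// Message is a hypothetical, heavily simplified payment message.
type Message struct {
	ID     string
	Amount int64 // in pence
}

// StandInQueue keeps messages that the gateway has already answered on the
// processor's behalf, until they can be forwarded.
type StandInQueue struct {
	pending []Message
}

// Accept stores the message and returns the stand-in reply for the hub.
func (q *StandInQueue) Accept(m Message) string {
	q.pending = append(q.pending, m)
	return "accepted: will be credited later"
}

// Flush forwards stored messages in arrival order. It stops at the first
// failure so nothing is lost or reordered, and returns how many were sent.
func (q *StandInQueue) Flush(forward func(Message) error) int {
	sent := 0
	for len(q.pending) > 0 {
		if err := forward(q.pending[0]); err != nil {
			break
		}
		q.pending = q.pending[1:]
		sent++
	}
	return sent
}

func main() {
	q := &StandInQueue{}
	fmt.Println(q.Accept(Message{ID: "p1", Amount: 1000}))
	fmt.Println(q.Accept(Message{ID: "p2", Amount: 250}))

	processorUp := false
	forward := func(m Message) error {
		if !processorUp {
			return ErrProcessorUnavailable
		}
		fmt.Println("credited", m.ID)
		return nil
	}

	fmt.Println("forwarded while down:", q.Flush(forward))
	processorUp = true
	fmt.Println("forwarded after recovery:", q.Flush(forward))
}
```

Stopping at the first failed forward keeps the queue in order, so the processor sees payments in the sequence the gateway accepted them.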
Why were we using a third party?
For nearly three years, we relied on a third party provider’s gateway for our connection to the FPS hub. We chose to use a third party in 2017 as it let us offer bank transfers to our customers quickly. By using a third party, we could get complex components of the gateway abstracted away and managed for us. Such components included specific cryptographic requirements and stand-in functionality, which our small team at the time didn’t have much experience building.
Fig 2. How the third party gateway fit into our connection to Faster Payments
Why did we decide to bring the gateway in-house?
Our third party provider was a robust partner that provided a reliable service for our customers. And they made it possible for us to provide core bank transfer functionality for our customers quickly and easily.
But as we grew, we started to push up against the operational limits of our provider. So last year, we decided to replace our third party gateway with our own gateway that we’d build ourselves.
We’d gained three years of operational experience and expertise with Faster Payments, which put us in a good position to build our own gateway. And, as we wanted to be in more control of incidents, if and when they occurred, we thought bringing the gateway within Monzo would let us react more effectively to future issues.
How did we do it?
This project was different from the work we usually do at Monzo. Our team had gaps in knowledge that we needed to fill before we could build the systems involved, and there were new external processes to follow. This post will cover just a few of the things we built for this project that were out of the norm.
Operating in our own data centres
Before the project, we only operated applications within cloud infrastructure provided by Amazon Web Services (AWS). And we had a few physical data centres with only the necessary networking equipment to connect us to external payment systems.
Our data centre equipment only consisted of routing and switching hardware, and we weren’t running any applications in our own data centres. A key part of this project was designing and building our own set of servers in data centres, so we could run our gateway inside our data centres. This part of the project was led by a newly formed specialist data centre squad within our Platform team.
Fig 3. Our previous setup with the third party gateway
Previously with our third party gateway, the system was designed to be in an active-standby configuration.
This meant only one data centre was the active data centre at a given time, processing all payment messages. The other data centre was on standby in case the primary became unavailable. Our on-call engineers needed around 15 minutes of downtime to “flip” payment messages to the standby data centre.
Fig 4. Our setup with an in-house gateway
With our new in-house gateway, we decided from the start to build it with an active-active configuration. This would mean payment messages could always be routed through either data centre at any time, and we wouldn’t need a failover if a data centre became unavailable. This would reduce customer impact to almost zero if we experienced issues isolated to a single data centre.
Our connections from each of our own data centres to the FPS hubs have redundancy. Each of our data centres has two physical connections, one to each of the two hubs. This means that we can continue to operate even if three out of the four connections become unavailable. If a connection is interrupted, payment messages are routed through the other data centre without any downtime.
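The redundancy described above can be modelled in a few lines of Go. This is a simplified sketch, not the real routing logic: the labels `dc1`, `dc2`, `hubA` and `hubB`, and the `Link`, `AllLinks` and `Route` names, are hypothetical. It only shows that with two data centres each connected to two hubs, a route exists as long as any one of the four links is up.

```go
package main

import "fmt"

// Link identifies one physical connection from a data centre to a hub.
type Link struct {
	DataCentre string
	Hub        string
}

// AllLinks lists the four connections: two data centres, each connected
// to both hubs. The labels are hypothetical.
func AllLinks() []Link {
	var links []Link
	for _, dc := range []string{"dc1", "dc2"} {
		for _, hub := range []string{"hubA", "hubB"} {
			links = append(links, Link{DataCentre: dc, Hub: hub})
		}
	}
	return links
}

// Route returns the first link that is still up, or false if every link
// is down.
func Route(down map[Link]bool) (Link, bool) {
	for _, l := range AllLinks() {
		if !down[l] {
			return l, true
		}
	}
	return Link{}, false
}

func main() {
	// Three of the four links are unavailable.
	down := map[Link]bool{
		{DataCentre: "dc1", Hub: "hubA"}: true,
		{DataCentre: "dc1", Hub: "hubB"}: true,
		{DataCentre: "dc2", Hub: "hubA"}: true,
	}
	link, ok := Route(down)
	fmt.Println(link, ok)
}
```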
We wanted to build a gateway that could continue to operate, even if the connection to our payment processor within AWS was interrupted. This is known as “stand-in” functionality and is a type of store-and-forward log. We’ll cover this more later. Building stand-in added further resilience and let us defer work to increase redundancy of connections to AWS.
Within our data centre, we focused on building a highly redundant storage system on the back of the Distributed Replicated Block Device (DRBD) system. We have three servers per data centre, each with two solid state drives (SSDs) within the DRBD cluster. Each write is written to six SSDs for maximum durability. A durable storage system underpins the gateway, and we valued durability over performance.
Fig 5. Replication of messages we send and receive at the gateways. We can survive an entire server failing, as well as one disk from each machine failing.
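As a rough way to reason about the failure scenario in Fig 5, the sketch below counts how many copies of a message remain readable after a set of failures. It is a plain counting model of three servers with two disks each, and it assumes that any one surviving copy is enough. It does not model how DRBD itself behaves, and the names `Disk` and `SurvivingCopies` are hypothetical.

```go
package main

import "fmt"

const (
	servers        = 3 // servers per data centre
	disksPerServer = 2 // SSDs per server in the replicated cluster
)

// Disk identifies one SSD by its server index and slot on that server.
type Disk struct {
	Server int
	Slot   int
}

// SurvivingCopies counts the replicas that are still readable, given the
// set of failed servers and the set of individually failed disks.
func SurvivingCopies(failedServers map[int]bool, failedDisks map[Disk]bool) int {
	n := 0
	for s := 0; s < servers; s++ {
		if failedServers[s] {
			continue // every disk on a failed server is unavailable
		}
		for d := 0; d < disksPerServer; d++ {
			if !failedDisks[Disk{Server: s, Slot: d}] {
				n++
			}
		}
	}
	return n
}

func main() {
	// One whole server fails, plus one disk on each remaining server.
	failedServers := map[int]bool{0: true}
	failedDisks := map[Disk]bool{
		{Server: 1, Slot: 0}: true,
		{Server: 2, Slot: 0}: true,
	}
	fmt.Println("copies left:", SurvivingCopies(failedServers, failedDisks))
}
```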
The same servers are made available for running applications, and all applications run in an isolated virtual machine. We made the conscious decision to avoid complex consensus algorithms and instead chose to rely on the redundancy of having two data centres. Our gateways run on predetermined servers, and can be migrated to other servers should a server become unavailable.
Building a replicated monolithic gateway
Monzo is known for our extensive use of microservice architecture.
But we eschewed that approach when it came to applications in the data centres. This is mostly because we decided to avoid consensus within the data centre, which ruled out cluster-based service orchestration like Kubernetes. Instead, we built a little monolith, and replicated it to allow for greater levels of redundancy.
Fig 6. Theoretical layout of replicated gateways within a single data centre
Within Faster Payments, there’s a notion of “payment types”, and each type of payment is a separate TCP connection to the hub. This let us deploy several copies of our gateways: one for each payment type (Fig 6). Each gateway is a vertically integrated application.
At one end, it handles complex connection logic to the FPS hub.
In the middle, it durably writes messages to disk, and performs authentication.
And at the other end, it forwards payments to and from AWS, and responds on our behalf with stand-in for messages that can’t be forwarded immediately.
This approach lets us combine several benefits of both types of architectures.
With a monolithic architecture, we avoided having to orchestrate service discovery and network calls, which are more prone to failure than in-process communication.
With a replicated architecture, where we run multiple monoliths, we can survive the failure of multiple instances of the gateway with zero customer impact.
Like other parts of our platform at Monzo, we built our applications with Go. The benefits we get from Go at the microservice level also apply to applications in the data centre. Small binaries are easy to deploy, and standalone binaries reduce the need to configure the system with dependency management systems. The lightweight runtime of Go lets us share system resources across many applications, both now and in the future.
Having an active-active gateway setup means we can make changes to the gateway applications without any impact to our customers. This is an important difference from our previous gateway, as we frequently had to schedule in maintenance windows where bank transfers could be unavailable or degraded for significant periods of time. Now, we can deploy our gateway with zero downtime.
Our data centre applications are orchestrated with Docker and Docker Compose, and we run this within separate virtual machines on the 3-server setup. Using Docker and Docker Compose lets us manage processes easily, with automatic restart on failure, as well as an easy way to template different configurations for running the gateway applications. Using virtual machines also lets us maintain isolation between different parts of the gateway, as well as isolation between other applications unrelated to FPS systems in the future.
Building a more reliable stand-in system
Building on the resilient and replicated server setup described in the previous section, we implemented a store-and-forward system. This is commonly referred to as “stand-in” functionality: our gateway can issue a stand-in response to the hub if it can’t successfully get a response from our payment processor for any payment request. The stand-in response tells the sending participant that this payment won’t be immediately credited to our customer, but instead at some point in the near future.
The logs that back our stand-in system are built within the same DRBD system mentioned earlier. All messages are replicated across six different disks on three machines. We can survive the loss of one disk in each machine, as well as the loss of an entire machine without any interruption. There are some throughput losses associated with replicating data across six disks, but we’re comfortable with this tradeoff after extensive load testing at current and predicted peak volumes.
We’ve improved the behaviour of stand-in. With our previous third party gateway, stand-in was a blunt switch. The old stand-in behaviour only kicked in after a period where our payment processor failed to respond to payment requests. This meant customer payments into Monzo would’ve timed out and reversed before we started returning stand-in responses. It was something we couldn’t turn on and off easily without customer impact, so we couldn’t test it as often as we wanted to.
Our new gateway now operates stand-in on the “per-message” level. For every payment message, we start a timer and if our payment processor hasn’t responded by the deadline, we’ll automatically reply with a stand-in response. Our gateway then keeps checking whether the connection is available again, so that it can forward the stand-in notification to our payment processor. The initial stand-in response is more or less instantaneous, requiring only the timeout to activate. It means there are no rejected payments if we start issuing stand-in responses.
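The per-message timer maps naturally onto Go's `select` statement. The following is a minimal sketch of that pattern under stated assumptions, not the gateway's real implementation: `Reply`, `HandlePayment` and the deadline values are hypothetical, and the step that stores the message for later forwarding is left out.

```go
package main

import (
	"fmt"
	"time"
)

// Reply is a hypothetical, simplified response sent back to the hub.
type Reply string

const (
	ReplyProcessed Reply = "processed"
	ReplyStandIn   Reply = "stand-in"
)

// HandlePayment asks the processor to handle one payment and falls back
// to a stand-in reply if no answer arrives before the deadline.
func HandlePayment(process func() Reply, deadline time.Duration) Reply {
	// Buffered so that a late processor reply does not block the goroutine.
	result := make(chan Reply, 1)
	go func() { result <- process() }()

	timer := time.NewTimer(deadline)
	defer timer.Stop()

	select {
	case r := <-result:
		return r
	case <-timer.C:
		// A real gateway would also record the message for later forwarding.
		return ReplyStandIn
	}
}

func main() {
	fast := func() Reply { return ReplyProcessed }
	slow := func() Reply {
		time.Sleep(200 * time.Millisecond) // simulate an unreachable processor
		return ReplyProcessed
	}

	fmt.Println(HandlePayment(fast, 50*time.Millisecond))
	fmt.Println(HandlePayment(slow, 50*time.Millisecond))
}
```

Because each message carries its own timer, a slow processor affects only the payments that actually miss their deadline, rather than flipping the whole gateway into stand-in mode.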
This is a much better experience for our customers, as payments will be automatically credited as soon as we are reconnected to our payment processor, instead of waiting for the sender to send it again. We’ve already seen this happen a few times, as network disconnections happen briefly every now and then. This gives us confidence that the system works without having to perform customer impacting tests.
We took advantage of the work required to build stand-in, and modified it slightly to let us store and forward all messages sent and received at each gateway. This fulfilled a requirement where we needed to store all payment messages sent and received for auditing purposes. And it has a nice side effect of continuously testing our stand-in flow without having to generate real stand-in responses. It uses the same system of writing messages to six disks and forwarding each message whenever a connection’s available.
Hardware Security Modules
A key aspect in connecting to the Faster Payments Scheme is meeting strict security requirements imposed by the scheme to protect all participants. This involves implementing high cryptographic standards to allow secure delivery of messages. This was one of the key areas that we weren’t confident doing ourselves back in 2017, which was part of the reason why we used a third party at the time. With our move in-house, we developed expertise in accredited cryptographic systems, and now securely operate Hardware Security Modules (HSMs) to meet the scheme’s requirements.
HSMs are physical devices that store cryptographic keys and perform cryptographic operations with these keys. These devices help us verify that payment messages we receive are authentic, and they help us in authenticating messages we send to the FPS hub. A motivation behind using HSMs is that it becomes extremely difficult, by design, to extract keys that are stored within HSMs.
HSMs are designed to reduce the attack surface where physical access to a server could result in key extraction from non-volatile or even volatile memory. HSMs are designed to specific standards that disable the device if the hardware is tampered with. They also make it deliberately difficult to handle key material insecurely within application code, providing another layer of protection against programmer errors.
We implemented another highly redundant and replicated strategy for the reliable operation of HSMs. Each data centre has two HSMs, for a total of four HSMs. We can continue operating our gateways with only one HSM.
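One simple way to picture "we can continue operating with only one HSM" is a failover loop that tries each device in turn. The sketch below is illustrative only: real HSM client APIs are vendor-specific, so the `Signer` interface, `SignWithFailover`, `ErrNoHSMAvailable` and the `fakeHSM` type are all hypothetical, and the fake device does not produce a real signature.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoHSMAvailable is returned when every HSM fails.
var ErrNoHSMAvailable = errors.New("no HSM available")

// Signer is a hypothetical interface over one HSM operation.
type Signer interface {
	Sign(msg []byte) ([]byte, error)
}

// SignWithFailover tries each HSM in order and returns the first
// successful result.
func SignWithFailover(hsms []Signer, msg []byte) ([]byte, error) {
	for _, h := range hsms {
		if sig, err := h.Sign(msg); err == nil {
			return sig, nil
		}
	}
	return nil, ErrNoHSMAvailable
}

// fakeHSM stands in for a real device in this demo only.
type fakeHSM struct {
	name string
	up   bool
}

// Sign returns a placeholder value, not a real cryptographic signature.
func (f fakeHSM) Sign(msg []byte) ([]byte, error) {
	if !f.up {
		return nil, errors.New(f.name + " unavailable")
	}
	return []byte(f.name + ":" + string(msg)), nil
}

func main() {
	// Two HSMs per data centre; only the last one is healthy.
	hsms := []Signer{
		fakeHSM{name: "dc1-hsm1"},
		fakeHSM{name: "dc1-hsm2"},
		fakeHSM{name: "dc2-hsm1"},
		fakeHSM{name: "dc2-hsm2", up: true},
	}
	sig, err := SignWithFailover(hsms, []byte("payment"))
	fmt.Println(string(sig), err)
}
```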
The move to an in-house gateway has enabled us to increase the level of observability of our FPS system. This has removed a layer of abstraction and opacity from our previous system, and allows us to more clearly monitor the health of our FPS system and react to any issues quickly. All of our FPS components, including the applications running in the data centres, are aggregated via our centralised monitoring system. By feeding both AWS and data centre application metrics into one system, we are able to utilise our existing alerting infrastructure to surface any potential issues.
We read metrics every 10 seconds, allowing a very granular level of observability. If our systems detect abnormalities, our engineers are paged within 60 seconds, alerting them to any issues that may have occurred. We have dedicated engineers who cover a payments on-call rotation and are trained to respond to incidents at any time of the day or night.
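To show how a 10 second read interval fits inside a 60 second paging target, here is a small worked sketch in Go. The 10 and 60 second figures come from this post. The rule of paging after three consecutive abnormal samples is a made-up assumption for illustration, as are the names `PageAt` and `WorstCaseDelay`.

```go
package main

import (
	"fmt"
	"time"
)

const (
	scrapeInterval   = 10 * time.Second // how often metrics are read
	pageBudget       = 60 * time.Second // target time to page an engineer
	badSamplesToPage = 3                // hypothetical: abnormal samples in a row before paging
)

// PageAt returns the index of the sample at which a page would fire, or
// -1 if the abnormal samples never form a long enough run.
func PageAt(abnormal []bool) int {
	run := 0
	for i, bad := range abnormal {
		if !bad {
			run = 0
			continue
		}
		run++
		if run == badSamplesToPage {
			return i
		}
	}
	return -1
}

// WorstCaseDelay is the longest gap between a problem starting and the
// page firing: a problem can begin just after a sample is taken, so it
// takes up to badSamplesToPage full intervals to observe enough samples.
func WorstCaseDelay() time.Duration {
	return badSamplesToPage * scrapeInterval
}

func main() {
	samples := []bool{false, true, true, false, true, true, true}
	fmt.Println("page fires at sample:", PageAt(samples))
	fmt.Println("worst case delay:", WorstCaseDelay())
	fmt.Println("within budget:", WorstCaseDelay() <= pageBudget)
}
```

Requiring a short run of abnormal samples trades a little detection time for fewer pages caused by a single noisy reading.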
Real-time logging also enables us to perform real-time, ad hoc analysis of unforeseen issues, by letting us quickly query across live payment messages to diagnose and troubleshoot issues. Our payment messages are stripped of sensitive information and stored in Google BigQuery, allowing near-real-time analysis of live and historic data, which can be invaluable during incidents. As a member of the on-call roster before and after the move to our own gateway, I can attest to the improvement in our ability to handle incidents related to Faster Payments!
Our in-house gateway has been live since the end of 2019. We migrated from our third party in the early hours of Saturday 2nd November. We’re glad that everything went to plan — we finished in the first hour of a three hour maintenance window — and things have been ticking along just fine since then 😀
In the future, we’ll look to improve the redundancy of our connections between our data centres and AWS, and thus further improve the reliability of bank transfers for our customers. Our current deployment process for the data centre is also a bit cumbersome and we plan to improve this process by integrating it with our normal deployment process that other engineers at Monzo use.
There’s a lot that I haven’t covered in this blog post, like how we built our own “mock” hub so we can more easily perform functional and non-functional testing, or how we improved the reliability of our data centre to AWS connections. Let us know if you’re interested in finding out more about other parts of this project.
Last but not least, we’re very proud of our team and everyone involved. The FPS gateway project was a major collaboration across Monzo between our Payments, Platform, Customer Operations, and Risk and Compliance teams. Outside of Monzo, there are many thanks due to the great teams at TransferWise, Pay.UK, and Vocalink, who all helped advise us.