We had problems with bank transfers on 30th May. Here's what happened and how we're fixing it for the future.

On the 30th of May 2019 between 09:54 and 19:20, around a quarter of bank transfers into Monzo accounts were failing or delayed by several hours. And bank transfers from Monzo accounts were delayed by a few minutes.

During this time, you might’ve had trouble getting payments from other banks, had payments into your account take a while to arrive, or seen bank transfers arrive in your Monzo account then get reversed later.

This was down to a technical problem at the company we use to connect to Faster Payments, the system that powers most bank transfers in the UK.

This is totally unacceptable and we’re really sorry. During and since the issue, we’ve been making sure you haven’t been left out of pocket by what happened, by providing access to emergency money while waiting for a transfer, or covering any fees you’ve been charged because of delayed payments.

To make sure it doesn’t happen again, we’re committing to replacing this third party and bringing everything related to our Faster Payments connection in-house.

We want to explain what happened during the incident, how it impacted you, and how we responded.

Some background

To understand what went wrong, we need to explain the terms we’re going to use throughout this writeup, and go over some technical details about how Faster Payments actually works.

Technical terms

  • Faster Payments (FPS)

    : A UK payment system which connects a number of banks and financial institutions to facilitate real-time bank transfer payments.

  • Hub

    : A computer system that all banks and other financial institutions that use the Faster Payments system connect to. It’s responsible for tracking the state of payments, and making sure each bank understands whether a payment was successful or not. Banks communicate with the Hub, rather than directly with each other.

  • Gateway

    : Software responsible for maintaining secure connections between banks and the Hub, and relaying messages between the Hub and a bank’s internal Faster Payments Processor. Many banks (including Monzo) outsource the operation of a Faster Payments Gateway to a third party company.

  • Processor

    : Software responsible for sending and receiving Faster Payments messages, interpreting them, and moving money to and from customer accounts.

  • Stand-In

    : A common feature of Faster Payments Gateways. This means a Gateway can accept messages on a bank’s behalf if the bank’s Processor is unavailable, and deliver them to the bank’s Processor later.
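To make the Stand-In idea concrete, here is a minimal sketch using an in-memory queue. The names (`Gateway`, `receive`, `set_stand_in`, `processor`) are hypothetical and the response strings are illustrative. It isn't based on any real Gateway product, and a real one would persist its queue.

```python
from collections import deque
from typing import Callable, Deque, List


class Gateway:
    """Toy Gateway: relays messages to a Processor, or queues them in Stand-In."""

    def __init__(self, processor: Callable[[str], str]) -> None:
        self._processor = processor        # the bank's own Processor (hypothetical)
        self._queue: Deque[str] = deque()  # messages accepted on the bank's behalf
        self.stand_in = False

    def receive(self, message: str) -> str:
        """Handle one inbound message and return the response for the Hub."""
        if self.stand_in:
            # Accept on the bank's behalf and hold the message for later.
            self._queue.append(message)
            return "accepted, not yet applied"
        return self._processor(message)

    def set_stand_in(self, enabled: bool) -> None:
        """Switch Stand-In on or off; switching off drains the queue in order."""
        self.stand_in = enabled
        if not enabled:
            while self._queue:
                self._processor(self._queue.popleft())


processed: List[str] = []


def processor(message: str) -> str:
    processed.append(message)
    return "accepted and applied"


gateway = Gateway(processor)
gateway.set_stand_in(True)
gateway.receive("payment-1")
gateway.receive("payment-2")
seen_during_stand_in = list(processed)  # the Processor has seen nothing yet
gateway.set_stand_in(False)             # the backlog is delivered in arrival order
```

The point of the design is that the sending bank still gets a timely answer while the receiving bank's Processor is unavailable, at the cost of the payment arriving later.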

How Monzo connects to Faster Payments

We launched current accounts in 2017. And, like all bank accounts, you can use Monzo to send and receive bank transfers via Faster Payments.

We built our own Faster Payments Processor, but chose to use a pre-existing Gateway, built and operated by a third party company. This significantly reduced the work we had to do to start sending and receiving bank transfers, as the process of connecting to our Gateway provider was simpler than connecting to the Hub directly.

We still connect to Faster Payments through this third party Gateway today. But we’re committing to bringing everything related to our Faster Payments connection in-house.

How Faster Payments works

When you go to send a friend a bank transfer, you’re asking your bank to send a Faster Payment to your friend’s account.

If you bank with Monzo (the ‘sending bank’), we’ll send a request message to the Hub and wait for a response. The Hub will relay this request message to your friend’s bank (the ‘receiving bank’), which then has 25 seconds to respond to the Hub with a response message.

What happens behind the scenes when you send a bank transfer.

The response message indicates whether the bank who’s receiving the payment has accepted it or not. They’ll reply with a response message that uses specific codes to say things like:

  • “this has been accepted and immediately applied to the recipient’s account”

  • “this has been accepted, but not immediately applied to the recipient’s account”

  • “this has not been accepted, the account number isn’t valid”

If the Hub gets a response before the 25 second deadline has passed, it relays the response to the sending bank and records the result.

If the Hub doesn’t get a response before the 25 second deadline has passed, it sends a message to the sending bank with a rejection code to say something like, “the receiving bank did not respond in time.”

The Hub will also then send a “reversal” message to the receiving bank. This lets the receiving bank know there was a request message they didn’t acknowledge, and they shouldn’t put the payment in the recipient’s account as the sending bank has been given a rejection code. This message is repeated until the receiving bank acknowledges it.

If the receiving bank doesn't respond to the Hub in 25 seconds, the Hub tells the sending bank that the receiving bank didn't respond in time.
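The deadline rule can be sketched in a few lines. This is a simplified model, not the Hub's real implementation: the function and field names are hypothetical, the response codes are placeholders, time is simulated rather than measured, and the repeated delivery of the reversal message isn't modelled.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

DEADLINE_SECONDS = 25  # response window described above


@dataclass
class Reply:
    code: str               # placeholder response code, e.g. "accepted"
    elapsed_seconds: float  # simulated time the receiving bank took to answer


def hub_relay(
    reference: str,
    receiving_bank: Callable[[str], Optional[Reply]],
    reversals: List[str],
) -> str:
    """Return the code the Hub passes back to the sending bank."""
    reply = receiving_bank(reference)
    if reply is not None and reply.elapsed_seconds < DEADLINE_SECONDS:
        return reply.code
    # No usable response in time: reject to the sender and queue a reversal
    # so the receiving bank knows not to credit the payment.
    reversals.append(reference)
    return "rejected: receiving bank did not respond in time"


reversals: List[str] = []
on_time = hub_relay("tx-1", lambda ref: Reply("accepted", 2.0), reversals)
too_late = hub_relay("tx-2", lambda ref: Reply("accepted", 30.0), reversals)
no_reply = hub_relay("tx-3", lambda ref: None, reversals)
```

A response the Hub ignores behaves like the `no_reply` case here: the sender sees a rejection and the receiver gets a reversal.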

If a sending bank doesn’t get a response from the Hub, they can send a repeat of the same request message. The Hub will see a unique transaction reference number within the message, and respond to it with the same response it sent previously (or relay it on to the receiving bank if it didn’t see it before).
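Handling repeats this way is a form of idempotency: the same request can be sent many times, but the payment is only relayed once. A minimal sketch, assuming the Hub keeps a lookup of responses keyed by transaction reference (the class and attribute names are hypothetical):

```python
from typing import Callable, Dict


class Hub:
    """Toy Hub that remembers its answer for each transaction reference."""

    def __init__(self, relay: Callable[[str], str]) -> None:
        self._relay = relay                   # forwards a request to the receiving bank
        self._responses: Dict[str, str] = {}  # reference -> response already sent
        self.relay_count = 0

    def handle(self, reference: str) -> str:
        if reference in self._responses:
            # Repeat of a request already seen: replay the earlier response.
            return self._responses[reference]
        # First sighting: relay to the receiving bank and remember the answer.
        self.relay_count += 1
        response = self._relay(reference)
        self._responses[reference] = response
        return response


hub = Hub(lambda reference: "accepted")
first = hub.handle("tx-42")
repeat = hub.handle("tx-42")  # same reference, so nothing is relayed a second time
```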

Once these processes are complete, both the sending bank and receiving bank (and the Hub!) have a synchronised understanding of whether or not you’ve made a payment.

What happened on 30th May

All times are in British Summer Time (BST), on the 30th of May 2019.

09:54 – We start receiving invalid payment messages from the third party Gateway we use to connect to the Faster Payments system.

Our third party Faster Payments Gateway begins corrupting approximately 25% of all payment messages sent through it, in both directions. All payment messages include a field that shows the date when the payment was sent. This usually shows the current date in a format like 20190530. The corruption changed this format to something different.

25% of outbound payments are delayed by a minute or two. The Hub ignored any message sent to it containing the invalid date format. So any outbound request message from Monzo would have a 25% chance of not making it to the Hub. If it did, there was a 25% chance the Gateway would introduce this corruption into the response too.

At Monzo, we have a mechanism for automatically resending request messages when we don’t get a response. This meant 75% of outbound payments were sent instantly as normal. And 25% of outbound payments went through eventually, but with a delay of a minute or two (while we repeatedly resent the request messages until they went through correctly).
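As a rough illustration of why resending recovers from this kind of loss, here is a small simulation. It is a sketch under assumptions, not Monzo's actual retry system: the function names are hypothetical, each attempt is treated as an independent 25% chance of being lost, and the waiting time between attempts is left out.

```python
import random
from typing import Callable, List, Optional, Tuple


def send_with_repeats(
    send: Callable[[], Optional[str]],
    max_attempts: int = 20,
) -> Tuple[Optional[str], int]:
    """Resend until a response arrives; return (response, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        response = send()
        if response is not None:
            return response, attempt
    return None, max_attempts


def lossy_channel(rng: random.Random, loss_rate: float) -> Callable[[], Optional[str]]:
    """Simulate a link that silently loses a fraction of attempts."""
    return lambda: None if rng.random() < loss_rate else "accepted"


rng = random.Random(2019)  # fixed seed so the run is repeatable
results: List[Tuple[Optional[str], int]] = [
    send_with_repeats(lossy_channel(rng, 0.25)) for _ in range(10_000)
]
first_try_share = sum(1 for _, attempts in results if attempts == 1) / len(results)
all_delivered = all(response == "accepted" for response, _ in results)
```

In this model roughly three quarters of payments succeed on the first attempt, and the rest succeed after one or more repeats.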

44% of inbound payments are credited, but then reversed. For inbound payments, our Processor was able to work despite the corrupted date field and credit customers correctly. Response messages have the same “date sent” inside them, and we copied the same corrupted date from the request into the response when creating the response message. This guaranteed that any corrupted inbound request message had a corrupted response message, which would be ignored by the Hub.

Our responses to any non-corrupted request messages also had a 25% chance of being corrupted by the Gateway, also causing the Hub to ignore these responses.
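The 44% figure comes from adding these two cases together. A quick check, assuming requests and responses were each corrupted independently with the same 25% chance (the independence is an assumption made here for the arithmetic):

```python
p_corrupt = 0.25  # chance the Gateway corrupts any single message

# Case 1: the inbound request is corrupted, so the response copies the bad date.
p_request_corrupted = p_corrupt

# Case 2: the request is clean, but the Gateway corrupts the response on the way out.
p_only_response_corrupted = (1 - p_corrupt) * p_corrupt

# Either case leads to the Hub ignoring the response and reversing the payment.
p_reversed = p_request_corrupted + p_only_response_corrupted  # 0.4375
```

That is 25% plus 18.75%, or about 44% of inbound payments.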

When the Hub ignored a corrupted response, it eventually gave up waiting for a correctly formatted response. It returned a rejection code to the sending bank, then sent a reversal message to Monzo telling us that the payment wasn’t successful. Customers impacted by this will have seen some kind of error screen from their sending bank saying that the payment didn’t send successfully. They’ll have seen the money credited to their Monzo account, before being debited again when we got a reversal message.

09:56 – Our monitoring system detects a high error rate processing Faster Payments messages and alerts our on-call engineers.

09:58 – We identify the corruption we’re seeing in some inbound Faster Payments messages. It’s immediately clear that the corrupted date error can’t have been introduced by any of our systems, as the date representation uses a data encoding format that we don’t use.

10:05 – We get confirmation from our third party Gateway provider that they’re seeing the same issues Monzo is reporting, and they confirm that the problem is on their side.

10:12 – We publish an incident on our status page indicating that we’re currently investigating problems with bank transfers.

10:13 – As a precaution, we pause our system that’s responsible for sending outbound repeat messages until we’ve assessed the impact of it. This is one of our standard procedures when we encounter any problems with our Faster Payments connection, as we want to avoid continually sending repeat requests to the Gateway until we’re confident there are no risks.

This means 25% of outbound payments don’t go through at all, and are queued up for later instead. Taking this precaution means 25% of outbound payment messages don’t reach the Hub, and don’t get repeated until we re-enable this system 19 minutes later.

10:19 – We decide that it’s prudent to turn on Gateway Stand-In, so the Gateway can accept messages on our behalf, and deliver them to our Processor later. We do this to avoid continuing to process payment messages which are corrupted. This is because processing corrupted payment messages without a full understanding of what impact the corruption might have is risky.

75% of inbound messages are now queued up for later. The Gateway queues these, but doesn’t start delivering them until 15:24.

We turn on Gateway Stand-In, so the Gateway can accept messages on our behalf, and deliver them to our Processor later.

25% of inbound payments continue to be rejected. The Gateway sends a response to the Hub indicating “this has been accepted, but not immediately applied to the recipient’s account”, but is still corrupting 25% of these response messages. The effect is the same as before, as the Hub doesn’t get these response messages.

10:32 – After concluding that there’s no risk to continuing to process outbound payments as normal, we unpause the Monzo system responsible for sending outbound repeat messages.

25% of outbound payments are delayed by a minute or two, and the rest are sent in real-time as normal. We quickly catch up with the backlog that has built up since 10:13.

12:00 (Approx) – Our Gateway provider restarts one of two sets of servers in one of their two datacentres. They later tell us they believed that datacentre was introducing the corruption, but they didn’t restart both sets of servers in that datacentre. The restart doesn’t fix the problem, because the problem was in the set of servers they didn’t restart.

13:08 – We deploy a change to our Processor to remove the corruption from inbound payment messages. The corrupted messages contain enough information to work out what the correct value should be, so we’re able to transform it back into the correct value.

We don’t immediately bring ourselves back out of Stand-In mode and start processing payments again, because we’re still assessing the potential impact of transforming messages before we process them.

13:20 – We identify a case which results in some corrupted reversal messages not being processed correctly. This is already fixed by our change to remove corruption from inbound payment messages, but some customers were left with money in their Monzo account that didn’t actually leave the sending bank.

We take action to reverse the payments that weren’t correctly reversed earlier, for customers who wouldn’t be brought into a negative balance by us doing so. We later get in touch with the customers who would have been brought into a negative balance, to let them know what happened.

13:59 – Having assessed the possible impact of the change we made to work around the corruption and decided there were no risks, we turn off Gateway Stand-In mode.

14:16 – We turn Gateway Stand-In mode back on, because we’ve identified another way that we could fail to reverse payments correctly for payments that went through Stand-In. This stops us processing more Stand-In messages while we fix the problem.

15:24 – We deploy a change to our Processor which fixes the problem with reversals, and turn off Gateway Stand-In mode. We stay out of this mode for the rest of the incident.

75% of inbound payments succeed in real-time. With Stand-In mode disabled, the Gateway begins sending Monzo payments.

25% of inbound payments are credited and then reversed. As the Gateway is continuing to corrupt our responses to inbound messages, the Hub ignores them and reverses these payments.

We begin processing delayed payments from the Gateway’s Stand-In queue. The Gateway starts sending us the delayed payment messages they’ve been accepting on our behalf.

15:48 – We post an update to our status page saying that we’re working through the backlog of inbound payments that were accepted on our behalf by Gateway Stand-In earlier in the day. We estimate we’ll be able to process all the payments in the backlog by 17:30, and that 25% of inbound payments will continue to fail and be reversed.

16:58 – We finish processing all the messages from the Gateway’s Stand-In queue. This means the majority of delayed payments are now credited, or reversed.

25% of inbound payments are still being credited and then reversed. The Gateway provider still haven’t fixed the bug that’s corrupting messages, so the problem persists.

We’ve credited the majority of delayed inbound payments. A smaller number of payments are still being queued by the Hub, which won’t send them to us until the Gateway has stopped corrupting messages.

17:20 – We publish a post on our blog about the incident.

17:21 – We post a status page update to let customers know that we’ve finished processing all the delayed payments that were queued by our Gateway while we were in Stand-In mode, but that 25% of inbound payments will still be reversed.

18:14 – We get an update from the Gateway provider on their progress diagnosing the problem. As before, they say they still see the corruption being introduced in only one of their two datacentres. So they tell us they’re going to start operating from just one of the datacentres, to stop sending payments through the one that might be faulty.

19:14 – We get an update from the Gateway provider, who now believe they’ve isolated the problem. They find the issue is with a system that translates payment messages from one internal format to another (they call this a “transformer”). When processing payments, they run these transformers four times (twice per datacentre). Only one of these transformers is corrupting payment messages, which explains why 25% of payments were corrupted. They let us know they’re going to turn off the faulty transformer, rather than turning off a whole datacentre.
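The 25% figure is what you would expect if messages were spread evenly across the four transformers and one of them was faulty. A small sketch, assuming even round-robin distribution; the instance names and the choice of which one is faulty are made up for illustration:

```python
from itertools import cycle

# Four transformer instances, two per datacentre; the names are hypothetical.
transformers = ["dc1-a", "dc1-b", "dc2-a", "dc2-b"]
faulty = {"dc2-b"}  # arbitrary choice; the source doesn't say which one failed

# Assumed: messages are handed out to the instances in strict rotation.
assignments = cycle(transformers)
handled_by = [next(assignments) for _ in range(1000)]

corrupted_share = sum(name in faulty for name in handled_by) / len(handled_by)
```

Under these assumptions, one message in four passes through the faulty instance. Restarting only the other instance in the same datacentre would leave that share unchanged.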

19:20 – The number of reversal messages we’re getting drops to zero. At the same time, we start getting responses to all of our outbound requests. At this point, the real-time impact on payments to and from Monzo accounts is over. Any new payments made after this time send and arrive within seconds.

All payments are now being processed successfully and in real-time. Some inbound payments are still being queued centrally by the Hub, but the majority of customer impact is now over.

19:25 – Our Gateway provider confirms they were able to fix the issue by removing the failed component.

20:03 – We post an update on our status page to let you know that Faster Payments are now operating in real-time again.

20:30 – After the Hub observes that we’ve been stable for some time, it starts sending payments that it queued internally during the incident.

When a bank tries to send a payment request and gets a rejection from the Hub indicating that the receiving bank didn’t respond in time, it has the option to re-send the same payment with a code telling the Hub to queue it until the recipient can receive it. Not all banks do this. Some just show some kind of “your payment failed to send” screen instead.

Graph showing the volume of Faster Payment request messages sent to Monzo from different banks.

We can infer from the messages that came in during this burst that the banks that do submit failed payments for the Hub to attempt again later are RBS / NatWest, Santander, HSBC, PayPal, TSB, Starling and Tesco.

20:45 – The Hub finishes sending its queued payments. At this point all delayed payments are now credited.

What our Gateway provider is doing to stop this happening again

Our Gateway provider found a bug that caused one of their systems to get stuck in a state where it corrupts all messages passing through it. They deployed a fix for this on Thursday the 13th of June.

The bug was in a computer program the Gateway uses to translate payment messages between two formats. When the program was operating under load, the system tried to clear memory it believed to be unused (a process known as garbage collection).

But because it was using an unsafe method to access memory, the code ended up reading memory that had already been cleared away, causing it not to know how to translate the date field in payment messages.

What we’re doing to stop this happening again

Longer term

We’re continuing with a project to bring our Faster Payments Gateway in house, so we don’t have to depend on a third party to handle bank transfers. This is a fairly large project, with long timescales. This kind of project usually takes over six months, from starting to going live. Fortunately, we started this process several months ago, when we experienced a number of smaller-scale incidents. We’re estimating that we’ll switch to our own Faster Payments Gateway in November 2019.

Immediate term

We’re considering changes to make our Faster Payments processor more robust in the face of unexpected data corruption. For example, we could refuse to process a message that has any invalid fields, especially fields that are used in the matching process (like unique IDs, dates etc).

This would have stopped us from ever processing an inbound corrupted message, and therefore eliminated the case where we sometimes didn’t process a reversal when we had previously credited a customer.
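A strict check of this kind could look something like the sketch below. It is only an illustration of the idea, not our Processor's code: the field names (`reference`, `date_sent`) and the `InvalidMessage` exception are hypothetical, and a real message has many more fields to validate.

```python
from datetime import datetime
from typing import Mapping


class InvalidMessage(ValueError):
    """Raised when a payment message fails validation and must not be processed."""


def validate_message(message: Mapping[str, str]) -> None:
    """Reject a message whose matching fields are missing or malformed."""
    reference = message.get("reference", "")
    if not reference.strip():
        raise InvalidMessage("missing transaction reference")

    date_sent = message.get("date_sent", "")
    # Require exactly eight ASCII digits before parsing, e.g. 20190530.
    if len(date_sent) != 8 or not (date_sent.isascii() and date_sent.isdigit()):
        raise InvalidMessage(f"date_sent is not in YYYYMMDD form: {date_sent!r}")
    try:
        datetime.strptime(date_sent, "%Y%m%d")
    except ValueError as exc:
        raise InvalidMessage(f"date_sent is not a real date: {date_sent!r}") from exc


def is_valid(message: Mapping[str, str]) -> bool:
    """Convenience wrapper that reports validity instead of raising."""
    try:
        validate_message(message)
    except InvalidMessage:
        return False
    return True
```

For example, `is_valid({"reference": "tx-1", "date_sent": "20190530"})` passes, while a message with `date_sent` set to `"2019-05-30"` or `"20191340"` is refused before any money moves.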


We’re really sorry this happened, but we’re committed to fixing it for the future. Let us know on social media or the community forum if you found this debrief useful, and share any other questions or feedback with us too.