At Monzo our goal is to make money work for everyone, and what better way to do that than by giving our customers their money as soon as we can? That’s why we built our Get Paid Early feature, also known as GPE. This lets customers request upcoming payments a day early (technically three days early if the payment is due on a Monday, as you can get it on Friday ready for the weekend!). Unsurprisingly, this is a much-loved and heavily used feature, and while we love being able to offer it, it isn’t without its challenges.
In this post I'll explain the problems that occur when hundreds of thousands of customers all request to be paid millions of pounds at almost exactly the same time, and what we do to keep things running smoothly while that happens.
Getting Paid Early
You might be wondering why paying people early poses any challenge at all. After all, paying people is literally what banks do… and you’d be right! However, GPE relates to Bacs payments, and due to Bacs rules it’s possible for us to let people request their payments at 4pm the day before they’re due. This means that if we have 250,000 eligible payments on any one day, we can expect up to 250,000 requests. What we didn’t expect was that a huge wave of these requests would hit us at exactly 4pm, which seems obvious in hindsight because who wouldn’t want to get their money as soon as they could?
These bursts of traffic mean that the workload of our microservices increases by orders of magnitude within the space of about 30 seconds. And not just for the microservices directly involved in making Get Paid Early requests, but for services across our entire platform, as customers get their money and then start doing things with it like paying bills, topping up pots, paying shared tabs, or even making ATM withdrawals.
These large, sudden spikes of traffic put a huge amount of strain on our services, and could potentially destabilise our entire platform. This risk is even greater at the end of the month when a lot of salary payments are due or around bank holidays when multiple days of payments are all chunked together.
Why we can’t use auto-scaling
Increased traffic is a problem that's really nice to have as it means that lots of people are using the things you're building, and the tried and tested solution is to scale the services that are doing more work. For example, if your service is doing ten times the work, add ten times the resources. Simple! However, adding resources costs money and uses more energy, so if our service does almost no work at 2am but a lot of work during GPE, we don’t want it to be super scaled up all the time as we’d be burning money and wasting energy.
The typical solution to this is “auto-scaling”, where a service is monitored and automatically scaled up or down in response to how much work it's doing. Unfortunately this won’t work for supporting Get Paid Early traffic: auto-scaling takes some time to kick in, and more time still for extra instances to start up and begin serving traffic. For cases where traffic increases gradually over minutes or more, auto-scaling works perfectly, but when traffic increases by over 500% in a matter of seconds it simply can’t react quickly enough.
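To make that concrete, here’s a rough back-of-the-envelope sketch of how many requests pile up while reactive auto-scaling is still catching up. All the numbers here are invented for illustration; they aren’t Monzo’s real figures.

```python
# Illustrative only: why reactive auto-scaling loses the race against an
# instant traffic spike. All numbers are invented for the example.

def backlog_during_scale_up(
    spike_rps: float,     # requests per second once the spike hits
    capacity_rps: float,  # what the current instances can comfortably serve
    reaction_s: float,    # time for auto-scaling to notice and decide
    startup_s: float,     # time for new instances to boot and take traffic
) -> float:
    """Requests queued up before any extra capacity arrives."""
    lag = reaction_s + startup_s
    excess = max(spike_rps - capacity_rps, 0.0)
    return excess * lag

# A 500% spike: 6,000 rps against 1,000 rps of capacity, with a 30 second
# metrics/decision lag and a 60 second instance start-up time.
print(backlog_during_scale_up(6000, 1000, 30, 60))  # → 450000.0
```

Even with fairly optimistic lag numbers, hundreds of thousands of requests can queue up before the first new instance serves a single one.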
So, while we use auto-scaling to handle day-to-day increases in user traffic (for example, as people start waking up and going about their lives: buying a coffee in the morning, checking their balance, withdrawing money), if we relied solely on it for GPE events we would see Bad Things™ happen when the traffic hit us. Services would immediately become overloaded with requests, and although they’d serve as many as they could as quickly as they could, they would almost certainly run out of memory and die. Requests would queue up faster than we could process them and start to time out. As services were brought back online they’d instantly get overloaded and killed again. We’d see a domino effect as errors from one service bubbled out to others. The impact would be wide: our customers would start to see errors in the app, and instead of getting paid early, they’d get annoyed. We really don’t want this to happen.
So what can we do? We know that we need to scale the platform, but we can't do it reactively. Instead we need to scale preemptively and, while we don’t have a crystal ball, thanks to a combination of Bacs payments alerts and observability tools, we can do just that.
Each day we record how many Bacs payments are going to be eligible to be paid early. This tells us the maximum number of GPE requests we could get, but it doesn’t tell us how much impact we should expect across the various services that customers interact with. For this, we rely on our monitoring tool of choice, Prometheus.
Here’s a graph showing the CPU used by our `account` service over a few days. The X axis (left to right) shows time passing, and the Y axis (up and down) shows CPU utilisation for the service. You can see CPU utilisation grow during the day as our customers start using their app, and shrink at night as everyone goes to bed. There’s always a little bump around lunchtime as everyone gets something to eat, which is really cute! Can you see where a particularly large GPE event happened? For bonus points, can you see the batch jobs that we run in the dead of night?
Using Prometheus we can see how much CPU our services used historically, and because we record how many Bacs payments we have each day, we can look up CPU utilisation at the time of a particular sized GPE event. This means we can accurately predict how to scale our platform for a specific volume of Bacs payments.
For example, if 150,000 payments are eligible to be paid early today, and nine days ago we had 160,000 eligible payments, we can look at the maximum CPU utilisation nine days ago and use it to determine how much to scale up by today. The only thing we don’t know is which services to scale…
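A minimal sketch of that lookup. The data and names here are hypothetical; in reality the peak utilisation figures come from Prometheus queries over our recorded metrics.

```python
# Hypothetical sketch: given today's eligible-payment count, find the most
# similar past GPE day and use its peak CPU (plus headroom) as the target.

# (date, eligible_payments, peak_cpu_cores) — made-up historical records.
history = [
    ("2023-11-24", 160_000, 48.0),
    ("2023-11-30", 120_000, 37.5),
    ("2023-12-01", 90_000, 29.0),
]

def scaling_target(todays_payments: int, headroom: float = 1.2) -> float:
    """Peak CPU of the closest historical day, plus a safety margin."""
    _, _, peak_cpu = min(history, key=lambda row: abs(row[1] - todays_payments))
    return round(peak_cpu * headroom, 1)

print(scaling_target(150_000))  # closest day is the 160k one → 57.6
```

The headroom multiplier is our own invention for the sketch: scaling to exactly the historical peak would leave no margin if today’s event runs slightly hotter.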
The final piece of the puzzle
Remember earlier when I said that after people get paid early they immediately start doing things with their money? Well, these things involve lots of microservices working together and as we have over 2,500 microservices, knowing which ones to scale because a customer might interact with them is quite a challenge.
Fortunately, we have one final tool that we can use to help us identify exactly which services we need to take care of. This tool is called Jaeger. We use Jaeger to record “traces” of our requests and these traces show us every single service involved from beginning to end.
Here’s a diagram of what a trace looks like. In this example we can imagine that a request originates at Service A, and passes through B, C, D and E while being processed. Each time it passes through a service a “span” is recorded, and the collection of spans forms the “trace”. The “trace” therefore represents every service involved in satisfying the request.
Between our systems and our customers sits our “edge proxy”, so by analysing traces for requests that originated from our edge proxy we can find all the services that customers can interact with, and by extension the ones that we should consider scaling. For example, when you open your app your feed gets refreshed. This involves a request being sent to our feed service, which then makes a request to a bunch of other services, and some of those will make requests to even more services. This means that many microservices can be involved in responding to a single user request, and it’s these sorts of paths through our system that Jaeger “traces”.
We do this on-demand each time we prepare for a GPE event, which is really neat as it means that as we add new microservices they are automatically picked up and scaled with no additional human effort required!
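Here’s a simplified sketch of that trace analysis. Real Jaeger spans carry far more detail (trace IDs, timestamps, tags, parent references), and the service names below are made up.

```python
# Hypothetical sketch: collect the set of services reachable from customer
# requests by keeping only traces that entered via the edge proxy.

def services_in_trace(spans):
    """Every service touched by a trace, if it entered via the edge proxy."""
    services = {span["service"] for span in spans}
    return services if "edge-proxy" in services else set()

# Two made-up traces: one from the edge proxy (a customer request), and one
# internal trace (a batch job) that we don't need to scale for.
customer_trace = [
    {"service": "edge-proxy"},
    {"service": "service.feed"},
    {"service": "service.account"},
]
internal_trace = [
    {"service": "service.cron"},
    {"service": "service.account"},
]

to_scale = services_in_trace(customer_trace) | services_in_trace(internal_trace)
to_scale.discard("edge-proxy")
print(sorted(to_scale))  # → ['service.account', 'service.feed']
```

Because this runs over whatever traces exist at the time, a brand-new service that starts appearing in customer request paths is picked up automatically.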
Putting it all together
Using Jaeger to identify every service we should scale, Bacs records to find a day where we had a similar number of payments, and Prometheus to find the maximum CPU utilisation on that date, we now know which services to scale, and how much to scale them by. This means that we can boost our platform ahead of a GPE event, making sure that we can handle it comfortably, and finally roll everything back afterwards so that we don’t waste money and energy. And that’s exactly what we do.
To make sure that managing all of this scaling doesn’t become a problem that engineers need to spend time on, we use automated jobs to generate a scaling plan, execute all of the scaling, and then roll everything back afterwards. This means that our engineers can focus on what they do best, being creative and solving problems, while platform stability takes care of itself.
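A hypothetical sketch of what those automated jobs do end to end: build a plan from the trace-derived service list and a scale factor derived from historical utilisation, apply it, then roll it back. `apply_scale` is a stand-in for whatever actually resizes deployments (in practice, calls to the orchestration layer).

```python
# Hypothetical sketch of the plan/apply/rollback flow. Service names,
# replica counts, and the scale factor are all invented for illustration.

def build_plan(services, current_replicas, scale_factor):
    """Map each customer-facing service to its boosted replica count."""
    return {
        svc: max(current_replicas[svc], round(current_replicas[svc] * scale_factor))
        for svc in services
    }

def apply_scale(plan):
    for svc, replicas in sorted(plan.items()):
        print(f"scaling {svc} to {replicas} replicas")  # placeholder side effect

current = {"service.feed": 10, "service.account": 20}
plan = build_plan(current.keys(), current, scale_factor=3.0)

apply_scale(plan)     # boost ahead of the 4pm event
apply_scale(current)  # roll back afterwards so we stop burning money
```

The `max` guard ensures the plan never scales a service below its current size, even if the computed target is smaller.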
Here’s what the scaling looks like for a single service:
On this graph, the X axis (left to right) shows time passing, and the Y axis (up and down) shows CPU utilisation and allocation. The green line shows how much work the service was actually doing, and the yellow dotted line shows its total CPU allocation. We can see auto-scaling happening in the morning as traffic gradually increases, and at ~12pm you can see our pre-emptive scaling for the GPE event. At 3pm UTC (4pm BST) you can see a spike in CPU due to the event itself, followed by a graceful scale down afterwards. Notice how when our GPE event happened we had plenty of headroom? This means that even though we were getting slammed by traffic, the service was in no danger of being overloaded. If you looked at this graph for any of our other services that do more work during GPE, it would look pretty similar to this.
The great thing about this solution is that it is completely dynamic. It decides how much to scale things by based on the actual traffic we expect, in combination with real-world utilisation data for similar GPE events. This means that whether we expect 10,000 payments, or 100,000 payments, we’ll always scale up by the right amount. Indeed, this system worked flawlessly to scale our platform up for our biggest ever GPE event which happened just before Christmas, and saw over 260,000 customers get paid a total of £210 million ahead of the holidays.
Solving unique problems like how to safely support Get Paid Early is not just hugely enjoyable from a problem solving perspective, but is also incredibly satisfying. It’s a great feeling to relieve engineers of the burden of manually managing stressful things like this, and an even better one to see hundreds of thousands of customers all get paid early without the platform breaking a sweat (not to mention seeing the huge drop in platform costs as a result of our efforts!).
If you'd like to solve problems like this and work on things that make thousands and thousands of people's days better, come join us - we have a few open roles across Monzo including: