At Monzo we process a lot of payments and it’s really important to us and our customers that we process them correctly. That means that we have to take a lot of care not to introduce any bugs when we deploy changes, especially when changing the hot path of our payment systems. However, at the same time we are a high-growth tech company, so shipping frequently is key to our success.
In theory, these two competing priorities should put us in a difficult spot, but in practice we manage to ship many times a day while making sure that payments are processed as expected. This blog post explores how we do this and focuses on two ideas that could be useful outside of the world of payment processing.
A primer on risk
To prevent payments from failing, we first need to know how they could fail. At Monzo we do this by periodically getting together as a team, looking at architecture diagrams and thinking about all the possible ways that a system could fail. We create a list of all of the failure modes of the system and everything we can do to prevent or detect these failure modes. In risk management, the failure modes are called risks and the measures that prevent or detect them are called controls.
A big risk in payment processing, and in tech in general, is introducing a bug. Because of the large volume of payments that we process, the impact of a bug on Monzo and our customers could be high. So we put controls in place to make the chance of this happening as small as possible. Some of these controls are commonplace in tech. We make extensive use of things like code reviews, automated testing, vulnerability scanning, and feature flagging.
In case we fail to prevent a bug from being introduced, we also have controls in place that can detect it after the fact. Apart from the usual suspects like monitoring and alerting, we have some controls here that are perhaps less common in tech: coherence checking and reconciliation. These controls are crucial to processing payments safely at scale. They could be useful to your team as well, even if you work in a different domain.
Coherence checks are rules that trigger when an object is in an invalid state. They notify you when a bad outcome has occurred, whether it happened because of an error, a bug in your business logic, or something else. This serves a similar, but subtly different, purpose to error reporting and alerting on metrics. Let’s take a look at an example to illustrate this.
The diagram below shows a simplified version of the state transitions of a payment, which we often refer to as its lifecycle. In this example, we receive a payment on day one, check whether we can match the payment to a customer’s account and if so, pay the customer the next day. If we can’t match the payment to a customer we return the payment.
Based on this payment lifecycle you could define coherence checks to make sure:
if we’ve not returned the payment, we should schedule an action to pay the recipient
if an action has been scheduled, the recipient should be paid the next day
These coherence checks could catch issues such as:
the scheduling service has a subtle bug that means 0.1% of payments are not being scheduled as expected
the scheduler is not scaling well with user growth and it’s not able to get through all scheduled actions within a day
there’s a bug in the logic that pays recipients
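As a rough illustration, the two checks above could be written as plain rules over a payment record. This is a minimal sketch, not Monzo's implementation: the `Payment` fields, the `incoherent_reasons` function and the assumption that matching finishes on the day the payment is received are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Payment:
    # Hypothetical record; field names are illustrative only.
    payment_id: str
    received_on: date
    returned: bool = False
    scheduled: bool = False
    paid_on: Optional[date] = None


def incoherent_reasons(payment: Payment, today: date) -> list[str]:
    """Return the coherence rules that this payment currently violates."""
    reasons = []
    # Rule 1: once the receipt day has passed, a payment that was not
    # returned must have a payout scheduled (assumes matching is done by then).
    if today > payment.received_on and not payment.returned and not payment.scheduled:
        reasons.append("not returned but no payout scheduled")
    # Rule 2: a scheduled payout must complete the day after receipt.
    due = payment.received_on + timedelta(days=1)
    if payment.scheduled:
        if payment.paid_on is None and today > due:
            reasons.append("scheduled but not paid by the due date")
        elif payment.paid_on is not None and payment.paid_on > due:
            reasons.append("paid after the due date")
    return reasons
```

Running a function like this over every open payment yields the list of payments that need attention, regardless of whether anything raised an error along the way.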
If we were just using error reporting and alerting, we could miss some issues that coherence checking catches. Consider the example where the scheduling service has a bug that means 0.1% of payments are not being scheduled as expected. If it’s a bug that doesn’t trigger an error, and we don’t have an alert, or its threshold is set too high, we could miss it.
That’s not to say that coherence checking is a perfect solution. After all, it’s only as good as the checks you’ve defined. We use it in combination with other controls to decrease the risk that we miss any issues with our payments.
In practice, the payment lifecycles are much more complex than in our example, and each payment system has its own set of rules. Often our payment systems will have dozens of coherence checks. We define them by becoming familiar with the payments system, doing risk assessments, and through customer escalations and incidents.
We implement coherence checks at Monzo in two ways. The first is by running models on the data about the state of each payment in our data warehouse. In essence they’re just fancy SQL statements that filter payments based on our defined rules. The second is as event-driven backend services that keep track of the state of a payment by listening to events coming from the payment systems. Both approaches work well for us.
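To make the event-driven approach more concrete, here is a minimal sketch of a tracker that follows each payment through the simplified lifecycle from the earlier example. The event names, the `ALLOWED` transition table and the `LifecycleTracker` class are assumptions made for illustration; a real service would consume events from a queue and persist its state.

```python
# Hypothetical event names for the simplified lifecycle described earlier.
# Keys are the last event seen (None = nothing yet); values are the events
# that may legitimately follow.
ALLOWED = {
    None: {"received"},
    "received": {"returned", "scheduled"},
    "scheduled": {"paid"},
    "returned": set(),
    "paid": set(),
}
TERMINAL = {"returned", "paid"}


class LifecycleTracker:
    def __init__(self):
        self.state = {}       # payment_id -> last valid event
        self.violations = []  # (payment_id, last valid event, rejected event)

    def handle(self, payment_id, event):
        """Record an event, or log a violation if the transition is invalid."""
        current = self.state.get(payment_id)
        if event not in ALLOWED[current]:
            self.violations.append((payment_id, current, event))
            return
        self.state[payment_id] = event

    def unfinished(self):
        """Payments that have not reached a terminal state yet."""
        return sorted(p for p, s in self.state.items() if s not in TERMINAL)
```

In this sketch, `violations` captures impossible transitions as they arrive, while `unfinished()` can be polled periodically to find payments that are stuck partway through the lifecycle.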
The core principle behind coherence checking, surfacing objects that are in an invalid state, can be applied to any system where correctness is important. For example, we also apply coherence checks to make sure all of our card orders are being sent out correctly.
Whereas coherence checks make sure that the state of your system is correct, reconciliation makes sure that the states of two different systems are in agreement.
In the case of payment processing we reconcile our internal ledger, which stores all transactions and is used to calculate account balances, with a settlement account. Settlement is the process that actually moves money between banks, which is often done in bulk. During the reconciliation process we verify that the sum of all movements in our internal ledger is the same as the amount of money that has entered or left our settlement account.
Let’s say we’ve processed two payments of £5 on a particular day, so we expect £10 to settle. However, the next day £15 is deposited into our settlement account. That means that we received £5 more than we expected. This is called a reconciliation break or reconciliation difference. It probably means that we failed to process one of the payments that we received.
We’ve implemented reconciliation using models that pull data from two tables. One table contains ledger entries and one contains bank statement lines. We compare them on a daily basis. If there are any reconciliation breaks, an engineer gets notified and fixes the issue, just like we do with coherence checking. We’ve also implemented a system that requires engineers to acknowledge each reconciliation break and give evidence when they resolve it, so that we have an audit log of how we dealt with each break.
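The comparison itself can be sketched in a few lines. The example below is an assumption-laden illustration rather than our actual models: it represents each table as a list of `(settlement_date, amount)` pairs, keys both sides by settlement date, and uses hypothetical function names.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal


def daily_totals(rows):
    """Sum amounts per day from (day, amount) pairs."""
    totals = defaultdict(Decimal)  # missing days default to Decimal("0")
    for day, amount in rows:
        totals[day] += amount
    return totals


def reconciliation_breaks(ledger_entries, statement_lines):
    """Return {day: bank total minus ledger total} for days that disagree."""
    ledger = daily_totals(ledger_entries)
    bank = daily_totals(statement_lines)
    breaks = {}
    for day in sorted(set(ledger) | set(bank)):
        difference = bank[day] - ledger[day]
        if difference != 0:
            breaks[day] = difference
    return breaks


# The example from the text: two £5 payments in the ledger, £15 settled.
settlement_day = date(2024, 1, 2)
ledger_entries = [(settlement_day, Decimal("5.00")), (settlement_day, Decimal("5.00"))]
statement_lines = [(settlement_day, Decimal("15.00"))]
breaks = reconciliation_breaks(ledger_entries, statement_lines)
```

Here `breaks` contains a single entry for `settlement_day` with a difference of £5, which is the reconciliation break an engineer would then investigate. `Decimal` is used instead of floats so that money amounts compare exactly.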
Reconciliation is a great catch-all. If an issue managed to slip past all of your preventative controls, and is not caught by your coherence checks (perhaps you missed an important rule), then reconciliation will catch it. The downside is that it’s not the fastest way to find out about issues, especially when your payment system settles on a daily basis.
There are also applications for reconciliation outside of finance. For example, you could reconcile the databases of two backend services that store different representations of the same underlying data. Or you could reconcile your internal state with the state of an external API that you’re using.
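As a sketch of what that could look like outside of finance, the function below compares two key-value snapshots of the same data, for example records exported from two services. The function name, the report keys and the sample data are hypothetical.

```python
def reconcile_records(source_a: dict, source_b: dict) -> dict:
    """Compare two {record_id: value} snapshots and report the differences."""
    keys_a, keys_b = set(source_a), set(source_b)
    return {
        # Records that only one side knows about.
        "missing_in_b": sorted(keys_a - keys_b),
        "missing_in_a": sorted(keys_b - keys_a),
        # Records both sides know about but disagree on.
        "mismatched": sorted(k for k in keys_a & keys_b if source_a[k] != source_b[k]),
    }


# Hypothetical snapshots from two services that track the same orders.
orders_service = {"o1": "shipped", "o2": "pending", "o3": "shipped"}
fulfilment_service = {"o1": "shipped", "o2": "shipped", "o4": "pending"}
report = reconcile_records(orders_service, fulfilment_service)
```

Any non-empty list in `report` is the equivalent of a reconciliation break: something the two systems disagree on that needs a human or an automated repair job to look at.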
Enforced correctness for peace of mind
Coherence checking and automated reconciliation are used very widely at Monzo. We use these controls for all our payment systems, as well as some other systems where we require correctness. As a result, we’ve got peace of mind that our customers' payments are being processed correctly and on time so that they can be confident in banking with Monzo.
Hopefully this blog post gave you some insight into how you can enforce correctness in your own systems by detecting and fixing issues as soon as they arise.
If you’re interested in building and running critical systems at scale, have a look at our open positions!