On the 29th of July, from about 13:10 onwards, you might have had some issues with Monzo. As we shared in an update at the time, you might not have been able to:
Log into the app
Send and receive payments, or withdraw money from ATMs
See accurate balances and transactions in your app
Get in touch with us through the in-app chat or by phone
This happened because we were making a change to our database that didn't go to plan.
We know that when it comes to your money, problems of any kind are totally unacceptable. We're really sorry about this, and we're committed to making sure it doesn't happen again.
We fixed the majority of issues by the end of that day. And since then, we've been investigating exactly what happened and working on plans to stop it happening again.
In the spirit of transparency, we'd like to share exactly what went wrong from a technical perspective, and how we're working to avoid it in the future.
We use Cassandra to store our data, and have multiple copies of everything
We use a database called Cassandra to store our data. Cassandra is an open-source, highly available, distributed data store. It spreads our data across multiple servers, while still serving it as one logical unit to our services.
We run a collection of 21 servers (which we call a cluster). And all the data we store is replicated across three out of the 21 servers. This means if something goes wrong with one server or we need to make a planned change, the data is entirely safe and still available from the other two.
Cassandra uses an element called the partition key to decide which three servers in the cluster of 21 are responsible for a particular piece of data.
Here we have a piece of data, represented by the pink square, which is set to the value T.
When we want to read data, we can go to any of the servers in the cluster and ask for a piece of data for a specific partition key. All reads and writes from our services happen with quorum. This means at least two out of three servers need to acknowledge the value before data is returned from or written to Cassandra.
The entire cluster knows how to translate the partition key to the same three servers that actually hold the data.
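As a rough sketch of the idea (this is a toy model, not Cassandra's actual token ring, and all the names here are illustrative):

```python
import hashlib

CLUSTER_SIZE = 21       # servers in the cluster
REPLICATION_FACTOR = 3  # every piece of data lives on 3 of them
QUORUM = 2              # 2 of the 3 replicas must agree on a value

def replicas_for(partition_key):
    """Toy stand-in for Cassandra's token ring: hash the partition key
    to a starting position, then take the next 3 servers around the
    ring. Every server computes the same answer, so a request can be
    sent to any of them."""
    token = int(hashlib.sha256(partition_key.encode()).hexdigest(), 16)
    start = token % CLUSTER_SIZE
    return [(start + i) % CLUSTER_SIZE for i in range(REPLICATION_FACTOR)]

# The same key always maps to the same three servers.
assert replicas_for("account-123") == replicas_for("account-123")
assert len(set(replicas_for("account-123"))) == REPLICATION_FACTOR
```

Because the mapping is deterministic, any server can work out which three servers to ask for a given key, and a quorum read only needs two of them to respond with the same value.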
We were scaling up Cassandra to keep apps and card payments working smoothly
As more and more people start using Monzo, we have to scale up Cassandra so it can store all the data and serve it quickly and smoothly. We last scaled up Cassandra in October 2018 and projected that our current capacity would tide us over for about a year.
But during this time, lots more people started using Monzo, and we increased the number of microservices we run to support all the new features in the Monzo app.
We run a microservice architecture, and the number of services we run is growing as we offer more features.
As a result, during peak load, our Cassandra cluster was running closer to its limits than we'd like. And even though this wasn't affecting our customers, we knew if we didn't address it soon we'd start seeing an increase in the time it'd take to serve requests.
For some of our services (especially those we use to serve real-time payments), taking longer to serve requests would mean we were slowing down the apps and card payments. And nobody wants to wait for a long time when they're at the front of the queue to pay for their shopping!
So we planned to increase the size of the cluster, to add more compute capacity and spread the load across more servers.
Adding more servers means we can spread the data more evenly and support a larger number of queries in the future.
What happened on 29th July 2019
This is the timeline of events that happened on the day. All times are in British Summer Time (BST), on the 29th of July 2019.
13:10 We start scaling Cassandra by adding six new servers to the cluster. We have a flag set which we believe means the new servers will start up and join the cluster, but otherwise remain inactive until we stream data to them.
We confirm there's been no impact using the metrics from both the database and the services that depend on it (i.e. server and client side metrics). To do this, we look for any increase in errors, changes in latency, or any deviations in read and write rates. All of these metrics appear normal and unaffected by the operation.
13:14 Our automated alerts detect an issue with Mastercard, indicating a small percentage of card transactions are failing. We tell our payments team about the issue.
13:14 We receive reports from Customer Operations (COps) that the tool they use to communicate with our customers isn't working as expected. This means we aren't able to help customers through in-app chat, leaving customers who'd got in touch with us waiting.
13:15 We declare an incident, and our daytime on-call engineer assembles a small group of engineers to investigate.
13:24 The engineer who initiated the change to the database notices the incident and joins the investigation. We discuss whether the scale-up activity could have caused the issue, but discount the possibility as everything appears healthy. There's no increase in errors, and the read and write rates look like they haven't changed.
13:24 Our payments team identify a small code error in one of our Mastercard services, where a specific execution path wasn't gracefully handling an error case. We believe this is the cause of the Mastercard issue, so get to work on a fix.
13:29 We notice that our internal edge is returning HTTP 404 responses.
Our internal edge is a service we've written and use to access internal services (like our customer operations tooling and our deployment pipeline). It does checks to make sure that we only give access to Monzo employees, and forwards requests to the relevant place.
A 404 response means our internal edge can't find the destination it's looking for.
13:32 We receive reports from COps that some customers are being logged out of both the Android and iOS apps.
13:33 We update our public status page to let our customers know about the issue. This also shows up as a banner in our apps.
"We're working to fix a problem that is causing some card transactions to fail, or the Monzo app to display incomplete or old information. If affected, you may also be unable to login or receive bank transfers."
13:33 The Mastercard fix is ready to deploy, but we notice our deployment tooling is also failing.
13:39 We deploy the Mastercard fix using a planned fallback mechanism. And we see an immediate improvement in card transactions succeeding.
13:46 We identify that our internal edge isn't correctly routing internal traffic. Beyond authentication and authorisation, it validates that a request is 'internal' by inspecting it and matching it against our private network range, which it fetches from our configuration service.
A configuration service is a Monzo microservice that provides a simple request-response (RPC) interface for other services to store and retrieve key-value pairs representing configuration. Internally, it uses Cassandra to store its data.
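Put together, the internal edge's check looks roughly like this (a sketch with illustrative names, not our actual code; a plain dict stands in for the Cassandra-backed configuration service):

```python
import ipaddress

# Stand-in for the configuration service's key-value store (backed by
# Cassandra in production). The key name here is illustrative.
config_store = {"internal-network-range": "10.0.0.0/8"}

def get_config(key):
    """Toy request-response lookup: None behaves like a 404."""
    return config_store.get(key)

def is_internal(request_ip):
    """What the internal edge does after auth: fetch our private
    network range from the configuration service and check the
    request's source address against it."""
    network_range = get_config("internal-network-range")
    if network_range is None:
        # If the key can't be found, the edge can't validate the
        # request, so it refuses to route it.
        return False
    return ipaddress.ip_address(request_ip) in ipaddress.ip_network(network_range)

assert is_internal("10.1.2.3")      # inside the private range
assert not is_internal("8.8.8.8")   # outside it
```

This is why a single missing key in the configuration service was enough to make the internal edge return 404s for everything.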
We conclude that either the configuration service is down, or our network range has changed. We quickly rule out the latter and focus on the service.
13:48 We try to get data from our configuration service directly and realise it's returning a 404 (not found) response for the key we attempted to retrieve. We're confused, as we believe that if the configuration service wasn't working, it'd have a much wider impact than we were seeing.
13:53 From the metrics, we see successful reads and writes to the configuration service, which is surprising given we've just seen it fail to retrieve data. It feels like we're seeing conflicting evidence.
13:57 We search for some other keys in the configuration service and realise they're still there.
14:00 We bypass the configuration service interface and query Cassandra directly. We confirm that the key used by the internal edge is in fact missing from Cassandra.
14:02 At this point we believe we're facing an incident where some of our data isn't available, so we turn our focus to Cassandra.
14:04 Despite our earlier fix for Mastercard, we can see it's still not fully healthy, so the payments team keeps working on it. This means a small number of card payments are still failing.
14:08 We query whether the new Cassandra servers have taken ownership of some parts of the data. We don't think this is possible given our understanding of what's happened so far, but we keep investigating.
14:13 We issue a query for some data in Cassandra and confirm the response is coming from one of the new servers. At this point, we've confirmed that something we thought was impossible had in fact happened.
The new servers had joined the cluster, assumed responsibility for some parts of the data (certain partition keys to balance the load), but hadn't yet streamed it over. This explains why some data appeared to be missing.
14:18 We begin decommissioning the new servers one-by-one, to allow us to return data ownership safely into the original cluster servers. Each node takes approximately 8-10 minutes to remove safely.
14:28 We remove the first node fully, and we notice an immediate reduction in the number of 404s being raised.
Our internal customer support tooling starts working again, so we can respond to customers using in-app chat.
15:08 We remove the final Cassandra node, and the immediate impact is over. For the majority of customers, Monzo starts working again as normal.
15:08 → 23:00 We keep working through all the data that had been written while the six new servers were actively serving reads and writes. We do this through a combination of replaying events which we store externally, and running internal reconciliation processes which check for data consistency.
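Conceptually, that reconciliation step looks like this (a toy sketch with made-up data; the real process replays events we store outside Cassandra and checks every row for consistency):

```python
# Events stored outside Cassandra act as the source of truth.
events = {"tx-1": "£5.00", "tx-2": "£3.50", "tx-3": "£7.25"}

# The database after the incident: tx-1 and tx-3 were written while
# the empty servers owned their partitions, so those rows are missing.
database = {"tx-2": "£3.50"}

repaired = []
for key, value in events.items():
    if database.get(key) != value:
        database[key] = value  # re-write the missing or stale row
        repaired.append(key)

assert database == events  # both stores agree after reconciliation
assert repaired == ["tx-1", "tx-3"]
```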
23:00 We confirm that all customers are now able to access their money, use Monzo as normal, and contact customer support if they need to.
We misunderstood the behaviour of a setting
The issue happened because we expected the new servers to join the cluster, and stay inactive until we carried out a further action. But in fact, when we added new ones they immediately took part in the reading and writing of data, despite not actually holding any data. The source of the misunderstanding was a single setting (or 'flag') that controls how a new server behaves.
In Cassandra, there's a flag called auto_bootstrap which configures how servers start up and join an existing cluster. The flag controls whether data is automatically streamed from the existing Cassandra servers onto a new server which has joined the cluster. Crucially, it also controls whether read requests continue to be served by the older servers until the newer servers have streamed all the data.
What we expected: new nodes join the cluster in an inactive mode, and get assigned a portion of the data. They remain in this state until we actively stream data into them, one-by-one.
In the majority of cases, it's recommended to leave the default of true. With the flag in this state, servers join the cluster in an 'inactive' state, have a portion of the data assigned to them, and don't serve reads until the data streaming process finishes.
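The flag is a single line in each server's cassandra.yaml (it's not usually present in the file, and defaults to true when absent):

```yaml
# cassandra.yaml (per server)
# When true (the default), a new node joins 'inactive' and streams
# data from the existing replicas before it serves reads.
auto_bootstrap: true
```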
But when our last scale up in October 2018 was complete, we'd set the auto_bootstrap flag to false for our production environment. We did this so that if we lost a server in our cluster (for example, due to hardware failure) and had to replace it, we'd restore the data from backups (which would be significantly faster and put less pressure on the rest of the cluster) rather than rebuild it from scratch using the other servers which had the data.
During the scale-up activity, we had no intention of streaming the data from backups. With auto_bootstrap set to false, we expected the six new servers would be added to the cluster, agree on the partitions of data they were responsible for, and remain inactive until we initiated the rebuild/streaming process on each server, one-by-one.
But this wasn't the case. It turns out that in addition to the data streaming behaviour, the flag also controls whether new servers join in an active or inactive state. Once the new servers had agreed on the partitions of data they were responsible for, they assumed full responsibility without having any of the underlying data, and began serving queries.
Because some of the data was being served from new servers that didn't hold any of it yet, that data appeared to be missing.
So when some customers opened the app, for example, transactions that should have existed couldn't be found, which caused their account balances to appear incorrectly.
Once the issue had been resolved, we were able to fully recover the data and correct any issues.
To stop this happening again, we're making some changes
There are a few things we can learn from this issue, and fix to make sure it doesn't happen again.
We've identified gaps in our knowledge of operating Cassandra
While we routinely perform operations like rolling restarts and server upgrades on Cassandra, some actions (like adding more servers) are less common.
We'd tested the scale-up on our test environment, but not to the same extent as production.
Instead of adding six servers, we tested with one.
To gain confidence in the production rollout, we brought a new server online and left it in the initial 'no data' state for several hours. We did this across two clusters.
We were able to confirm that there was no impact on the rest of the cluster or any of the users of the environment. And at this point we were happy that the initial joining behaviour of auto_bootstrap was benign, as we expected. So we continued to stream the data to the new server, monitored throughout, and confirmed there were no issues with data consistency or availability.
What we failed to account for was quorum (the three servers agreeing on a value). With only one new server, it didn't matter that it joined the cluster fully while holding no data: for any piece of data, the other two servers in the replica set still agreed on the value. But when we added six servers to production, some partitions had two or even all three of their replicas reallocated to the new nodes. Because the underlying data hadn't moved with the ownership, a majority of the servers responsible for those partitions held nothing, and we lost the quorum guarantee.
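The quorum arithmetic can be sketched as a toy read (illustrative, and a simplification of Cassandra's real read path):

```python
def quorum_read(replicas, quorum=2):
    """Toy quorum read: return the value if at least `quorum` of the
    three responsible servers agree on it, else report a miss."""
    values = [replica.get("balance") for replica in replicas]
    for value in set(values):
        if value is not None and values.count(value) >= quorum:
            return value
    return "not found"  # surfaces as a 404 to the caller

# Test environment: one new, empty server joined the replica set.
# Two of the three owners still hold the data, so quorum succeeds.
assert quorum_read([{"balance": 100}, {"balance": 100}, {}]) == 100

# Production: ownership of this partition moved to two new, empty
# servers. A majority of the owners hold nothing, so the data
# appears to be missing even though one server still has it.
assert quorum_read([{"balance": 100}, {}, {}]) == "not found"
```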
We've already fixed the incorrect setting
We've already fixed our use of auto_bootstrap. And we've also reviewed and documented all our decisions around the other Cassandra settings. This means we have more extensive runbooks – the operational guides we use to provide step-by-step plans on performing operations such as a scale up or a restart of Cassandra. This'll help fill the gaps in our knowledge, and make sure that knowledge is spread to all our engineers.
Another key issue that delayed our response was the lack of metrics pointing to Cassandra as the primary cause of the issue. So we're also looking at exposing more metrics, and adding alerting for sharp shifts in metrics like 'row not found'.
We'll split our single Cassandra cluster into smaller ones to reduce the impact one change can have
For a long time, we've run a single Cassandra cluster for all our services. Each service has its own dedicated area (known as a keyspace) within the cluster. But the data itself is spread across a shared set of underlying servers.
The single cluster approach has been advantageous for engineers building services – they don't have to worry about which cluster to put a service on, and we only have to operationally manage and monitor one thing. But a downside of this design is that a single change can have a far-reaching impact, like we saw with this issue. We’d like to reduce the likelihood that any single activity can impact more than one area of Monzo.
With one cluster, it's also much harder to pinpoint the source of failure. In this instance, it took almost an hour from the first alert to the point where we pulled the information into a single coherent picture that put Cassandra at fault. With smaller and more constrained system configurations, we believe this would have been a much simpler issue to deal with.
In the long term, we plan to split up our single large cluster into multiple smaller ones. This will drastically reduce the likelihood and impact of repeat issues like this one, and make it safer for us to operate at scale. We want to make sure we get it right, though: doing it in a way that doesn't introduce too much operational complexity, or slow our engineers down in releasing new features to customers.
We’re really sorry this happened, and we’re committed to fixing it for the future. Let us know if you found this debrief useful, and share any other questions or feedback with us too.