As is the case in almost any industry, we have incidents at Monzo. While this sounds a little scary, the term "incident" just means something going wrong or not working as expected.
Incidents can happen anywhere, and cover everything from office building issues, to technology outages that impact our customers. We can’t stop incidents from happening, but we can make sure we’re ready to deal with them. And we can use them as a way to learn more about how things really work.
In previous posts we’ve shared how we monitor Monzo so we know what’s going on with our systems, and how we structure our on-call teams so we’re ready to fix issues day or night. The final piece of the puzzle is how we respond to incidents. A good response process can mean the difference between something being resolved in minutes and an issue developing into something much worse.
We’ve spent a lot of time refining our processes to make sure we deal with incidents swiftly, and with as little customer impact as possible. So we'd like to share what we've learned.
What is incident response?
Incident response is a broad term which describes the processes we follow when something unexpected happens. In addition to dealing with and fixing the actual issue, there are a few other things we have to think about, including:
Keeping our Customer Operations team informed

If there’s an issue that might affect our customers, it’s important that our Customer Operations team are aware so they can be ready to support them.
Communicating with our customers
If there’s an issue which might affect our customers, we use our Status Page to let them know what’s happening. We use the status page liberally, often even when an incident only affects a minority of our customers. We know this risks us being perceived as less reliable than other banks, but we default to transparency (and it's the right thing to do!).
Clearly defining roles
Every incident needs to have a lead who's responsible for coordinating the efforts of others and making sure we do everything that’s expected. Interestingly, this role has parallels in other industries like firefighting, where an incident commander is responsible for organising everyone on the scene of a fire.
Getting hold of the right people
We have two people on-call at any time, and for the majority of incidents they’re able to fix the issue without escalating to anyone else. Sometimes though, they’ll need to pull in other engineers. We have a process for escalating to other on-call engineers across all areas of Monzo.
Keeping track of what’s going on
When things go wrong, it’s important that we use our experiences to learn. We report all incidents and keep thorough records of what happened.
Incidents are often high pressure situations, so we want to make it as easy as possible for our engineers to do the right things whilst they also fix the problem. Rather than defining processes and writing procedures (these are rarely followed in the heat of the moment), we decided to build a tool to help out. It’s called Response ⚡ and we’ve open-sourced it so you can use it too!
Using Slack for more than just messaging
We use Slack to communicate across all areas of Monzo, and incidents are no exception. Slack is more than just a messaging platform; it also allows developers to create applications built around conversation - a paradigm known as ChatOps. Our incident tool works in this way, and it allows us to manage and coordinate incidents without leaving the conversation. By keeping everything in one place, it's far easier to retain context on what's going on, and it reduces the effort required to get things back into good shape.
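At its core, a ChatOps bot like this is a small command dispatcher: it parses messages of the form @incident &lt;command&gt; &lt;args&gt; and routes them to registered handlers. Here's a minimal sketch of that pattern in Python; the decorator name, handler signatures, and reply text are illustrative, not Response's actual code.

```python
# Minimal ChatOps-style command dispatcher, in the spirit of an incident
# bot. All names here are illustrative assumptions, not the real API.
from typing import Callable, Dict

HANDLERS: Dict[str, Callable[[str, str], str]] = {}

def incident_command(name: str):
    """Register a handler for '@incident <name> <args>' messages."""
    def register(func: Callable[[str, str], str]):
        HANDLERS[name] = func
        return func
    return register

@incident_command("lead")
def set_lead(user: str, args: str) -> str:
    # A real handler would also update the main post and live doc.
    return f"Incident lead set to {args} by {user}"

def dispatch(user: str, message: str) -> str:
    """Parse '@incident <command> <args>' and route to a handler."""
    parts = message.split(" ", 2)  # ["@incident", "<command>", "<args>"]
    command = parts[1] if len(parts) > 1 else ""
    args = parts[2] if len(parts) > 2 else ""
    handler = HANDLERS.get(command)
    if handler is None:
        return f"Unknown command '{command}'"
    return handler(user, args)
```

Registering each command with a decorator keeps the bot easy to extend: adding a new incident task is just one new function.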
Declaring an incident
When an incident is detected, we can report it with a simple Slack command.
When the incident is declared, a post is added to our #incidents channel which shows what was reported, the person who reported it, and a link to a live incident document. The live doc is a web interface that shows much of the same information at this stage, but becomes a more comprehensive record as the incident unfolds.
We can also page our first line on-call engineers from here, for example when our COps need to alert them of an issue being reported by our customers. Clicking the button triggers PagerDuty (our on-call scheduling and alerting tool) to call and message them at any time of the day.
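Pressing that button ends up triggering a page through PagerDuty. A common way to do this programmatically is PagerDuty's Events API v2, which accepts a JSON "trigger" event. The sketch below shows the shape of that call; the routing key, source name, and helper names are placeholders, and this isn't necessarily how Response wires it up internally.

```python
# Sketch of paging on-call via PagerDuty's Events API v2. The routing key
# and summary are placeholder values.
import json
import urllib.request

PAGERDUTY_URL = "https://events.pagerduty.com/v2/enqueue"

def build_trigger_event(routing_key: str, summary: str,
                        severity: str = "critical") -> dict:
    """Build an Events API v2 'trigger' payload."""
    return {
        "routing_key": routing_key,   # identifies the service/escalation policy
        "event_action": "trigger",
        "payload": {
            "summary": summary,       # what the on-caller sees when paged
            "source": "incident-bot",
            "severity": severity,
        },
    }

def page_on_call(routing_key: str, summary: str) -> None:
    """POST the event; PagerDuty then calls/texts whoever is on-call."""
    body = json.dumps(build_trigger_event(routing_key, summary)).encode()
    req = urllib.request.Request(
        PAGERDUTY_URL, data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```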
Creating a dedicated place to communicate
Another important feature of the post in #incidents is the button allowing us to create a dedicated channel for coordination. Pressing this creates a new channel for the incident, and provides a link on the main post.
When the channel is created, our incident bot is automatically added, making it available to help out with common incident tasks. We've defined a number of helpful commands to streamline these or automate them away altogether.
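Creating that channel and inviting the bot maps onto two Slack Web API methods, conversations.create and conversations.invite. A sketch of the flow, assuming an illustrative inc-&lt;id&gt;-&lt;slug&gt; naming scheme (Slack channel names must be lowercase, without spaces, and at most 80 characters):

```python
# Sketch of creating a dedicated incident channel via Slack's Web API.
# The token, bot user ID, and naming scheme are assumptions.
import json
import re
import urllib.request

SLACK_API = "https://slack.com/api"

def channel_name(incident_id: int, title: str) -> str:
    """Build a valid Slack channel name: lowercase, no spaces, <=80 chars."""
    slug = re.sub(r"[^a-z0-9-]+", "-", title.lower()).strip("-")
    return f"inc-{incident_id}-{slug}"[:80]

def slack_call(token: str, method: str, **params) -> dict:
    req = urllib.request.Request(
        f"{SLACK_API}/{method}",
        data=json.dumps(params).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def create_incident_channel(token: str, incident_id: int,
                            title: str, bot_user: str) -> str:
    """Create the channel, invite the bot, return the channel ID."""
    created = slack_call(token, "conversations.create",
                         name=channel_name(incident_id, title))
    channel = created["channel"]["id"]
    slack_call(token, "conversations.invite", channel=channel, users=bot_user)
    return channel
```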
Setting the incident lead
Every incident needs a lead, and it’s useful for everyone to know who it is. We can assign the incident lead with a simple command.
@incident lead @chris
This will update the main post, the live document, and log the time that the lead changed in the document timeline - lots of information for a simple action!
We can also use the @incident summary and @incident impact commands to keep the summary and impact up to date with the latest information. Everything entered through these commands is automatically reflected in the live incident doc.
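The useful property here is that one command fans out to several places: the field changes, and a timestamped entry lands in the document timeline. A sketch of that pattern, with an illustrative Incident class rather than Response's actual models:

```python
# Sketch: a single field update that also leaves an audit trail in the
# incident timeline. The Incident class is illustrative.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Tuple

@dataclass
class Incident:
    report: str
    lead: str = ""
    summary: str = ""
    impact: str = ""
    timeline: List[Tuple[str, str]] = field(default_factory=list)

    def update(self, field_name: str, value: str) -> None:
        """Set a field and record when/what changed in the timeline."""
        setattr(self, field_name, value)
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.timeline.append((stamp, f"{field_name} set to {value!r}"))
```

This is why a "simple action" like @incident lead produces lots of information: the record, the live doc, and the timeline all hang off the same update.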
Updating our Status Page
We have a status page which we use to communicate any customer impacting incidents. When the status page is updated, it triggers a post to our @MonzoStatus account on Twitter and, depending on severity, may show a banner or popup within the Monzo mobile apps. We can update the status page from within the incident channel using @incident statuspage, which brings up a simple dialog within Slack.
Escalating to specialists
For more complex incidents, we might need to bring other on-call engineers on board to help out. When we trigger the @incident escalate command, a list of the teams available as escalation points is posted. Clicking a team's button shows a prompt where a custom message can be entered.
Once submitted, our paging tool will call them and a friendly(!) robotic voice will read the message. With this, we can escalate to several teams in a matter of seconds, and shave valuable minutes off the time it takes to resolve the issue.
Logging actions for follow-up

We often find ourselves off the beaten track, and our ability to make decisions and changes on-the-fly is something we value greatly, but it's important we leave a breadcrumb trail to trace our steps afterwards. To allow this we've implemented a command to log an action for follow-up at a later time, and we use it for everything from "remember to speak to person X about why this is the way it is" through to "implement a permanent fix for Y". Making it seamless and consistent to log these points allows us to capture actions in real time and be thorough when we follow up later.
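The action log itself can be very simple: capture who raised what, then list anything still outstanding when it's time to follow up. A sketch, with illustrative names:

```python
# Sketch of an incident follow-up action log. Names are illustrative.
from dataclasses import dataclass
from typing import List

@dataclass
class Action:
    text: str
    raised_by: str
    done: bool = False

ACTIONS: List[Action] = []

def log_action(user: str, text: str) -> Action:
    """Handler for an '@incident action ...' style command."""
    action = Action(text=text, raised_by=user)
    ACTIONS.append(action)
    return action

def outstanding() -> List[Action]:
    """Everything still waiting for follow-up after the incident."""
    return [a for a in ACTIONS if not a.done]
```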
Whilst these actions might seem like trivial improvements in isolation, the accumulation of them adds up to less context switching, more time to focus on the issue, a better experience for our engineers, and an overall swifter resolution for our customers.
Making it easy to do the right thing
Incidents can be stressful, and despite our best efforts and intentions, the processes and procedures we think up in the calm of the working day rarely get followed exactly when it's 2am and something is broken. Our procedures aren't overly bureaucratic, and our on-callers aren't deliberately flouting the rules, so why does this happen? The problem is cognitive load, or more simply the number of tasks a person has to think about and deal with at any given time. To help make things better, we use a combination of automation, nudge theory, and contextually relevant prompts to make it easy to do the right thing. Broadly speaking, we want to make it so easy to do the right thing that people would have no reason to do anything else.
Setting the severity correctly
We try to classify all incidents with a severity rating; one of critical, major, minor, or trivial. Whilst defining the severity is far from an exact science (incidents and their impacts can be quite nuanced), we find it valuable to make a judgement and set one.
We often don't know the extent of the problem at the start of an incident, so the guidance we give is to set the severity as soon as it's known, and ideally within the first 15 minutes. In the heat of the moment this is easy to forget, so our bot posts a prompt every 15 minutes until it’s set.
This helps make sure a severity is set, but it doesn’t help to make sure it’s the right one. We’ve spent some time defining guideline criteria for each severity, but nobody has time to consult a document like this in the heat of an incident. Instead, whenever the severity is updated we post the corresponding criteria alongside it, and prompt engineers to check it looks right, or adjust it if necessary.
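The nudge logic is a small periodic check: if the severity is still unset and the last prompt is at least 15 minutes old, post another one. A sketch of the timing logic (the Slack posting is stubbed out, and the function name is an assumption):

```python
# Sketch of the 15-minute severity nudge. Timing logic only; posting the
# Slack prompt itself is out of scope here.
from datetime import datetime, timedelta, timezone
from typing import Optional

REMIND_EVERY = timedelta(minutes=15)

def needs_severity_nudge(severity: Optional[str],
                         last_prompt: Optional[datetime],
                         now: Optional[datetime] = None) -> bool:
    """True when the severity is unset and the last prompt is stale."""
    now = now or datetime.now(timezone.utc)
    if severity is not None:
        return False  # severity set: stop nagging
    return last_prompt is None or now - last_prompt >= REMIND_EVERY
```

A periodic job runs this against every open incident, so the prompt keeps arriving until someone sets a severity.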
Automatically notifying the right people
There are occasions when we would like to share updates with our regulators. We have an internal team who handle regulator communication, and they provide detailed updates whenever there is a sufficiently serious issue affecting our customers. We used to rely on our on-callers to escalate to the regulator comms team when they felt it relevant, but this was highly subjective and prone to error.
Since we always put the status page up for such issues, we’ve hooked into this same process to automatically notify our regulator comms team, pulling them in as needed. To make this even more responsive, we also notify them when the status page is mentioned in the incident channel, which is almost always the case when the on-call engineers are deciding how best to communicate the issue clearly.
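That mention hook can be as simple as a keyword match over channel messages. A sketch, assuming a regex trigger (the exact phrases Response watches for are an assumption):

```python
# Sketch: watch incident-channel messages for status page mentions so the
# regulator comms team can be pulled in early. The pattern is illustrative.
import re

# Matches "status page", "statuspage", "Status Page", etc.
STATUS_PAGE_PATTERN = re.compile(r"status\s*page", re.IGNORECASE)

def mentions_status_page(message: str) -> bool:
    """True if a message suggests the status page is in play."""
    return bool(STATUS_PAGE_PATTERN.search(message))
```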
Helping to complete reports
After the incident, we want to make sure the report is completed. This typically happens in the days following the incident, and is the responsibility of the incident lead. To help make sure this happens, our bot messages the lead, asking them to complete the report and providing direct links to the form where they can do it. This has drastically improved our reporting, and therefore reduced the need for us to chase people manually.
So that's how we deal with incidents! In a future post we'll share what happens in the days and weeks following to make sure we're using them to maximise learning across the organisation 👩‍🎓