Scaling our on-call process

On-call in engineering organisations is a hot topic. Done well, it ensures our systems are well supported and our on-callers maintain a healthy work-life balance. Done badly, it can lead to poor outcomes for our customers, and be the difference between engineers being happy at work and deciding to work somewhere else.

As we’ve grown at Monzo, we’ve had to keep iterating our on-call processes to keep all of our systems running smoothly, and to live up to our commitment that being on-call doesn’t come at the expense of our engineers' wellbeing. The original process we built for a team of 50 engineers doesn't scale now that we're a team of 150+, so iterating on it is critical.

We first wrote about our on-call process in 2018, and followed this up with another post in 2020. Throw our growing engineering teams, a worldwide pandemic, and our rapidly increasing customer base into the mix, and by early 2021 we decided it was time to make a few changes to how we run on-call. So here we are again in 2022, back with a bang, to share some of the changes we’ve made to on-call over the last year to keep our systems running exactly as our customers expect.

How we used to manage on-call

Our previous model used primary and shadow on-callers, who acted as a first-line group of engineers. This group was the first point of contact whenever there was a problem, and it was their job to either fix the issue or find the team who could.

They were supported by a number of specialist groups of engineers who were there as escalation points. For example, we might need a team with deep specialist knowledge of a specific payments system to help troubleshoot an issue.

Key problems

Our previous on-call model worked very well for us when we were a smaller engineering organisation of around 50 engineers, but we started to see signs that the process was beginning to creak a little as we were scaling to over 100 engineers. 

The primary/shadow model meant that we always had to have a primary team on-call. This caused two problems:

  • The primary rota often added friction between an alert firing and it being addressed by the team with the right context and knowledge to fix it quickly

  • Our Platform team acted as the primary on-caller during office hours, which put undue stress on them as the Monzo product and team grew

This centralised way of managing on-call meant we were quickly heading towards a position where two or three individuals in the organisation were critical to on-call functioning properly. As well as lowering our circus factor, this was time consuming, and on-call was increasingly seen as something centrally managed rather than owned by the teams responsible for their own systems.

As we scaled, the bar for engineers on the primary rota was being raised almost weekly. For every new critical service we introduced into production, we needed to train those engineers on how to handle its alerts, and who to escalate to if things weren’t going well. This in turn meant the barrier to entry for new engineers joining the primary rota was so high that it started to put people off joining, and it became expensive and time consuming to keep the rota fully staffed.

Enter Monzo On-call 3.0

We launched some new changes to our on-call processes in early 2021, and we’ve been keeping a close eye on how things have scaled over the last year. 

Our key focus with the changes was to decentralise on-call and give every team the autonomy to manage being on-call for their systems in a way that works best for them and our customers.

Our Platform teams provide a framework that makes it easy for every team to be on-call for their services. We provide the monitoring, alerting and routing tools required to detect problems and deliver alerts to the teams that own them. As a result, we've removed the previous primary/specialist setup in favour of routing alerts directly to the owning teams, each of which is on-call for the systems they own and run.

At a very high level, here’s how things look today:

A simple flow diagram showing Prometheus firing a critical alert being routed to a specific team
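To make this concrete, here's a minimal sketch of what this kind of label-based routing can look like with Prometheus and Alertmanager. The team names, receivers and integration keys below are hypothetical examples rather than our actual configuration; the idea is simply that each alert carries a team label, and the routing layer uses it to page the owning team directly.

    # Hypothetical Alertmanager routing sketch: alerts labelled with a team
    # are delivered straight to that team's receiver, with a fallback for
    # anything that isn't owned yet.
    route:
      receiver: unowned-alerts
      routes:
        - match:
            team: payments
          receiver: payments-oncall
        - match:
            team: platform
          receiver: platform-oncall

    receivers:
      - name: payments-oncall
        pagerduty_configs:
          - routing_key: <payments-integration-key>
      - name: platform-oncall
        pagerduty_configs:
          - routing_key: <platform-integration-key>
      - name: unowned-alerts
        slack_configs:
          - api_url: <slack-webhook-url>
            channel: '#alerts-unrouted'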

Benefits

Team accountability

All of our teams are now on-call for the things they run. This gives each team the direct feedback they need on the reliability and supportability of their systems, and means that the teams with all of the right context and knowledge are the first responders to an alert relating to that service. 

This also promotes good ownership of problems and incidents within each team’s own domain. As a result, teams are naturally motivated to fix things so their colleagues aren’t paged for a problem overnight.

Human centred on-call

We wrote about the importance of putting humans at the centre of the on-call process in our previous blog post, and it’s something we’ve made sure is front-of-mind during these improvements. Our engineering managers are now responsible for ensuring that the on-call schedules for their teams are well-staffed, and that engineers feel well equipped and supported to deal with any pages. 

They can also tailor their team's schedules and processes so that they flex to changing demands. For example, a team might want to tighten their alerting thresholds during a key product launch. Engineering managers also act as a point of escalation for any problems that directly affect the teams they’re responsible for.
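As a rough illustration of the kind of knob a team controls here, this is a sketch of a Prometheus alerting rule; the metric name, threshold and durations are made up for the example. Around a big launch, a team might lower the threshold or shorten the "for" window so they're paged sooner.

    # Hypothetical alerting rule owned by a single team. The `team` label is
    # what routes the page to them; the threshold and `for:` duration are the
    # knobs a team might tighten around a key launch.
    groups:
      - name: payments-alerts
        rules:
          - alert: PaymentLatencyHigh
            expr: histogram_quantile(0.99, sum(rate(payment_duration_seconds_bucket[5m])) by (le)) > 0.5
            for: 5m
            labels:
              severity: critical
              team: payments
            annotations:
              summary: p99 payment latency is above 500ms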

By nature of the process being closer to teams, managers are in a better position to support on things like:

  • reminding an engineer who’s been paged overnight to take time off in lieu and rest up

  • tailoring an engineer’s workload in-hours for the week they’re on-call, depending on what’s happening

  • making the feedback loop much tighter between an engineer and the manager who owns the schedule, for example if they need help or someone to cover their shift for an evening

A screenshot showing a (redacted) Slack message from a manager to an engineer encouraging them to take time off after an incident

Technical Incident Managers

We took the opportunity during this process to introduce the role of a technical incident manager to support engineers during incidents. They’re a group of engineering leaders from across Monzo who are trained on how to deal with major incidents, are familiar with our incident management process, and have a good network across the organisation to quickly facilitate the flow of information. 


If an engineer is on-call and needs help or guidance, or needs to escalate a problem, our technical incident managers are on hand to jump in and support in whatever way they’re needed. To help with this, we added prompts to our incident tooling, and an easy escalation process to get hold of an incident manager whenever they’re needed, 24 hours a day.

A screenshot of the author of this blog post escalating to an Incident Manager using our incident tooling in Slack

A screenshot of our incident tool UI showing how a user can select to escalate to Incident Managers from a drop-down menu

On-call is a living process

We’re constantly on the lookout for ways to make on-call better, whether that’s responding even more quickly to problems impacting our customers, or improving the experience of engineers joining on-call for the first time. We regularly run retros across Monzo to check the process is running effectively, and we’re always learning, tweaking, and improving our processes to make sure they’re working for our customers, and our engineers.


Let us know what you think of our approach to on-call, and what's working well for you in your own organisations. If you're interested in joining our team of engineers and managers to tackle problems like this, we're hiring!