Accelerating Safeguarding Support for our Most Vulnerable Customers with Machine Learning

At Monzo, customer service often involves helping people with banking queries. But sometimes we have customers going through very challenging moments in their lives that put them at risk of immediate harm. We believe it's essential to support these customers appropriately, and as a regulated bank we're also required to take additional care to meet the needs of consumers at the greatest risk of harm. We do this by designing products and services to recognise and respond to the needs of vulnerable customers.

We have specialists trained to safeguard these customers, but identifying them is difficult; people express vulnerability in a wide variety of ways, and our support teams handle tens of thousands of chats a day. We built an AI system to automatically identify these customers and get them dedicated safeguarding support more quickly.

Early results have been promising - we’re now escalating our most vulnerable customers in a matter of seconds. This gives us greater confidence that our support reaches the people who need it most, and that our specialists spend their time where they can make the biggest difference.

In this post, we share how we used large language models (LLMs) to tackle this problem, and how we rolled this out safely: redesigning our backend events to capture richer signals, running the model in shadow mode against live traffic before we actioned any predictions, and rolling out gradually.

Getting the right data

Machine learning models are only as good as the data they are trained on, and this also applies to approaches using LLMs. We had to address numerous challenges in our data, the first of which was a lack of sufficient, high-quality examples of cases where customers were facing risks of immediate harm. We also didn’t have a reliable way of gathering ground truth, which would tell us how many manually escalated cases were genuine, and how many customers were not getting the support they needed.

Furthermore, cases where our customers are facing immediate risks of harm are very rare, representing <0.1% of all chat messages our customers send. This can pose challenges for machine learning models, as there may not be enough examples for the models to learn patterns from. 

To address these issues, we worked closely with our specialists to iteratively curate a few labelled datasets for us to train and validate our model:

  • The "Golden Dataset": We worked with our specialists to manually review a dataset containing a higher concentration of real-life examples of risk. Although this concentration was not representative of reality, we needed to oversample such cases to help the model learn the varied ways in which customers express risks of immediate harm.

  • A "Real World" test: We took a typical day of chats to see how the model would perform in a more realistic setting. This was crucial because the golden dataset didn't reflect the 'noise' of a standard day, so we needed to ensure that the model would not flag too many false positives in a real world setting.

  • A "Forecast Set": We looked at several weeks of recent data to understand how many cases the model would flag to our specialist teams. This helped us make sure that we would have enough specialists on hand when we implemented the model.

We also established an ongoing process to label a daily sample of data that serves as our reliable 'ground truth' for continuously monitoring the performance of our model. This ensures that we can quickly identify if the model is not performing as expected, and improve the model using data informed by our specialists’ expertise.
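Here's a simplified sketch of what that daily check could look like: compare the model's flags against the specialists' labels for the day's sample and track precision and recall. The LabelledMessage structure and metric names are illustrative; in practice the labels come from our specialists and the predictions from our production logs.

```python
from dataclasses import dataclass
from sklearn.metrics import precision_score, recall_score

@dataclass
class LabelledMessage:
    message_id: str
    is_genuine_risk: bool   # the specialist's judgement (our ground truth)
    model_flagged: bool     # what the model predicted for this message

def daily_performance(sample: list[LabelledMessage]) -> dict[str, float]:
    """Compute precision and recall of the model against specialist labels
    for one day's labelled sample."""
    y_true = [m.is_genuine_risk for m in sample]
    y_pred = [m.model_flagged for m in sample]
    return {
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
    }
```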

Training the best model for the job

We experimented with several approaches, including traditional methods like keyword detection, statistical representations like TF-IDF, and text embeddings. However, we found that LLMs were the clear winner. Traditional methods didn’t perform as well because they lacked the generalisation and contextual understanding necessary to identify the subtle ways distress is expressed.

For example, a keyword-based system might catch specific phrases but lacks the ability to understand context. Text embeddings capture context, but may miss a customer describing their situation indirectly or using euphemisms. Because LLMs are pre-trained on vast amounts of human language, they understand subtext and nuance. This makes them more effective at identifying the wide variety of ways in which people express vulnerability. 
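As a rough illustration of the LLM approach, here's a minimal sketch of classifying a single chat message. The prompt wording, model name and use of the OpenAI client are assumptions made for the sake of the example; they don't describe the exact setup we run in production.

```python
import json
from openai import OpenAI  # illustrative client; any LLM provider could sit here

client = OpenAI()

SYSTEM_PROMPT = """You help a bank's support team triage chat messages.
Decide whether the customer appears to be at risk of immediate harm.
Subtle or indirect expressions of distress count; routine banking queries do not.
Reply with JSON: {"at_risk": true or false, "reasoning": "<one sentence>"}."""

def classify_message(message: str) -> dict:
    """Ask the LLM whether a chat message suggests a risk of immediate harm."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": message},
        ],
        response_format={"type": "json_object"},  # ask for machine-readable output
        temperature=0,  # keep the decision as repeatable as possible
    )
    return json.loads(response.choices[0].message.content)
```

A keyword list can't do this: the model can weigh the whole message, including indirect or euphemistic phrasing, and explain its reasoning alongside the decision.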

However, LLMs can still misinterpret what people say, especially because risks of harm can be expressed in unique and complex ways. To mitigate this, we continuously worked with our specialists to verify ambiguous cases that we found while testing the model, and to improve the range of examples provided to the model. Throughout testing and validation, we prioritised minimising false positives to make sure the system only flags genuine emergencies. This means our specialists’ expertise is saved for the customers who need it most. 

Like any model, LLMs also reflect biases in their training data, which means they may miss vulnerabilities that are less common. To ensure we identify and support as many customers as we can, we continue to allow our customer service teams to manually escalate any cases that slip past the model.

Getting the model into production 

With the model in place, we had to build the software to automatically detect and escalate our most vulnerable customers to specialist safeguarding teams. The system needed to analyse incoming chats, run them through the model, and escalate vulnerable customers to the correct team immediately.

The rollout also needed to be safe: we didn’t want to end up in a situation where customers at risk of harm weren’t able to get the support they needed because the model had overwhelmed our specialist teams with too many false positives. We also needed to collect data on automatic and manual escalations to monitor whether the model was escalating the right customers. Here’s how we did that.

Shadow mode: Testing the model without affecting customers

One of the first things we introduced once the model was ready was “shadow mode”: a way to run the model on real conversations without it actually triggering any escalations. Doing this enabled us to:

  1. Forecast the volume of expected escalations against live traffic, to understand the impact on operational capacity

  2. Resolve any issues related to invoking the model, such as latency and resilience

  3. Improve the performance of the model in the real world by working with our specialists to understand where it was going wrong

All this happened before we launched the automatic escalation capability, hugely de-risking the launch and avoiding a big-bang release. It required close collaboration across disciplines, embodying Monzo’s value of 'Help everyone belong'.
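In outline, shadow mode looked something like the sketch below: every incoming message is classified and the result is recorded for analysis, but nothing is escalated while the shadow flag is on. The names and structure are illustrative, not our production code; `classify` could be the kind of LLM call sketched earlier.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable

@dataclass
class EscalationPrediction:
    message_id: str
    at_risk: bool
    reasoning: str
    occurred_at: datetime

def handle_incoming_message(
    message_id: str,
    text: str,
    classify: Callable[[str], dict],                     # e.g. the LLM call sketched earlier
    emit_event: Callable[[EscalationPrediction], None],  # analytics sink feeding the dashboards
    escalate: Callable[[EscalationPrediction], None],    # routes the chat to safeguarding specialists
    shadow_mode: bool = True,
) -> None:
    """Run the model on an incoming chat message. In shadow mode the prediction
    is recorded for analysis but never acted on."""
    result = classify(text)
    prediction = EscalationPrediction(
        message_id=message_id,
        at_risk=bool(result["at_risk"]),
        reasoning=result.get("reasoning", ""),
        occurred_at=datetime.now(timezone.utc),
    )
    emit_event(prediction)  # always log, so we can compare with manual escalations

    if prediction.at_risk and not shadow_mode:
        escalate(prediction)
```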

Unblocking analytics before launch

We redesigned and expanded how we collect data on these escalations so that we could see how the system was performing, covering both cases where we already manually escalate customers and cases where the model would have automatically escalated them. This enabled our analytics engineers to build the necessary pipelines and dashboards even before automatic escalation was launched, and gave our data pipelines a cleaner, more consistent event structure to build on.
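The key idea was a single, consistent event shape for both manual and model-driven escalations. A sketch of what such an event could look like is below; the field names are illustrative rather than our actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class EscalationTrigger(str, Enum):
    MANUAL = "manual"        # raised by a customer support agent
    AUTOMATIC = "automatic"  # raised (or would have been raised) by the model

@dataclass
class SafeguardingEscalationEvent:
    conversation_id: str
    trigger: EscalationTrigger
    model_flagged: bool       # did the model consider this a risk of immediate harm?
    shadow_mode: bool         # was the model only observing at the time?
    reasoning: str | None     # the model's explanation, if any
    occurred_at: datetime
```

Because every escalation, manual or automatic, lands in the same structure, comparing the model's predictions with manual escalations becomes a straightforward query rather than a stitching exercise.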

As a result, before we'd even launched automatic escalations, we could see the expected volume of escalations, compare the model’s predictions with manual escalations, and track its performance over time. This gave us and operational teams the confidence to proceed. By identifying owners, measuring, monitoring, and continuously reviewing the performance, we demonstrated our commitment to one of Monzo’s values - 'Think big, start small, own it'.

Making automatic escalations clear and safe

Finally, we introduced the actual automatic escalations. We didn’t want to negatively impact the experience for our customer support team, so we needed to make it clear exactly why customers were being escalated (the reasoning behind the model’s decision) and what triggered the escalations. We introduced changes to our existing internal tools to achieve this.

We also used a feature flag to give us fine-grained control over the rollout of automatic escalations. This meant we could gradually roll it out and closely monitor for operational impact. If we spotted any issues with the model or software bugs, we had a simple kill switch to return to how things worked before.
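As a rough illustration of the pattern, a percentage-based rollout with a kill switch can look like the sketch below. The flag names and hashing scheme are illustrative rather than how our internal feature-flag tooling works.

```python
import hashlib

# Illustrative flag values; in practice these come from a feature-flag service,
# so they can be changed without a deploy.
AUTOMATIC_ESCALATIONS_ENABLED = True   # the kill switch
ROLLOUT_PERCENTAGE = 25                # gradually increased towards 100

def should_auto_escalate(conversation_id: str) -> bool:
    """Decide whether this conversation is in the rollout cohort.
    Hashing the ID keeps the decision stable for a given conversation."""
    if not AUTOMATIC_ESCALATIONS_ENABLED:
        return False
    bucket = int(hashlib.sha256(conversation_id.encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENTAGE
```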

Throughout the development, we worked iteratively with Product, Design and Operations teams to reduce complexity for our customer support team, and integrate the model into our existing tools so that our Monzonauts knew exactly what to expect. 

We rolled out gradually over a few weeks - shadow mode and early analytics meant we'd already ironed out model issues, built the dashboards, and given operational teams the confidence to proceed, proving that the careful, iterative approach was the right one. This means customers in crisis are now getting to safeguarding specialists more quickly, upholding Monzo’s value of 'Make a difference'.

What we learned along the way

  • Get alignment early: Getting everyone - engineers, operations, risk, and compliance - on the same page from the start is crucial. Proactively managing the risks and challenges involved in developing such a sensitive feature ensured that we were able to ship a reliable and impactful change for our most vulnerable customers.

  • Adapting to new ML tools: Using LLMs brought new challenges because they’re less deterministic than traditional machine learning models. Evaluation is also more subjective, as we had to rely on the model’s reasoning to understand why it produced certain results. We adapted our approaches to monitoring, testing and evaluating the model to handle these new challenges, ensuring the same level of reliability we’d expect from any ML system.

  • Managing risks through iterative rollout: Safety was a priority for rolling out a sensitive feature like this. Shadow mode, early analytics and feature flags helped us manage the risks and gave us the confidence we needed to ship incrementally.

As a result of this change, customers in crisis are referred to the specialist safeguarding team within seconds. This is more than 10 times faster than before, ensuring that our most vulnerable customers are safeguarded appropriately. It’s also just one part of Monzo’s broader efforts to enhance how we support vulnerable customers.


Interested in a career at Monzo?

If what you’ve read here resonates with you, we’re hiring for Machine Learning Scientists, Data Scientists, Software Engineers, and many more across Monzo! Take a look at our careers page to see if we have the right role for you.
