Engineering the Future of Customer Operations: The Monzo Ops Agent

As we wrote in a recent blog post, machine learning is a core component of Monzo's product. It drives critical decision-making across fraud prevention, credit, and personalisation, while maintaining high validation standards and achieving positive customer outcomes expected of us.

Building on these foundations, we have recently begun exploring how Generative AI can elevate the customer experience, including by helping our teams work more efficiently behind the scenes. While this technology is being applied across several areas of the bank, in this post, we will focus specifically on its impact in customer operations, a domain spanning customer support and workflows for reporting and investigating fraud and financial crime.

Effectively addressing customer queries in the moment means pulling together several things at once: what we know about the customer, our product knowledge, real-time context, and sound judgment. Anything customer-facing in a heavily regulated business means reliability and measurability aren't optional; they must be built in from the start.

This post explains how we engineered our LLM-based agent, built rigorous evaluation systems for safe improvement, and are shifting from simply answering questions to automating complex, end-to-end operational workflows, with each step building on the last. Let’s begin with the motivation for automating customer operations.

Why automate customer operations

Customer operations involve diverse workflows, from routine requests like managing a Pot to high-urgency events like a stolen wallet, to sensitive cases involving debt repayment. Customers reach out to us for more than 150 different problems, which we internally refer to as (customer) intents. The complexity varies significantly: some intents can be resolved in minutes, while others require investigations spanning several weeks.

Customer expectations for support from Monzo have continued to evolve. They expect service that is fast, frictionless, and allows them to resolve simple issues independently. Customers increasingly prefer chat-based interactions for quick resolution of day-to-day support needs, appreciating the asynchronous format that allows problems to be solved in the background without demanding their undivided attention. However, they still expect access to human support when situations become more complex or sensitive.

This creates a clear opportunity for LLM-based systems to improve both customer experience and operational efficiency. Automation can help simple workflows run more quickly, while giving customer operations teams more time for higher-complexity cases where human judgment, context, and empathy matter most.

Our approach does not aim to replace humans outright. Instead, we have engineered an agent that uniquely blends automation with human oversight, enabling faster responses while upholding the quality and reliability our customers expect, and providing human judgment when it’s essential.

Operations Agent V1: The Foundation

Our first Ops Agent was built through an iterative, evaluation-backed development cycle. From the beginning, the architecture was designed to keep the system grounded in knowledge that is verified by subject matter experts and to escalate when confidence was not high enough:

  • Input and Output Guardrails: Guardrails detect hallucinations or non-compliant input and output. If any guardrail is tripped, the system hands the conversation to a human specialist.

  • Triage and Action Logic: A specific prompt determines the next logical step: responding, handing off, or closing the case.

  • Answer Generation: A generation prompt drafts answers based on a knowledge base of documents vetted by subject matter experts. The answers are also written in the tone of voice we expect to maintain when responding to our customers.
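The three components above can be sketched as a single routing function. This is a minimal illustration, not the production implementation: `handle_message`, `Action`, `Turn`, and the four callables are hypothetical names, and each callable stands in for a prompt or guardrail model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Action(Enum):
    RESPOND = "respond"
    HAND_OFF = "hand_off"
    CLOSE = "close"


@dataclass
class Turn:
    action: Action
    reply: Optional[str] = None


def handle_message(
    message: str,
    input_guardrail: Callable[[str], bool],
    triage: Callable[[str], Action],
    generate: Callable[[str], str],
    output_guardrail: Callable[[str, str], bool],
) -> Turn:
    """Route one customer message through the pipeline (illustrative only)."""
    # A tripped input guardrail hands the conversation to a human specialist.
    if not input_guardrail(message):
        return Turn(Action.HAND_OFF)

    # Triage decides whether to respond, hand off, or close the case.
    action = triage(message)
    if action is not Action.RESPOND:
        return Turn(action)

    draft = generate(message)
    # A tripped output guardrail also results in a handoff.
    if not output_guardrail(message, draft):
        return Turn(Action.HAND_OFF)
    return Turn(Action.RESPOND, draft)
```

Passing the components in as callables keeps each one independently testable, which matters once evaluation is split by component.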

Evaluating conversational agents is difficult, especially in a domain where the right answer depends on customer context, policy, and the state of the conversation. We prioritised validation by establishing a ‘golden set’ of 100 conversations that subject matter experts had approved as meeting our high standards for good resolutions. A set of 100 conversations is too small to represent every conversation the agent could encounter, but even this relatively small data set gave us useful feedback for iterating on the agent. By replaying these conversations with the agent message-by-message, we could measure its performance against a human benchmark in a repeatable way. After this phase, we expanded the offline evaluation to replay many daily conversations. We also had subject matter experts run through scenarios to evaluate the agent on novel queries.
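A message-by-message replay might look roughly like the sketch below. It assumes each golden conversation is stored as pairs of customer message and approved specialist reply, and that a `judge` callable decides whether the agent's reply is acceptable; `replay`, `agent`, and `judge` are hypothetical names, and feeding the approved transcript back as history is an assumption rather than a detail stated above.

```python
from typing import Callable, Sequence

# Each conversation is a sequence of (customer_message, approved_reply) pairs.
Conversation = Sequence[tuple[str, str]]


def replay(
    conversation: Conversation,
    agent: Callable[[list[str], str], str],
    judge: Callable[[str, str], bool],
) -> float:
    """Return the fraction of turns where the agent's reply passes the judge."""
    history: list[str] = []
    passed = 0
    for customer_message, approved_reply in conversation:
        candidate = agent(history, customer_message)
        passed += judge(candidate, approved_reply)
        # Continue from the approved transcript, not the agent's own reply,
        # so every turn is scored against the same context on every run.
        history += [customer_message, approved_reply]
    return passed / len(conversation) if conversation else 0.0
```

Anchoring the history to the approved transcript keeps runs comparable, at the cost of never testing how the agent recovers from its own earlier replies.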

Through this phase of offline evaluation, we made several data-driven optimisations:

  • Model Selection: We found that non-thinking models were as effective for triage as more expensive, slower alternatives, allowing for higher efficiency without sacrificing accuracy.

  • Contextual Enrichment: We moved beyond static articles, extracting facts from historical conversations to give the agent a more nuanced understanding of Monzo’s specific processes.

  • Self-Correcting Guardrails: We implemented a feedback loop that instructs the agent to regenerate an answer if a hallucination is detected, reducing unnecessary handovers while maintaining quality.
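The self-correcting loop can be illustrated with a small retry function. This is a sketch under assumptions: `generate_with_retry`, the shape of the detector's feedback, and the retry budget of two attempts are all hypothetical.

```python
from typing import Callable, Optional


def generate_with_retry(
    question: str,
    generate: Callable[[str, Optional[str]], str],
    detect_hallucination: Callable[[str], Optional[str]],
    max_attempts: int = 2,
) -> Optional[str]:
    """Return a draft that passes the guardrail, or None to signal a handover."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        draft = generate(question, feedback)
        # The detector returns None when the draft is clean, otherwise a
        # description of the problem that is fed into the next attempt.
        feedback = detect_hallucination(draft)
        if feedback is None:
            return draft
    # Still failing after the retry budget: hand over to a human specialist.
    return None
```

Bounding the number of attempts keeps latency predictable and preserves the handover as the final safety net.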

We then launched the agent to production, minimising risk to customers by having a human review every message before it went out. This served a dual purpose: it provided more feedback for us to iterate on the agent, and it allowed us to track its safety and effectiveness in real customer conversations without the risk of incorrect or unhelpful customer outcomes. Once we established that the agent was safe and provided value, we enabled it to speak to customers directly. From that point, we continued to perform QA on a sample of conversations to flag any potential issues. Initially we sampled all conversations for QA, and we decreased the sample rate over time as we gained confidence in the agent.
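One way to implement a QA sample rate that can be lowered over time is deterministic, hash-based sampling. The sketch below is an assumption about how such sampling could work, not a description of the actual QA tooling; `selected_for_qa` and the conversation ID format are hypothetical.

```python
import hashlib


def selected_for_qa(conversation_id: str, sample_rate: float) -> bool:
    """Decide deterministically whether a conversation is sampled for QA."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    # Map the ID to a stable 64-bit bucket, then compare against the rate.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64
```

With `sample_rate=1.0` every conversation is reviewed. Because the bucket for a given ID never changes, lowering the rate shrinks the sample to a subset of what was previously reviewed.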

Figure: High-level architecture of the agent. A customer message passes through input guardrails, then the agent decides on an action and generates an answer, which is checked by output guardrails.

Scaling Through Observability and Validation

Following the initial rollout, we focused on increasing our resolution rate and coverage (the breadth of queries the agent could handle) while preserving a high bar for reliability. To do this, we moved to a layered evaluation stack that catches issues at different levels of the system:

  1. Component-level evals: Functioning like unit tests for individual prompts and guardrails.

  2. Answer-generation evals: Measuring the retrieval pipeline using semantic similarity and LLM-based judges.

  3. End-to-end evals: Testing the entire system against anonymised, real-world customer conversations.
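To make the first layer concrete, a component-level eval can be as simple as a table of labelled cases run against one component. The example below is a sketch: `run_component_eval` is a hypothetical helper, and `toy_triage` is a keyword-based stand-in for a real triage prompt, with made-up cases.

```python
from typing import Callable, Iterable


def run_component_eval(
    component: Callable[[str], str],
    cases: Iterable[tuple[str, str]],
) -> list[str]:
    """Run labelled cases through one component and describe each failure."""
    failures = []
    for text, expected in cases:
        actual = component(text)
        if actual != expected:
            failures.append(f"{text!r}: expected {expected}, got {actual}")
    return failures


def toy_triage(message: str) -> str:
    # Hypothetical stand-in for the triage prompt; a real component calls a model.
    return "hand_off" if "fraud" in message.lower() else "respond"


TRIAGE_CASES = [
    ("How do I rename a Pot?", "respond"),
    ("I think I'm a victim of fraud", "hand_off"),
]
```

An empty result from `run_component_eval(toy_triage, TRIAGE_CASES)` means every case passed, so the check can gate a prompt change in the same way a unit test gates a code change.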

We also established a tight feedback loop for our knowledge base. Any QA failure or low satisfaction score was triaged quickly; if the root cause was a knowledge gap, the relevant article was updated within hours. This prevented inaccuracies from compounding over time and helped the agent improve through structured operational feedback.

We introduced intent-gated customer context, such as transaction history, so the model could use relevant information without being overloaded by unnecessary data. This led to a 10 percentage point improvement in resolution for transaction queries.
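Intent gating can be pictured as an allow-list that maps each intent to the context sources it may load. The intent names, source names, and `build_context` function below are hypothetical and only illustrate the idea.

```python
from typing import Any, Callable

# Hypothetical allow-list: which context sources each intent may load.
CONTEXT_BY_INTENT: dict[str, list[str]] = {
    "transaction_query": ["transaction_history"],
    "card_replacement": ["card_status"],
}


def build_context(
    intent: str,
    customer_id: str,
    loaders: dict[str, Callable[[str], Any]],
) -> dict[str, Any]:
    """Load only the context sources that are enabled for this intent."""
    allowed = CONTEXT_BY_INTENT.get(intent, [])
    # Unknown intents get no extra context rather than everything.
    return {name: loaders[name](customer_id) for name in allowed if name in loaders}
```

Defaulting to an empty context for unlisted intents keeps the prompt small and avoids exposing data the query does not need.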

We also learned the importance of system-level validation. A change that improves one component can still shift failure modes elsewhere in the workflow. By evaluating the system as a whole, we made sure improvements in efficiency translated into better customer outcomes, not just better local metrics.

Process-Following and Tool Orchestration

So far, the agent we have described primarily answers questions. That is an important part of customer support, but customer operations also require specialists to follow defined processes, inspect customer accounts, and take actions such as replacing cards, reporting fraud, or setting up direct debits.

The first version of the Ops Agent had two key limitations. Help articles provided context on how customers use Monzo, but not enough about internal support processes. It also could not dynamically read information or take actions on a customer’s account. To address this, we extended the Ops Agent to support process following and tool calls.

For process following, we took inspiration from the Agent Skills standard. A process is a human-readable set of instructions written in Markdown, with metadata about when to use it and references to the tools the agent should call.
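As an illustration of what such a process file could look like, the sketch below parses a Markdown document with a small metadata header. The header fields (`name`, `use_when`, `tools`), the tool names, and the process text are all hypothetical, and the hand-rolled parser only handles simple `key: value` lines.

```python
from dataclasses import dataclass

# Hypothetical process file: metadata header followed by Markdown instructions.
PROCESS_DOC = """\
---
name: find-missing-refund
use_when: The customer says an expected refund has not arrived.
tools: list_transactions, get_refund_status
---
1. Ask which merchant the refund is from.
2. Look for a matching refund in recent transactions.
"""


@dataclass
class Process:
    name: str
    use_when: str
    tools: list[str]
    instructions: str


def parse_process(doc: str) -> Process:
    """Split a process file into its metadata header and Markdown body."""
    _, header, body = doc.split("---\n", 2)
    meta = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return Process(
        name=meta["name"],
        use_when=meta["use_when"],
        tools=[tool.strip() for tool in meta["tools"].split(",")],
        instructions=body.strip(),
    )
```

Keeping the instructions as plain Markdown means subject matter experts can read and edit a process without touching code, while the metadata gives the agent enough to decide when the process applies.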

When a customer sends a message, the agent checks whether it should follow one of the pre-defined processes. If so, it pulls the relevant process into context, generates a plan based on the customer’s situation, and updates that plan as tasks are completed or new information becomes available.
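The plan itself can be modelled as a small piece of state that tracks pending and completed tasks. This is a sketch with hypothetical names (`Plan`, `complete`, `revise`); in practice the task list would be produced by the model from the process and the customer's situation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Plan:
    pending: list[str] = field(default_factory=list)
    done: list[str] = field(default_factory=list)

    @property
    def next_task(self) -> Optional[str]:
        return self.pending[0] if self.pending else None

    def complete(self, task: str) -> None:
        """Move a finished task from pending to done."""
        self.pending.remove(task)
        self.done.append(task)

    def revise(self, new_tasks: list[str]) -> None:
        """Replace the remaining tasks when new information arrives."""
        # Tasks that are already done are not scheduled again.
        self.pending = [task for task in new_tasks if task not in self.done]
```

For example, if a customer mentions partway through that they have moved house, `revise` can insert an address check without repeating the steps already completed.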

This moves the system from passive information retrieval towards active process orchestration. The agent is no longer only deciding what to say; it is learning how to execute operational workflows.

To support this, we overhauled our evaluation system. These conversations are longer and more complex, so the agent often deviates from the original conversations in our test set even when it reaches the same outcome. To handle this, we started simulating users in evals using LLMs grounded in real conversations.

Tool calls added another layer of complexity because actions can change state, affecting future tool responses. Instead of relying on brittle mocks, we built a simulated environment where tool calls operate against realistic state. For example, blocking a card updates its status so future tool calls receive accurate responses.
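The difference from a static mock is that the simulated tools share state. A minimal sketch of the card example, with a hypothetical `SimulatedCards` class and tool names, could look like this:

```python
class SimulatedCards:
    """In-memory stand-in for card tools; state persists across tool calls."""

    def __init__(self, cards: dict[str, str]):
        # Maps card_id to a status such as "active" or "blocked".
        self._status = dict(cards)

    def get_card_status(self, card_id: str) -> str:
        return self._status[card_id]

    def block_card(self, card_id: str) -> str:
        # A write tool: later reads observe the new status.
        self._status[card_id] = "blocked"
        return "ok"

    def authorise_payment(self, card_id: str) -> str:
        # Responses depend on the current state, not on a canned value.
        return "declined" if self._status[card_id] == "blocked" else "approved"
```

A canned mock would keep returning "active" after `block_card` was called, which could make a correct agent look wrong or hide a real mistake; the shared state avoids both.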

We validated this approach by shipping a production workflow for finding a missing refund. We found that the agent could resolve many of these conversations end-to-end, without involving a human specialist. Since then we’ve moved on to a workflow for ordering replacement cards for customers who have been victims of fraud, have a broken card, or have simply lost their card. This workflow involves taking actions, rather than just reading data.

These workflows are just the beginning. By using the agent along with deterministic workflows, we are progressively automating increasingly complex customer operations work.

To that end, the next challenge we’re looking to solve is scale: identifying the right processes to automate, implementing the required tools, and evaluating workflows end-to-end. The first processes required significant manual effort, but we are now focused on streamlining and automating more of this development path.

This is the transition from answering questions to executing operational work. The long-term opportunity is not just faster support, but systems that can reliably handle more of the operational complexity behind customer interactions.

Conclusion

We have laid out our journey and learnings as we have gradually advanced from question answering to orchestrating processes. Our system is now capable of navigating the non-linear reality of customer conversations while maintaining a high level of precision and ensuring good outcomes.

Robust validation mechanisms have been central to our strategy for scaling safely. We have relied on subject matter expertise to ground LLMs in verified facts, building scalable and repeatable evaluations alongside human-in-the-loop validation before releasing agents into production. These technical foundations create a continuous feedback loop that assesses and improves the quality of service we provide to our customers.

The road ahead focuses on the long tail of operational complexity. By scaling our process-following framework across the hundreds of specialised intents handled by our teams, we are closing the gap between intent and action. We are moving towards a model where the system handles the heavy lifting of process execution, allowing specialists to focus their expertise on nuanced cases that truly require human judgement.

It is still early, but the direction is clear and our trajectory is strong: customer operations will become faster, more consistent, and more intelligent, while retaining care and judgement when it matters most.

Interested in a career at Monzo?

If what you’ve read here resonates and you’re passionate about making money work for everyone, we’re hiring machine learning engineers, data analysts, backend engineers, and many more roles across Monzo! Take a look at our careers page to see if we have the right role for you.
