Fighting fraud in banking is an ongoing and adversarial problem. Fraud typologies evolve continuously, take many forms, and adapt quickly to detection and prevention mechanisms. The consequences are not evenly distributed. Financial loss disproportionately affects customers who have less financial resilience; for some, a single fraudulent transaction can materially disrupt their ability to meet essential expenses. Protecting customers from fraud is therefore not only a technical challenge but a matter of financial stability and trust.
Fraud is not a single problem but a collection of behaviours, ranging from organised scams to opportunistic account misuse. As patterns become detectable, they shift, and new typologies emerge. Signals that were previously predictive can lose effectiveness. This dynamic environment increases the risk that emerging behaviours will go undetected, with the greatest harm typically borne by those least able to absorb losses.
In this post, we describe the modelling formulation we have used for fraud detection and present an extension based on learning a shared fraud representation using multi-task neural networks. We evaluate this framework on unauthorised card fraud and analyse its ability to generalise to rare and previously unseen behaviours.
Where we started
Fraud detection is typically formulated as a supervised machine learning problem. Given a transaction and its associated features - for example, behavioural, transactional, and contextual signals - the objective is to predict whether the transaction is fraudulent. In its standard formulation, this is a binary classification task.
These problems have distinctive statistical properties. Fraud is rare, accounting for fewer than one in ten thousand transactions, and labels may be delayed or only partially observed. The dataset is therefore highly imbalanced, with very few positive examples relative to legitimate activity. This imbalance affects optimisation, calibration, and evaluation, particularly when recall on rare subtypes is operationally important.
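To make the imbalance concrete, here is a minimal numpy sketch of class weighting in a binary cross-entropy loss. The data, scores, and the negative-to-positive weighting heuristic are all illustrative, not our production setup:

```python
import numpy as np

# Illustrative only: a synthetic dataset with a roughly 1-in-1000 positive rate.
rng = np.random.default_rng(0)
y = (rng.random(100_000) < 0.001).astype(float)           # true labels
p = np.clip(rng.random(100_000) * 0.01, 1e-7, 1 - 1e-7)   # model scores

def weighted_log_loss(y_true, y_pred, pos_weight):
    """Binary cross-entropy with the positive class up-weighted by pos_weight."""
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
    loss = -(pos_weight * y_true * np.log(y_pred)
             + (1 - y_true) * np.log(1 - y_pred))
    return loss.mean()

# A common heuristic: weight positives by the negative/positive count ratio,
# so each class contributes comparably to the loss.
pos_weight = (y == 0).sum() / max((y == 1).sum(), 1)
loss = weighted_log_loss(y, p, pos_weight)
```

Boosting libraries expose the same idea directly, for example via LightGBM's `scale_pos_weight` parameter.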
For tabular datasets with these characteristics, the state of the art is typically tree-based gradient boosting methods such as LightGBM or XGBoost. These models handle heterogeneous feature types effectively and capture non-linear feature interactions without extensive preprocessing. In imbalanced settings, boosting methods can incorporate class weighting directly into the loss and iteratively focus on hard or misclassified examples. Their partitioning of feature space allows them to isolate small, high-risk regions associated with rare positives, rather than relying on a single global decision boundary. They are also computationally efficient and well-suited to low-latency inference in production systems.
Despite these strengths, this modelling approach has clear structural limits. Fraud is not only rare in aggregate; individual fraud subtypes are often orders of magnitude rarer still. This makes purely subtype-specific modelling brittle. When sufficient labelled data exists, highly targeted models can perform well, but they are tightly coupled to historical patterns. When behaviour shifts, their performance degrades.
At the other extreme, aggregating many behaviours into a single classifier increases data volume, but it introduces a different failure mode: the model is dominated by more frequent patterns and allocates limited capacity to rare or emerging behaviours. In practice, this trade-off constrains generalisation.
Within Fincrime at Monzo, rather than modelling individual fraud types in isolation, we asked whether we could learn a shared representation of fraud: structure that transfers across behaviours and improves robustness to new or shifting patterns. This approach is not limited to financial crime use-cases. Having proven the application of multi-task learning to fraud, we plan to extend it to other domains characterised by heterogeneous positive classes and distributional shift.
Learning a Shared Fraud Representation

In fraud detection, the positive class is heterogeneous. Different fraud subtypes exhibit partially overlapping but non-identical patterns. Modelling them independently risks overfitting to subtype-specific signals. Modelling them jointly as a single label risks diluting rare but important behaviours. The core challenge is therefore representational: how do we learn structure that is shared across behaviours without collapsing their differences?
Multi-task learning provides a principled answer. Rather than training a model on a single objective - such as a binary fraud label - we train it to optimise multiple related objectives simultaneously. These tasks may correspond to different fraud subtypes, related risk indicators, or auxiliary signals derived from the same transactional data.
Architecturally, a multi-task model consists of a shared neural network backbone that learns a common representation from the input features. On top of this backbone, task-specific heads produce predictions for each objective. The model is trained jointly, with gradients from all tasks updating the shared layers.
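The shape of this architecture can be sketched in a few lines of PyTorch. This is a minimal illustration, not our production model: the layer sizes, task names, and synthetic data are invented for the example.

```python
import torch
import torch.nn as nn

class MultiTaskFraudModel(nn.Module):
    """Shared backbone with one binary head per task (illustrative sketch)."""
    def __init__(self, n_features, task_names, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden, 1) for name in task_names}
        )

    def forward(self, x):
        shared = self.backbone(x)  # common representation for all tasks
        return {name: head(shared).squeeze(-1) for name, head in self.heads.items()}

# Hypothetical task names for illustration.
tasks = ["card_fraud", "account_takeover", "scam"]
model = MultiTaskFraudModel(n_features=16, task_names=tasks)

# One joint training step on a synthetic batch: gradients from every
# task-specific loss flow back into the shared backbone.
x = torch.randn(32, 16)
labels = {t: torch.randint(0, 2, (32,)).float() for t in tasks}
loss_fn = nn.BCEWithLogitsLoss()
logits = model(x)
loss = sum(loss_fn(logits[t], labels[t]) for t in tasks)
loss.backward()  # shared layers receive gradients from all heads
```

The per-task losses are simply summed here; in practice the weighting of tasks is itself a design choice.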
This structure introduces an inductive bias towards shared structure. When tasks share underlying drivers, joint optimisation encourages the backbone to learn features useful across behaviours rather than narrowly specialised correlations. Rare subtypes benefit from shared statistical strength, while common patterns stabilise the representation. The model becomes less dependent on the precise fingerprint of any single behaviour.
In effect, multi-task learning shifts the focus from memorising historical patterns to learning transferable structure. This is precisely the property required in environments characterised by heterogeneous positives and distributional shift.
Results on fraud detection
We evaluated this approach on unauthorised card fraud: a sub-type of fraud in which a person’s credit or debit card information is used to make transactions, purchases, or withdrawals without the owner's knowledge or consent. It is an ideal test-bed for multi-task learning because it contains multiple distinct behaviours and a mixture of common and extremely rare subtypes.
We compared:
A strong LightGBM baseline (the previous production model)
A single-task neural network trained solely on the fraud label
A multi-task neural network with a shared backbone and multiple task-specific heads
The results are shown in Figure 2, below. The single-task neural network underperformed the LightGBM model, achieving 0.92x the performance of the production baseline. This is consistent with the strong performance of gradient boosting on tabular data.
In contrast, the multi-task model achieved higher performance at comparable levels of customer impact. For some fraud types, particularly rarer subtypes, we observed relative improvements of approximately 30% over the LightGBM baseline. The gap between the single-task and multi-task networks shows that the improvements arise from the multi-task structure and representation learning, rather than from deep learning alone.
We also conducted a more adversarial evaluation. An entire fraud subtype was held out during training, and models were evaluated on this unseen behaviour.
The multi-task model generalised better than both the tree-based model and the single-task neural network. Although performance degraded for all models relative to in-distribution evaluation, the multi-task approach retained substantially higher recall.
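The held-out evaluation protocol itself is simple. Below is a small sketch of the splitting and recall logic, with invented subtype names; our actual evaluation pipeline is more involved:

```python
import numpy as np

def leave_subtype_out(subtype_labels, held_out):
    """Split indices so the held-out subtype never appears in training."""
    subtype_labels = np.asarray(subtype_labels)
    train_idx = np.flatnonzero(subtype_labels != held_out)
    test_idx = np.flatnonzero(subtype_labels == held_out)
    return train_idx, test_idx

def recall(y_true, y_pred):
    """Fraction of true positives the model recovered."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    return tp / max(np.sum(y_true == 1), 1)

# Hypothetical subtype tags attached to labelled fraud cases.
tags = ["skimming", "skimming", "cnp", "cnp", "cnp", "mail_theft"]
train_idx, test_idx = leave_subtype_out(tags, held_out="mail_theft")
```

Training proceeds on `train_idx` only, so any recall measured on `test_idx` reflects generalisation to a behaviour the model has never seen labelled.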
This result supports the underlying hypothesis. If the shared backbone is learning transferable structure rather than subtype-specific signatures, it should perform better when exposed to behaviours it has not explicitly observed. Detecting new scams before substantial labelled data accumulates requires precisely this ability to generalise beyond historical patterns.

Future
Although motivated by fraud detection, the underlying approach is applicable more broadly.
Many real-world settings exhibit similar characteristics:
A primary task with sparse or delayed labels
Related auxiliary signals that are less sparse
Distributional shift or the emergence of unseen behaviour
Multi-task learning provides a structured way to incorporate related signals into representation learning. Domains that currently rely on multiple narrowly scoped models may benefit from learning shared structure instead. For us, this work represents a step towards reducing reliance on purely reactive fraud detection. Traditional approaches often require new behaviours to be observed, labelled, and incorporated into retraining cycles before models can respond effectively. By learning shared structure across related tasks, multi-task models can provide stronger coverage from the outset and adapt more effectively to emerging patterns.
Interested in a career at Monzo?
If what you’ve read here resonates with you, we’re hiring for Machine Learning Scientists, Data Scientists, Engineers, Product Managers and many more across Monzo! Take a look at our careers page to see if we have the right role for you.