Using topic modelling to understand customer saving goals

Pots are a great way for our customers to set money aside from their main account. Our customers open Pots for various reasons: to save towards big goals such as a house deposit or holiday, to put money aside for monthly bills, or simply to accrue leftover money for a rainy day. Pot names are one place at Monzo where we allow customers to enter free text, including emojis.

A screenshot of the Edit Pot screen where customers name their Pots

This creates a complicated dataset: almost 5 million unique Pot names have been created to date, and with 350k new Pots made each month, the dataset continues to grow. The free-text format, the inclusion of emojis, and the fact that Pot names are generally short mean it is difficult to understand how our customers are using Pots, and what for, without scanning through all the Pot names individually.

One approach that can help us cluster Pot names is topic modelling, a generally unsupervised machine learning technique which can detect patterns in text data. In this post, we describe how we use topic modelling to understand how customers are naming their Pots, and some insights we see in the categories produced. Knowing what our customers are saving for helps in our research at Monzo when deciding on what product features best serve our customers’ needs.

Data

Machine learning models work by using data to learn (training data) so that they can then predict based on new unseen data (live data). For our topic modelling we used a sample of 800k unique Pot names created by our customers. 50% of those are just one word, such as “Savings” or “Bills”. And our longest Pot name is 574 words!

Text data needs a lot of cleaning before a model can make good use of it. For us, this included converting words to lowercase and removing punctuation. The data also contained a number of typos and abbreviations, so we had to do some work to find and replace words of a similar meaning, e.g. “monthly” to “month”, and “bday” to “birthday”. Mapping variants of a word to a single base form in this way is closely related to lemmatization.
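As a rough illustration of these cleaning steps (not the code used for this analysis), the sketch below lowercases a Pot name, strips ASCII punctuation and applies a replacement table. The `REPLACEMENTS` entries and the function name `clean_pot_name` are hypothetical; the real list of typos and abbreviations is not given here.

```python
import string

# Hypothetical replacement table; the real list of typos and
# abbreviations is not published, so these entries are illustrative only.
REPLACEMENTS = {"monthly": "month", "bday": "birthday"}

# Translation table that deletes ASCII punctuation. Emojis are not ASCII
# punctuation, so they pass through untouched.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def clean_pot_name(name: str) -> str:
    """Lowercase, drop ASCII punctuation, and map known variants to one form."""
    lowered = name.lower().translate(_PUNCTUATION_TABLE)
    tokens = [REPLACEMENTS.get(token, token) for token in lowered.split()]
    return " ".join(tokens)


print(clean_pot_name("Monthly Bills!"))    # month bills
print(clean_pot_name("Mum's BDAY fund"))   # mums birthday fund
```

A lookup table like this only catches variants someone has already spotted, which is why this step tends to be iterative.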

Finally, 8% of Pots contain at least one emoji. Some of these can be really useful in identifying what the customer is saving for, so next we replaced the emojis with their text equivalent, e.g. 🏠 > :house:.
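The emoji replacement can be sketched in the same spirit. The mapping `EMOJI_NAMES` below is a tiny hypothetical stand-in for a full emoji lookup (in practice a library such as the `emoji` package and its `demojize` function could supply one), and this simplified version only handles emojis made of a single code point.

```python
# Hypothetical, deliberately tiny lookup table. A real pipeline would need a
# complete mapping and would also have to handle multi-code-point emojis.
EMOJI_NAMES = {
    "🏠": ":house:",
    "🎄": ":christmas_tree:",
    "🌴": ":palm_tree:",
}


def replace_emojis(name: str) -> str:
    """Swap each known emoji for its text equivalent, padded with spaces."""
    pieces = []
    for char in name:
        if char in EMOJI_NAMES:
            pieces.append(f" {EMOJI_NAMES[char]} ")
        else:
            pieces.append(char)
    # Collapse the extra whitespace introduced by the padding.
    return " ".join("".join(pieces).split())


print(replace_emojis("house deposit🏠"))  # house deposit :house:
```

Padding each replacement with spaces means an emoji typed directly after a word still becomes its own token.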

Topic Modelling

The method we chose to help understand Pot names is called topic modelling.

Topic modelling is a machine learning technique which detects patterns in words and phrases used within a selection of documents to categorise them into a pre-defined number of topics.

When working with text data it can be useful to group “documents” (in our case, Pot names) into themes. Topic modelling allows us to do this by identifying words and phrases which are commonly seen together.

It works by counting how many times pairs of words appear together. Commonly it will look for words that sit next to each other in a sentence; if you think about longer documents, such as a book or a scientific paper, you want to ensure the two words are within the same context as opposed to in paragraphs discussing two very different things. However, as Pot names are generally very short we used a specific type of topic modelling called biterm topic modelling (BTM) that counts when words are used together in the entirety of the Pot name. Then by calculating the strength of the relationship between words, the model can group similar Pots together.
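To make the idea of a biterm concrete, here is a minimal sketch of the counting step only. It treats every unordered pair of words in a Pot name as one biterm, which is the input BTM works from. It does not perform the topic inference itself, and the function names are our own for illustration.

```python
from collections import Counter
from itertools import combinations


def biterms(tokens):
    """Return every unordered word pair in one short document."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def count_biterms(pot_names):
    """Count biterms across a collection of cleaned Pot names."""
    counts = Counter()
    for name in pot_names:
        counts.update(biterms(name.split()))
    return counts


counts = count_biterms(["summer holiday spain", "holiday spain", "bills"])
print(counts[("holiday", "spain")])   # 2
print(counts[("holiday", "summer")])  # 1
```

The one-word name “bills” produces no biterms at all, so a model built on word pairs has nothing to learn from it.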

As an unsupervised technique, our topic modelling required a fair amount of tuning. One of the inputs to the model is the number of topics you would like your texts to be categorised into. To find the optimum number, we trained the model with different numbers of topics and calculated the coherence score for each. This score measures the degree of similarity between high-scoring words in a topic. A larger coherence score means topics are more coherent, i.e. well categorised. After a few iterations, we selected the number of topics with the highest average coherence score. When studying the output of the final model, clear themes emerged, e.g. Pots focused on bills, holidays, or life events, such as birthdays or hen dos.
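The selection loop can be sketched as follows. The exact coherence measure is not specified here, so this example assumes a UMass-style score based on how often a topic's top words appear in the same document. The function names and the example scores are illustrative only.

```python
import math
from statistics import mean


def umass_coherence(top_words, documents):
    """UMass-style coherence for one topic.

    `top_words` is ordered from most to least probable. Assumes every top
    word occurs in at least one document, otherwise the ratio is undefined.
    """
    doc_sets = [set(doc.split()) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(all(word in doc for word in words) for doc in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_freq(top_words[l]))
    return score


def best_num_topics(coherence_by_k):
    """Pick the topic count whose runs have the highest average coherence."""
    return max(coherence_by_k, key=lambda k: mean(coherence_by_k[k]))


# Made-up scores from repeated training runs, keyed by number of topics.
runs = {10: [-2.0, -1.8], 20: [-1.2, -1.0], 30: [-1.5, -1.6]}
print(best_num_topics(runs))  # 20
```

Averaging over several runs matters because the training is stochastic, so a single run can favour a topic count by chance.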

We used the output of the topic modelling to create a number of categories. Once we’d found an output we were happy with, we used the model to build lists of keywords, phrases and emojis to help us assign Pots to a topic. This had two benefits: we could determine exactly what was driving the categorisation, and we could account for edge cases that the model cannot handle.

One such edge case is that a large proportion of Pot names are just one word. Topic modelling works by understanding the context of words against others in the same document, so for one-word Pots there is no context to understand. Creating lists of keywords that were common across a topic allowed us to categorise these one-word Pots effectively.
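A keyword-based assignment of this kind might look like the sketch below. The topics and keywords in `TOPIC_KEYWORDS` are hypothetical examples rather than the real lists, and the first matching topic wins, which is one simple way to resolve names that match more than one list.

```python
# Hypothetical keyword lists; the real ones hold 500+ entries and are not
# published. Dictionary order doubles as match priority.
TOPIC_KEYWORDS = {
    "travel": {"holiday", "trip", ":palm_tree:"},
    "household bills": {"bills", "rent", "council tax"},
    "generic saving": {"savings", "saving", "rainy day"},
}


def assign_topic(clean_name: str) -> str:
    """Return the first topic with a keyword or phrase in the Pot name."""
    # Pad with spaces so keywords only match on whole-word boundaries.
    padded = " " + " ".join(clean_name.split()) + " "
    for topic, keywords in TOPIC_KEYWORDS.items():
        if any(f" {keyword} " in padded for keyword in keywords):
            return topic
    return "uncategorised"


print(assign_topic("savings"))                     # generic saving
print(assign_topic("summer holiday :palm_tree:"))  # travel
```

Because each assignment traces back to a specific keyword, it is easy to see why a Pot landed in a topic, and one-word names are handled the same way as longer ones.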

Output

The modelling gave us 20 distinct topics and over 500 words, phrases and emojis with which to categorise Pots. The topics covered things such as travel 🏝️, life events 🎉, generic saving 💾, and household bills 💸.

This modelling allows us to pull out insights on customer behaviour relating to Pots:

  • 30% of Pots fall into the generic saving category.

  • 15% fall into travel.

We found that the creation of Pots in different topics showed seasonal variance, for example:

🎁 Creation of “life events” Pots peaks towards the end of the year; we found this to be customers saving money for Christmas.

Trend of “life events” Pots created over time

🥶 Pots used for “travel” are commonly created in the new year, perhaps as people try to fight those January blues! We also see increases starting in June as people prepare for their imminent summer holidays. Creation of these Pots declined during the pandemic, but the analysis shows evidence of it recovering.

Trend of “travel” Pots created over time
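Trends like these come from counting Pot creations per topic per month. A minimal sketch of that aggregation is shown below, assuming each Pot is represented as a (creation date, topic) pair. That layout and the sample rows are made up for illustration.

```python
from collections import Counter
from datetime import date


def monthly_creation_counts(pots, topic):
    """Count Pots created per (year, month) for one topic."""
    counts = Counter()
    for created_on, pot_topic in pots:
        if pot_topic == topic:
            counts[(created_on.year, created_on.month)] += 1
    # Sort chronologically so the result can be plotted directly.
    return dict(sorted(counts.items()))


# Made-up sample rows, not real customer data.
sample = [
    (date(2022, 11, 3), "life events"),
    (date(2022, 11, 20), "life events"),
    (date(2022, 12, 1), "travel"),
    (date(2023, 1, 5), "life events"),
]
print(monthly_creation_counts(sample, "life events"))
# {(2022, 11): 2, (2023, 1): 1}
```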

Written language can remove some of the emotion behind our communication. At Monzo, we believe emojis are an important form of expression and Pot names are no exception. Using the output of the topic modelling, we can see which emojis are used most frequently for each topic.

Below are the top emojis for some of the topics we found.

  • For “life events” Pots we see Christmas and gifting are the main themes appearing through emojis.

    Top 5 emojis used in “life events” Pot names
  • For “travel (destination)” Pots, i.e. travel Pots that mention a destination in the name, it looks like Italy, Spain and the United States are top destinations for our customers.

    Top 5 emojis used in “travel” Pot names
  • And for “pets” Pots, customers tend to include the emoji of the pet they are saving for. Dogs win this round!

    Top 5 emojis used in “pet” Pot names
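Rankings like the ones above can be produced with a simple count per topic. The sketch below assumes emojis have already been replaced by their text equivalents (e.g. `:house:`) and that each Pot is a (name, topic) pair. The regular expression and the sample rows are illustrative assumptions.

```python
import re
from collections import Counter, defaultdict

# Matches text-equivalent emoji tokens such as ":house:".
EMOJI_TOKEN = re.compile(r":[a-z0-9_+-]+:")


def top_emojis_by_topic(pots, n=5):
    """Return the n most frequent emoji tokens for each topic."""
    counters = defaultdict(Counter)
    for name, topic in pots:
        counters[topic].update(EMOJI_TOKEN.findall(name))
    return {
        topic: [token for token, _ in counter.most_common(n)]
        for topic, counter in counters.items()
    }


# Made-up sample rows, not real customer data.
sample = [
    ("christmas :christmas_tree:", "life events"),
    ("xmas :christmas_tree: :wrapped_gift:", "life events"),
    ("vet bills :dog:", "pets"),
]
print(top_emojis_by_topic(sample)["life events"])
# [':christmas_tree:', ':wrapped_gift:']
```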

Emojis add further insight into our analysis and understanding of how our customers use Pots. But we can also use this information to add a further layer of categorisation, using emojis to help assign Pot names to a topic.

For example, of the “travel (destination)” Pots that included an emoji, 58% contain a country flag emoji, giving us insight on where the customer is planning to travel. Furthermore, Pot names such as “Holiday ⛷️” and “Holiday 🌴” are identical from an NLP perspective once the emoji is ignored, but the emoji tells us the type of holiday the customer is saving for, with perhaps different goals and end dates.

Pots are a great feature at Monzo which gives our customers the ability to put aside some money from their main account, and personalise it towards a savings goal. Our modelling found that customers have a variety of savings goals, whether it be holidays, life events, or household bills. Knowing what our customers are saving towards helps Monzo better research what product features could improve the experience and serve our customers’ needs.


👩‍💻 Come and join us

If you love working on these types of data challenges, you should come and join us! We’re hiring for several roles in our data team, including:

  1. Director of Data Science (Product)

  2. Senior Data Scientist

  3. Senior Analytics Engineer