How we scaled our data team from 1 to 30 people (part 1)

In the last three years at Monzo, we've scaled a world class data organisation from zero to 30+ people.

When I built my first data organisation five years ago in a small start-up, it took me a year of zig-zagging to get to a decent stage. When I joined Monzo back in 2016, we achieved the same result in a month because I'd learnt lots of lessons before. And it's no surprise that implementing major changes once your team's already huge is a lot harder than maintaining an existing high standard as you scale it!

For other growing companies looking to scale their data teams, we wanted to share the lessons we've learnt so far, to save you some of the trouble! 😅

  • In this post, we'll walk through how data works at Monzo, and share the high level principles we've used to scale the team.

  • In part two (which we'll publish soon!) we'll share the challenges we've faced at different stages of scaling, and what tactics we've used to overcome them.

How data works at Monzo

Back in November 2016 we published a blog post called "Laying the foundation for a data team." And in it, we shared our vision for the data team we wanted to build. In a nutshell:

We want to build a world where any data scientist can have an idea on the way to work, and explore it end-to-end by midday.

This lets us make quick decisions, and significantly lowers the cost of exploring new analytical questions, which in turn hopefully helps us innovate much faster. It gives data scientists at Monzo a high degree of autonomy, and the ability to explore many problems in a short period of time.

But to make this a reality, we need to have the right structure, environment and tools.

Our ETL stack

An ETL stack is a set of tools that companies use to extract (E), transform (T) and load (L) data from one database into another.

Every backend engineer is responsible for data collection

Every backend engineer at Monzo is in part also a data engineer. Whenever somebody introduces a new backend service, they're responsible for emitting so-called "analytics events" (logs), which are loaded in real time into BigQuery. (There's a bit more magic that happens here, like sanitisation, but that's a story for another post!) This means we don't need dedicated data engineers to create data collection pipelines.
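To make the idea of an "analytics event" a bit more concrete, here's a minimal sketch of what emitting one from a backend service could look like. Everything in it is an assumption for illustration: the field names, the `service.payments` service name, and the in-memory `sink` list (which stands in for whatever transport actually carries events to the warehouse) are hypothetical and aren't taken from Monzo's real platform.

```python
import json
import time
import uuid


def build_analytics_event(service, event_type, payload):
    """Build one analytics event as a plain dict (hypothetical schema)."""
    return {
        "event_id": str(uuid.uuid4()),  # unique id so duplicates can be spotted later
        "service": service,             # which backend service emitted the event
        "event_type": event_type,       # what happened
        "emitted_at": time.time(),      # when it happened (Unix seconds)
        "payload": payload,             # semi-structured details of the event
    }


def emit(event, sink):
    """Serialise the event as one JSON line and hand it to the sink.

    The sink is a stand-in (an assumption) for the real transport.
    """
    sink.append(json.dumps(event, sort_keys=True))


# Usage: a service emits an event at the point where something interesting happens.
sink = []
event = build_analytics_event(
    "service.payments",
    "payment.created",
    {"amount_pence": 1250, "currency": "GBP"},
)
emit(event, sink)
```

The point of the sketch is that the engineer who owns the service also owns the event, so the data arrives semi-structured and ready to be modelled without a separate collection pipeline.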

ETL is everyone's responsibility, and we don't have a business intelligence team

Data scientists should be able to autonomously conduct the whole analytical workflow end to end without having to rely on other teams to first create a usable dataset. This means that we have opted for a structure where there are no gatekeepers for ETL processes, i.e. no traditional BI (Business Intelligence) teams.

SQL is sufficient to get the data transformation work done

If we expect data scientists to own ETL end to end, this process needs to be so simple that any person with good SQL skills can get up to speed within a month of joining. None of the data scientists we have hired so far were used to this process, so we needed to train everyone. In fact, we have an environment where "E" (extract) and "L" (load) from ETL aren't really necessary. All data by default lands in a semi-structured format in BigQuery in real time. Even when we need to load CSVs we try to stick to the same process! So all our data scientists are really responsible for is the "T" (transform). We do 98% of this work via SQL. And we use dbt (data build tool) to manage this transformation work reliably.
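As a rough illustration of a SQL-only "T" step, here's a small sketch that turns raw events into a higher-level table with a single SELECT statement. It's only a sketch under some loud assumptions: Python's built-in `sqlite3` stands in for BigQuery, the `raw_events` and `daily_user_payments` tables and their columns are made up for the example, and the transformation is run by hand rather than being managed by dbt.

```python
import sqlite3

# In-memory SQLite database as a stand-in (assumption) for the warehouse.
conn = sqlite3.connect(":memory:")

# Hypothetical raw events table: data has already landed, so no "E" or "L" here.
conn.execute(
    "CREATE TABLE raw_events ("
    "event_type TEXT, user_id TEXT, amount_pence INTEGER, event_date TEXT)"
)
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?, ?)",
    [
        ("payment.created", "user_a", 1250, "2019-01-01"),
        ("payment.created", "user_a", 300, "2019-01-01"),
        ("payment.created", "user_b", 990, "2019-01-01"),
        ("login", "user_b", None, "2019-01-01"),
    ],
)

# The "T": one SELECT that turns raw events into a higher-level table.
conn.execute(
    """
    CREATE TABLE daily_user_payments AS
    SELECT
        event_date,
        user_id,
        COUNT(*) AS payment_count,
        SUM(amount_pence) AS total_pence
    FROM raw_events
    WHERE event_type = 'payment.created'
    GROUP BY event_date, user_id
    """
)

rows = conn.execute(
    "SELECT user_id, payment_count, total_pence "
    "FROM daily_user_payments ORDER BY user_id"
).fetchall()
print(rows)  # [('user_a', 2, 1550), ('user_b', 1, 990)]
```

In a dbt project the SELECT would live in its own model file and dbt would take care of building the table, but the skill a data scientist needs is the same: writing the query.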

Shared data models are the backbone of our data work

We use the term "data models" to describe the output of the transformation process from analytics events or logs to higher level abstractions. All data scientists work on the same mono-repo of these data models.

Data models are owned by the community and everyone benefits from good ones. At the same time, everyone suffers if we can't maintain them at a decent standard. As you can imagine, this also comes with the challenge of keeping the quality high without suffering from "the tragedy of the commons".

We minimise moving data around between servers

We try to avoid moving data between servers, and instead process it where it's located (which is mostly in BigQuery). This means we avoid anything non-SQL in the ETL pipeline as much as possible, because non-SQL steps significantly raise the entry barrier for new analysts and make the pipeline less scalable and harder to maintain. BigQuery + SQL works 99% of the time. This keeps everything simple.

Analytics engineers are key to our team's success

Our setup is only possible if you have a strong team of analytics engineers. Those engineers should be enthusiastic about building powerful systems which let data scientists do their work without being super technical and without needing to understand what happens under the hood. Analytics engineers are responsible for all the stitching. And if they work on something that doesn't scale linearly with the number of data scientists on the team, they're probably not spending their time on the right thing. If you're interested in the emerging concept of analytics engineering have a read of this post.

Our ways of working

One of the most important principles we live by is to optimise for the medium term speed of the company, rather than the short term speed of individuals. This is absolutely key for us, so let me explain why.

Most analysts usually default to writing ad-hoc queries and Notebooks/R/Excel

The default behaviour of almost all data scientists and analysts I've worked with so far is, unfortunately, to write ad-hoc queries and explore the results quickly in Jupyter Notebook/R/Excel. Most of the time, this is the quickest thing you can do. But it leads to a lot of one-off pieces of work that rarely benefit anyone else in the company.

We ask data scientists to assume that there's almost a 100% chance that someone will ask them to update a piece of work at some point in the future. In the worst case, that'd mean six months into the job you'd be spending 50% of your time updating your previous work instead of doing something new.

Instead, we want every piece of data work to benefit the wider data team

We aim to work in a way that means every individual piece of analysis makes data at Monzo a little bit better. We encourage our data scientists to spend maybe 30% longer answering a question, if it means they do it in a way that's easily reproducible and benefits other people in the company.

For example, this could mean adding missing dimensions to existing data models rather than writing ad-hoc queries, as well as doing the analysis in Looker if it's simple enough. If everyone followed this principle, in most cases we wouldn't even need to figure out where to find the raw data. It'd already be part of the curated data models that someone had to figure out before you.

Having a new dimension in a data model also automatically lets everyone in the business use it through Looker. It also means people are a lot more likely to build on top of your work rather than starting from scratch every time. This lets us solve more business problems faster and encourages collaboration between data scientists across different areas.

This way of working isn't purely about reproducibility. You could achieve reproducibility by putting your notebooks and ad-hoc queries into GitHub, but that wouldn't automatically let other people discover and benefit from them.

We enable self-serve analytics and reproducibility through Looker

Looker is a staple in our analytics stack that empowers non-technical people within the business to self-serve data questions in ~80% of cases. In fact, more than 60% of people at Monzo are weekly active users on Looker! It's key to the success of data at Monzo because anything you do in Looker is almost by default reproducible and "always-on". This lets our data scientists focus on more complex and interesting questions.

We're explicit about data accuracy requirements

We make a distinction between two different types of datasets. For many data-informed decisions, it's fine when something is 98% correct and you're optimising for speed. In other cases (like financial reporting, for example) data needs to be 'bang-on'.

If you treat all the data the same and try to enforce very high accuracy everywhere, you'll often find yourself sacrificing speed. We differentiate between crucial and non-crucial data sets and enforce tougher change management procedures on anything that's tagged as crucial. This distinction also lets us give the whole data community the freedom to contribute to data models.
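One way to picture the crucial versus non-crucial split is as a tag on each data model that decides how strict the review of a change needs to be. The sketch below is purely hypothetical: the model names, the `crucial` tag as a Python set, and the approval counts (one reviewer by default, two for crucial models) are assumptions made for illustration, not a description of Monzo's actual change management process.

```python
# Hypothetical tags per data model; an empty set means "non-crucial".
MODEL_TAGS = {
    "financial_reporting": {"crucial"},
    "marketing_funnel": set(),
}


def required_approvals(changed_models, model_tags, default=1, crucial=2):
    """Return how many approvals a change needs (assumed policy).

    A change touching any model tagged "crucial" needs the stricter count;
    everything else, including untagged models, gets the default.
    """
    needed = default
    for model in changed_models:
        if "crucial" in model_tags.get(model, set()):
            needed = max(needed, crucial)
    return needed


# Usage: a change to a non-crucial model stays lightweight...
light = required_approvals(["marketing_funnel"], MODEL_TAGS)
# ...while touching a crucial model triggers the tougher procedure.
strict = required_approvals(["marketing_funnel", "financial_reporting"], MODEL_TAGS)
```

Keeping the strict path limited to tagged models is what lets everyone else contribute quickly to the rest.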

Our approach to hiring

We look for people who are motivated by making an impact on the business

When you're trying to create a world with a lot of autonomy and want to give people the freedom to explore it, this influences the people you hire.

Broadly, I've come across two types of data scientist: those who are deeply interested in the tech for the tech's sake, and those who are mostly motivated by business impact. And we've found being purely tools-oriented doesn't bode well in our environment. That's because data scientists are most valuable when they work closely with product teams to enable better data products or better data-informed decisions, faster. As a company, we're not yet at a stage where we'd hire purely research-focused data scientists.

So when we hire, we mainly look for people who want to have an impact on the business, and optimise their choice of tools to make the maximum impact. It's also important that we hire people who are curious and want to proactively come up with relevant insights, and who want to shape the direction of our product. In our data team's principles we call this trait the 'mini CEO mindset.'

We make machine learning part of the job to attract the best people

To hire the best people, we had to jump on the machine learning train! There's often a general perception that machine learning is some magical tool that'll fix all your problems (without applying common sense!). But we still encourage all data scientists and analysts to solve problems with machine learning – as long as it really is the best way to find a solution.

Adding an ML component to our data roles also meant we changed our titles to "Data Scientist". And based on some anecdata, this helps us attract more people to our roles. A few years back we A/B tested different titles for open data roles. And by changing the titles from Data Analyst to Data Scientist (without changing the job description), we could attract close to twice as many applications, of significantly higher quality too!

That's it for part one – please let us know your thoughts or feedback in the comments. If you're interested in reading on about the challenges we've experienced while scaling the team, stay tuned for part two!

If you're keen to redefine how banking works, want to shape the direction of a modern data team, and push the boundaries of what's currently possible, please get in touch. We're currently looking for leaders for the following roles:

We'll also soon be hiring for a Data Engineering Lead and a Data Science Lead for Product. If you're interested in those roles please contact 😃