Building an extension framework for dbt


dbt is a fundamental piece of Monzo’s data platform. We adopted it back in May 2019, and we’ve dramatically grown the number of users, data models, and the sheer volume of data it handles. We’re currently at ~6.9k models, with ~150 people making regular contributions to our analytics repository.

Growth of models, tests, and committers since August 2019

Moving to dbt gave us huge improvements in the development workflows of our data pipelines; folks could make safer changes to models in less time than they could before and have more observability over the relationships between the models they were building.

But we began to see users struggle to interact with models, compiling dbt took too long, and we fell further and further behind the current version. At the same time, more engineers who had used dbt previously were joining Monzo and wanted new functionality that we couldn’t support. Something needed to change.

We were running a fork of dbt and knew we needed to upgrade, but wanted to avoid the hassle of maintaining a fork that would likely fall behind again and have similar issues. So we made the big decision to build a whole new extension framework that would give us more flexibility.

This is the story of how that happened…

A short history of Monzo and dbt

At Monzo, we forked dbt almost as soon as we adopted it back in 2019. We needed to adapt dbt to fit into our already-huge data platform, and it couldn’t meet our needs as it was. We discussed contributing back to dbt core at the time, but decided against it because the modifications we needed were too specific to Monzo to benefit the wider community.

We’re heavy Airflow users. Before adopting dbt we had the ability to map Airflow tasks to data models. We wanted to keep that capability, but found that dbt compiled too slowly and the size of our Directed Acyclic Graphs (DAGs) brought down our Airflow instance. So we forked dbt and added, among other features, the ability to share the manifest between compilations.

A Slack message from May 2019 showing how we decided to “hack” in the ability to dry-run dbt against BigQuery early on

We made that decision to help us move fast, but hindsight is a wonderful thing: had we known then what we know now, we might have decided differently.

By forking dbt, we opened Pandora’s box. We continued to add new commands to dbt and modify various pieces of functionality, tailoring the tool to our organisational needs. Many of these customisations worked their way into folks’ daily workflows and became a core part of working with data at Monzo.

Though we had built some cool customisations, a couple of which are covered in Luke’s post, that didn’t stop our data team and analytics engineering team pining for the many great new features dbt Labs had introduced in dbt core since v0.15. Not only were we lacking new features, but the further we fell behind, the less we could leverage the benefits of dbt’s ever-growing community.

As the team responsible for data tooling, we wanted to provide the users of our data platform with all the tools they need to be amazing at their jobs. It became clearer and clearer that keeping up to date with dbt core was a way to help achieve this.

The need for more than dbt core

dbt core does what it does well, albeit slowly if you have a lot of models, and it will generally work for most folks in most use cases.

However, as a dbt project grows in contributors or data models, there will almost certainly be a need to do more with the artefacts that dbt generates than dbt core offers. Often these needs will be quite specific to the environment in which the dbt project has grown.

For example, in the context of CI, we may want to assert things like “all models tagged with model_type=entity must also be tagged with criticality_tier, have 60% test and documentation coverage across columns, and have at least one code owner”. Or, for development purposes, a command that automatically generates or syncs schemas for a given model selector expression, such as sync-dbt-schemas -m +my_entity, could streamline development workflows.
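As a rough sketch (not our actual implementation), the first kind of check can be written directly against dbt’s manifest artifact. The tag convention and the 60% threshold are the hypothetical examples from above, and the manifest layout follows dbt’s documented artifact schema, which varies a little between versions:

```python
import json

# Hedged sketch of a CI assertion over dbt's manifest artifact. The
# model_type=entity / criticality_tier tag convention is the hypothetical
# example from above, not a dbt feature.
with open("target/manifest.json") as f:
    manifest = json.load(f)

failures = []
for unique_id, node in manifest["nodes"].items():
    if node.get("resource_type") != "model":
        continue
    tags = node.get("tags", [])
    if "model_type=entity" not in tags:
        continue
    # Entity models must declare a criticality tier...
    if not any(tag.startswith("criticality_tier") for tag in tags):
        failures.append(f"{unique_id}: missing criticality_tier tag")
    # ...and document at least 60% of their columns.
    columns = node.get("columns", {})
    documented = sum(1 for col in columns.values() if col.get("description"))
    if columns and documented / len(columns) < 0.6:
        failures.append(f"{unique_id}: column documentation below 60%")

if failures:
    raise SystemExit("CI policy failures:\n" + "\n".join(failures))
```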

Given dbt’s vibrant community, it’s no surprise that many utilities have already been built to make working with dbt, and maintaining larger projects with larger and more varied teams, easier. This is great, but on deeper inspection it also reveals a fundamental deficit of dbt as it exists today. If you look at the implementation of many of these tools, a pattern emerges: if you want to develop a tool that uses dbt’s artefacts (DAGs, parsed models, parsed tests), your choices are limited to some combination of the following:

  • shelling out to dbt commands and parsing the returned results from the CLI

  • importing dbt in a Python script and using the returned result objects from dbt commands directly

  • loading the manifest.json or graph pickle files and parsing them

In case you don’t believe me, I collected some examples [dbt_test_coverage, dbt_coves, dbt2looker, dbt-metabase, dbt-exposures-crawler, olivertwist, dbt-invoke]. Each implementation is bespoke, and each duplicates work that dbt core already does internally. This, to me, highlights the problem: there is no good API for accessing information about dbt’s DAGs.
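To make the pattern concrete, here’s roughly what the first option looks like in practice. On recent dbt versions, dbt ls with --output json emits one JSON object per line for each selected node, though flag names have changed across releases:

```python
import json
import subprocess

# Sketch of option one: shell out to `dbt ls` and parse its output.
# On recent dbt versions, --output json prints one JSON object per line
# for each node matched by the selector (flag names vary by version).
result = subprocess.run(
    ["dbt", "ls", "--select", "+my_entity", "--output", "json"],
    capture_output=True,
    text=True,
    check=True,
)

for line in result.stdout.splitlines():
    if not line.strip():
        continue
    node = json.loads(line)
    print(node["resource_type"], node["name"])
```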

As Pedram Navid puts it:

For an open-source CLI, it’s exceptionally hard to integrate well with it. Something as seemingly simple as ‘getting the name of all models without running the entire project against a warehouse’ is actually impossible. No one wants to parse the undocumented hell that is manifest.json, but it’s our only choice.

Time for change

So we needed more than just dbt core, but our own fork with additional features wasn’t fast enough and was lagging significantly behind upstream. We tried a few things to speed it up, including an engineer taking on the challenge of rebuilding it in Go. While that was a cool project, we went from maintaining one tool to two, one of which we didn’t write and couldn’t maintain without additional engineering help. Eventually we combined the two, but that’s a side quest we can cover in a different post.

We knew we needed a different approach, so we went back to our original decision to fork dbt. Was it the right one? Should we upgrade the fork, or try something completely different?

To fork or not to fork?

We had been working with our fork for a long time and had taken it through three version upgrades. The process was difficult: the code changed so much between versions that it was hard to find the functionality we’d injected. Most of the time we’d upgrade, then find we’d inadvertently introduced bugs.

We wanted to get onto the latest version of dbt, but since we had to retain our custom functionality, our choices were limited to:

  1. Staying on v0.15 for the time being and backporting any strongly-desired new features.

  2. Rebasing our fork on the latest version of dbt.

  3. Rebuilding our custom functionality, ideally in a way that doesn’t involve forking the codebase.

  4. Contributing back to dbt core.

The design

We decided to build an extension framework that would let us hook into the dbt codebase at explicitly defined points, so we could build each piece of functionality as its own Python package and maintain it in isolation from all other customisations. This would help us see exactly where we were hooking into the dbt codebase and where we were changing or augmenting it.

However, this approach still required us to couple some behaviour to dbt’s private internal logic. Because dbt doesn’t provide an API for extending its behaviour, this was something we begrudgingly accepted. By being explicit about where we relied on dbt’s internals, we could more easily see what needed to change whenever we upgraded dbt. We could also add unit and integration tests to each package, which would make upgrading a lot easier.
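To make “explicitly defined points” concrete, here’s a minimal sketch of the kind of mechanism involved; the hook name and module paths are illustrative placeholders rather than dbt’s real internal layout:

```python
import importlib

# Each named hook point maps to the dbt-internal function we depend on.
# These paths are illustrative placeholders, not dbt's real module layout;
# the point is that every coupling to dbt internals is declared in one place.
PATCH_POINTS = {
    "parse_model_selector": ("dbt.example.selector", "parse_spec"),
}


def apply_patch(name, wrapper):
    """Wrap the dbt internal registered under `name` with `wrapper`.

    `wrapper` receives the original function as its first argument, so a
    patch can delegate to dbt's behaviour after adjusting the inputs.
    """
    module_path, attr = PATCH_POINTS[name]
    module = importlib.import_module(module_path)
    original = getattr(module, attr)

    def patched(*args, **kwargs):
        return wrapper(original, *args, **kwargs)

    setattr(module, attr, patched)
```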

As with all design decisions at Monzo, this one was not without its trade-offs. We knew there would be an increased cost to maintaining separate packages, but we judged that cost to be less than the cost of maintaining a fork.

We also discussed building the functionality through dbt’s Jinja capabilities. Ultimately we decided that, although Jinja had its strengths, the solutions we would have had to build were too complex for the time we had, and what we could access in the Jinja environment was too limited for the functionality we needed.

Building the extension framework

This is what the code for an extension looks like (simplified):

[Code snippets 1 and 2: a simplified extension]
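As a minimal sketch of the shape an extension takes (dbt_extensions, extension, and ctx here are hypothetical stand-ins for our internal framework, not real dbt APIs):

```python
from dbt_extensions import extension  # hypothetical internal framework


@extension(name="print-selected")
def print_selected(ctx, selector: str) -> None:
    """Adds a `dbt print-selected -m <selector>` command.

    `ctx` wraps dbt's already-parsed manifest and graph, so the extension
    reuses dbt's own node selection rather than re-implementing it.
    """
    for node in ctx.select_nodes(selector):
        print(node.unique_id)
```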

And here it is running:

The extension code above running against the “Jaffle Shop” example project, showing the selected nodes being printed to stdout

Because extensions have access to dbt’s internals, we don’t have to write any code to parse the manifest JSON file, manually traverse the model dependency tree, or re-implement model selector expressions.

We also have the concept of a patch: where an extension adds a new command to dbt, a patch modifies existing behaviour in dbt.

From Pedram Navid’s article:

To this day, I can’t give dbt a file name and hope it figures out what I mean, but instead I still have to remove the .sql at the end. Small gripes.

We can solve this with a patch:

[Code snippets 3 and 4: a simplified patch]
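Sketched with the same hypothetical framework names as before, a patch that strips a trailing .sql from the selector before handing it back to dbt might look like:

```python
from dbt_extensions import patch  # hypothetical internal framework


@patch(target="parse_model_selector")
def allow_sql_suffix(original, selector: str):
    """Lets `dbt run -m my_model.sql` behave like `dbt run -m my_model`."""
    if selector.endswith(".sql"):
        selector = selector[: -len(".sql")]
    return original(selector)
```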

Here it is running:

The patch code above running against the “Jaffle Shop” example project, showing the selector argument being modified before the command is run.

Here are the extensions and patches we have internally so far:

A list of extensions we have built at Monzo, generated using the dbt list-extensions command included in our extension framework.

A list of patches we have built at Monzo, generated using the same command.

Future improvements

We’re aware that dbt Labs is planning some changes to its internal APIs, and it’s likely that each new version of dbt will require some changes on our side. However, we’re confident our approach will mean a less painful upgrade experience than maintaining a fork.

If you’re interested in helping shape the future of our data tooling and practices, join us! We have a few open data roles across Monzo.