Towards Real-Time dbt
Here's a framework for thinking about reducing the time between when raw data is available to transform with dbt, and when it is delivering value to your customers.
Transforming your data with dbt solves a bunch of important problems for you: it version controls your transformations, it allows your team to collaborate easily, it encourages documentation, and it unlocks easy testing of your critical data. But, there’s one thing it doesn’t help you with: reducing the amount of time between data originating somewhere in your business – from your application, from your SaaS tools – to when that input data is transformed into useful results for your business.
If you could somehow speed up the amount of time it takes for new input data to show up in your models, you could send out customer emails faster, detect fraud sooner, and keep your business-critical dashboards fresher throughout the day. As in all engineering, the options for making this happen represent sets of trade-offs.
In this post, we’ll cover some routes you might take to see updated output of your dbt models faster: changing their materialization types, bumping up their
dbt run schedule, hacking together a lambda view, or swapping out your data warehouse altogether. Let’s step through each option and explore the pros and cons - or trade-offs - of each implementation. If at any point you’re interested in learning more or asking questions, please reach out to us on our community Slack or Twitter !
dbt officially supports four types of materializations: table, view, incremental, and ephemeral. These dictate how your transformation will be persisted in your underlying data warehouse once it is
table will create a table,
view will create a view,
incremental will create a table with some extra bits, and
ephemeral will use a CTE to return results instead of physically persisting anything in your warehouse.
In theory, the fastest way to see new input data in the output of your dbt models is to use a
view materialization. Let’s dig into why that’s true.
dbt run models backed by the
view materialization, you aren’t actually transforming your data at that moment. Instead, you’re just stashing the definition of your transformation – your SQL query – into your database to be used later. Your data will only be transformed when something – a person, an application, a dashboard – tries to read from that model later on.
Another way to describe this is that when you back your models with the
view materialization, you’re electing to transform your data
on read, meaning when someone or something tries to read it. Transforming your data on read is the simplest way to get fresh results from your dbt project. At any point when you
SELECT from one of these models, you can be sure that you’re reading the most up-to-date data that’s available. If a row is available in your underlying database at query time, it should be reflected in the result that’s returned to you.
This isn’t true when your models transform your data
on write, or as they are being persisted to the database. An example of this is the
table materialization. Each time you
dbt run a model backed by a
table, dbt will run your query, transform your data, and stuff the transformed data into a new table. While this will improve a different type of latency – query latency – it actually means that you will see less fresh results between transformations. As new input data arrives, you now have to wait for the next
dbt run to see that data reflected in your results.
As long as querying a view is fast enough for your end users, the
view materialization is likely the fastest choice for incorporating new input in your results. But, at a certain point you might hit their fatal flaw: views aren’t a scalable solution.
If you’re materializing your models as views, there are three scalability issues you might hit. The first is that as the amount of data you’re transforming grows, your query latency will get worse. It will take longer for results to be returned to you as you try to read from your models. The second reason is that if your transformations become more complicated, your query latency will also get worse. And the third reason is that you might end up paying a lot.
Each time you transform your data, you’re paying some cost: in wall clock time, in opportunity cost, and in compute credits. And when you are transforming your data on read, you’re paying that cost each and every time you
SELECT from a view. So if you expect a model to be used often, or by many people, this can add up quickly.
So even though materializing your models as views is the simplest way to generate fresh results, it might not scale as your project grows. Let’s explore some other, and potentially more scalable, solutions.
So you want to get from raw data to transformed results as fast as possible, but you also want to do it in a way that's scalable. You’d prefer to limit your costs, and you’d prefer to put a reasonable cap on your query latencies. If you can’t accomplish both goals with models that transform data on read, is it possible to do it with models that transform data on write?
The answer here is maybe! It depends! Let’s take a look at one way this might work.
When you choose a materialization type that transforms your data on write, you’re actually transforming your data with each
dbt run and persisting that transformed data to your database. When you go to
SELECT from that data later, you’re going to get those results back very quickly. You also won’t be paying the cost to transform that data with each
SELECT, so we’ve solved our scalability issue.
But, we’ve slowed down the amount of time it takes to turn our raw data into updated results. Where views always return fresh results, models that transform their data on write can easily fall out of date. All that needs to happen is for a new row of input data to arrive. Because your input data is only transformed on
dbt run, the output of your models will not reflect this new data. If your model is scheduled to be
dbt run every hour, it could be out of date for 59 minutes and 59 seconds of that hour. So, so far, this solution doesn’t satisfy both of our requirements.
But, what if we just
dbt run these models more often?
This is what’s called micro-batching! Micro-batching just means that you’re running your batch jobs more often and over smaller accumulations of data. And, logistically, it’s a pretty easy swap. All you really have to do is change the schedule of how often your model is
dbt run, and potentially some time-based constraints in your SQL queries.
Micro-batching might solve your problem if you’re reasonable about how often you run your models. It might just work to run your model daily instead of weekly, or every 10 minutes instead of every 60. But, it’s important to be mindful of the lower bound here. In discussing micro-batching, dbt asks you nicely not to. They warn that if you’re bumping your models
dbt run schedule enough that your models are running continuously – or overlapping – you may run into some tough, unexpected problems.
dbt running your models as often as you reasonably can, but your data still isn’t fresh enough: don’t worry! We’re not out of solutions yet. The next set of proposals dare to ask: what if we let our tools handle the hard stuff for us?
Materialized views, incremental models, and lambda views
Transforming our data on read returns the freshest data. Transforming our data on write scales as our projects grow. What if there was some sort of sweet spot between the two? In this section, we’ll explore a few different implementations of that in-between state.
The first option is to back your models with a materialized view in a traditional database. The first time you
dbt run these models, all of your data will be transformed and persisted into that new materialized view object. On subsequent
dbt runs, you will have the choice of how to refresh that view to reflect any newly-arrived input data.
In theory, this should mean that you only pay the cost to transform each input row exactly one time. You transform all the data you have at once, and then you transform additional rows on write or on read as they arrive. This solution would address our primary concern – seeing new input data in our results faster – while also keeping our other scalability concerns – like query latency and cost – in check.
Depending on your data and your underlying data warehouse, backing your models with materialized views might just work. But, for many databases, it’s not usually this simple in practice. With some databases, making materialized views work this way could be complicated, costly, or flat-out impossible. (If this is disappointing to you, it was disappointing to us, too! Keep reading to see how we’ve changed things with our materialized views.)
The second option pulls the complexity away from your database and puts it in the hands of dbt: dbt’s incremental materialization. The incremental materialization is dbt’s attempt to manually do the thing that you want your database to do with materialized views: efficiently update your models as new underlying data arrives.
To do this, dbt runs some SQL statements on your behalf with each
dbt run. As they explain in their docs, incremental models fall somewhere between
view performance – or that balance between on write and on read transformations that we’re looking for.
Incremental models get us from new input data to updated model results more quickly, and to date it’s probably the most popular solution! A big benefit is that you can use incremental models with any underlying data warehouse, where support for materialized views is still more limited.
But, the incremental materialization also comes with a few costs. The first is that using an incremental materialization for your models makes your project more complicated – dbt warns that it is mostly for advanced usage. And second, it’s not the most efficient way to solve our problem. To solve our end-to-end latency problem efficiently, and in way that scales nicely, we’ll need our database to maintain incremental state for us – not dbt. We’ll talk about this more in a bit.
Our third option splits the onus between dbt and the author of the dbt project: lambda views. Lambda views are a different approach to solving the same problem as materialized views and incremental materializations. But instead of asking your database or dbt to do most or all of the hard work, you manually orchestrate the outcome you’d like to see yourself.
The good news about lambda views is that they work. They return up-to-date results each time you query your model, meaning that you are able to see fresh results quickly. The bad news is that lambda views are the most difficult way out of the three to accomplish this goal. Because you’re not pushing the complexity down to the database or dbt, you have to manually create the stacks of models and the performant filters required to achieve your result. While this is totally possible, it’s the solution that puts the most burden on you.
Each of these options present a more scalable solution to our problem of turning inputs into transformed output as quickly as possible. But none of them are entirely satisfying yet. They all come with some cost – due to unnecessary computation, extra computation, or brittle configuration. At this point you might be wondering, why isn’t there some tool that’s fully solved this problem for us?
Incremental view maintenance
We want to speed up the time it takes to turn raw data into generated results, and we want our solution to scale as our amount of data grows and our transformations get more complex. What we really want is for something to transform our input data as it changes – for data freshness – while only transforming that input data once – for scalability. By describing our goal and our constraints, we’ve found a new solution to our problem: incremental view maintenance.
Incremental view maintenance is a technique for maintaining up-to-date results of a function as its underlying inputs change. This technique transforms data on write: it continually transforms input data as it arrives, which in turn continually updates the result.
Incremental view maintenance is one strategy for keeping materialized views up to date. Earlier, I mentioned that materialized views come close to that sweet spot we’re looking for – they scale better than views, but we do have to do some extra work to ensure they’re returning fresh results. But that’s because most materialized views aren’t maintained incrementally. Most materialized views actually transform data more like micro-batches: they transform data that’s arrived since their last refresh on some interval or due to some trigger.
But when we create incrementally-maintained materialized views – or materialized views whose results are updated incrementally, as each new row of input data arrives – things look a little different. These materialized views come with the main benefit of transforming data on write: they’re scalable. They only transform incoming data once, regardless of how often it’s read. But the different refresh strategy gives these materialized views an unexpected benefit: they improve our end-to-end latency, too.
Just like the views we examined at the beginning of this post, incrementally-maintained materialized views always return fresh, up-to-date results when queried. So if you’re building a model on top of an incrementally-materialized view, each time you
SELECT from that model you will see an up-to-date result.
Materializing your dbt models as incrementally-maintained materialized views is the most scalable way to improve the end-to-end latency of your dbt projects. And, they come with an operational perk. For other materialization types that transform your data on write, you have to continually
dbt run those models to generate fresh results. With incrementally-maintained materialized views, you only have to
dbt run your model once. Incrementally-maintained materialized views continue to keep your results up to date as new input data arrives, without any scheduled or manual refreshes.
Updating your dbt projects
In this post, we’ve explored a few different ways you could improve the data freshness of your dbt projects. Depending on the use case you’re working on and the constraints you’re working within, some of these solutions may have stood out more than others. Some of these swaps are simple – like changing your models’ materialization types, or updating their
dbt run schedules to run more frequently.
If, however, you’re interested in materialized views that are always up-to-date, you’ll have to bring Materialize into your tech stack. The good news is that we’ve done our best to make this easy. Materialize is a SQL database – that speaks PostgreSQL – and comes with a ready-to-use dbt-materialize dbt adapter. Plugging Materialize into your dbt project should feel like plugging in any other database, with the added benefit of incrementally-maintained materialized views.
If you’re interested in trying out Materialize, we’d love to connect! And if you’d like to chat through any of these ideas more, we’d be happy to connect in our Slack or on Twitter.