Towards Real-Time dbt

Transforming your data with dbt solves a bunch of important problems for you. dbt version controls your transformations, allows your team to collaborate easily, encourages documentation, and unlocks easy testing of your critical data.

However, there’s one thing it doesn’t help you with: reducing the time between data originating somewhere in your business to when you transform that input data into useful results for your business.

If you could somehow speed up the time it takes for new input data to show up in your models, you could: send out customer emails faster, detect fraud sooner, and keep your business-critical dashboards fresher throughout the day. As in all engineering, the options for making this happen represent trade-offs.

dbt latency framework illustration

In this post, we’ll cover some routes you might take to see the updated output of your dbt models faster: changing their materialization types, bumping up their dbt run schedule, hacking together a lambda view, or swapping out your data warehouse altogether. Let’s step through each option and explore the pros and cons - or trade-offs - of each implementation. If at any point you’re interested in learning more or asking questions, please reach out to us on our community Slack or Twitter!

Views

dbt officially supports four types of materializations: table, view, incremental, and ephemeral. These dictate how your transformation will be persisted in your underlying data warehouse once it is dbt run. Table will create a table, view will create a view, incremental will create a table with some extra bits, and ephemeral will use a CTE to return results instead of physically persisting anything in your warehouse.

In theory, the fastest way to see new input data in the output of your dbt models is to use a view materialization. Let’s dig into why that’s true.

Transforming your data on read with Views

When you dbt run models backed by the view materialization, you aren’t actually transforming your data at that moment. Instead, you’re just stashing the definition of your transformation – your SQL query – into your database to be used later. Your data will only be transformed when something – a person, an application, a dashboard – tries to read from that model later on.

Another way to describe this is that when you back your models with the view materialization, you’re electing to transform your data on read. Transforming your data on read is the simplest way to get fresh results from your dbt project. At any point when you SELECT from one of these models, you can be sure that you’re reading the most up-to-date data that’s available. If a row is available in your underlying database at query time, the results that are returned to you should reflect this.

This isn’t true when your models transform your data on write, or as the model is persisting them to the database. An example of this is the table materialization. Each time you dbt run a model backed by a table, dbt will run your query, transform your data, and stuff the transformed data into a new table. While this will improve a different type of latency – query latency – it actually means that you will see less fresh results between transformations. As new input data arrives, you now have to wait for the next dbt run to see that data reflected in your results.

As long as querying a view is fast enough for your end users, the view materialization is likely the fastest choice for incorporating new input in your results. But, at a certain point, you might hit their fatal flaw: views aren’t a scalable solution.

Scalability issues with Views

If you’re materializing your models as views, there are three scalability issues you might hit. The first is that as the amount of data you’re transforming grows, your query latency will get worse. It will take longer for results to be returned to you as you try to read from your models. The second reason is that if your transformations become more complicated, your query latency will also get worse. And the third reason is that you might end up paying a lot.

Each time you transform your data, you’re paying some cost: in wall clock time, in opportunity cost, and in compute credits. And when you are transforming your data on read, you’re paying that cost each and every time you SELECT from a view. This can add up quickly if the model is used often, or by many people.

So even though materializing your models as views is the simplest way to generate fresh results, it might not scale as your project grows. Let’s explore some other, and potentially more scalable, solutions.

Microbatches

So you want to get from raw data to transformed results as fast as possible, but you also want to do it in a way that’s scalable. You’d prefer to limit your costs, and you’d prefer to put a reasonable cap on your query latencies. If you can’t accomplish both goals with models that transform data on read, is it possible to do it with models that transform data on write?

The answer here is maybe! It depends! Let’s take a look at one way this might work.

Transforming your data on write with Microbatches

When you choose a materialization type that transforms your data on write, you’re actually transforming your data with each dbt run. Then, you’re persisting that transformed data to your database. When you go to SELECT from that data later, you’re going to get those results back very quickly. You also won’t be paying the cost to transform that data with each SELECT, so we’ve solved our scalability issue.

But, we’ve slowed down the amount of time it takes to turn our raw data into updated results. Where views always return fresh results, models that transform their data on write can easily fall out of date. All that needs to happen is for a new row of input data to arrive. Because dbt run transforms your input data, the output of your models will not reflect this new data. If you schedule your model to dbt run every hour, it could be out of date for 59 minutes and 59 seconds of that hour. So, so far, this solution doesn’t satisfy both of our requirements.

But, what if we just dbt run these models more often?

Microbatches = Frequent batch jobs

This is what’s called micro-batching! Micro-batching just means that you’re running your batch jobs more often and over smaller accumulations of data. And, logistically, it’s a pretty easy swap. All you really have to do is change the schedule of how often your model is dbt run, and potentially some time-based constraints in your SQL queries.

Micro-batching might solve your problem if you’re reasonable about how often you run your models. It might just work to run your model daily instead of weekly, or every 10 minutes instead of every 60. But, it’s important to be mindful of the lower bound here.

The micro-batching problem

In discussing micro-batching, dbt asks you nicely not to. They warn that if you’re bumping your models dbt run schedule enough that your models are running continuously – or overlapping – you may run into some tough, unexpected problems.

If you’re dbt running your models as often as you reasonably can, but your data still isn’t fresh enough: don’t worry! We’re not out of solutions yet. The next set of proposals dare to ask: what if we let our tools handle the hard stuff for us?

Materialized views, incremental models, and lambda views

Transforming our data on read returns the freshest data. Transforming our data on write scales as our projects grow. What if there was some sort of sweet spot between the two? In this section, we’ll explore a few different implementations of that in-between state.

Materialized views

The first option is to back your models with a materialized view in a traditional database. The first time you dbt run these models, all of your data will be transformed and persisted into that new materialized view object. On subsequent dbt runs, you will have the choice of how to refresh that view to reflect any newly-arrived input data.

In theory, this should mean that you only pay the cost to transform each input row exactly one time. You transform all the data you have at once, and then you transform additional rows on write or on read as they arrive. This solution would address our primary concern of seeing new input data in our results faster while also keeping our other scalability concerns like query latency and cost in check.

Depending on your data and your underlying data warehouse, backing your models with materialized views might just work. But, for many databases, it’s not usually this simple in practice. With some databases, making materialized views work this way could be complicated, costly, or flat-out impossible. (If this is disappointing to you, it was disappointing to us, too! Keep reading to see how we’ve changed things with our materialized views.)

Incremental models

The second option pulls the complexity away from your database and puts it in the hands of dbt: dbt’s incremental materialization. The incremental materialization is dbt’s attempt to manually do the thing that you want your database to do with materialized views: efficiently update your models as new underlying data arrives.

To do this, dbt runs some SQL statements on your behalf with each dbt run. As they explain in their docs, incremental models fall somewhere between table and view performance – or that balance between on write and on read transformations that we’re looking for.

Incremental models get us from new input data to updated model results more quickly, and to date it’s probably the most popular solution! A big benefit is that you can use incremental models with any underlying data warehouse, where support for materialized views is still more limited.

However, incremental materialization also comes with a few costs. The first is that using an incremental materialization for your models makes your

project more complicated – dbt warns that it is mostly for advanced usage. And second, it’s not the most efficient way to solve our problem. To solve our end-to-end latency problem efficiently, and in a way that scales nicely, we’ll need our database to maintain incremental state for us – not dbt. We’ll talk about this more in a bit.

Lambda views

Our third option splits the onus between dbt and the author of the dbt project: lambda views. Lambda views are a different approach to solving the same problem as materialized views and incremental materializations. However, instead of asking your database or dbt to do the hard work, you manually orchestrate the outcome you’d like to see yourself.

The good news about lambda views is that they work. They return up-to-date results each time you query your model, meaning that you are able to see fresh results quickly.

The bad news is that lambda views are the most difficult way out of the three to accomplish this goal. Because you’re not pushing the complexity down to the database or dbt, you have to manually create the stacks of models and the performant filters required to achieve your result. While this is possible, it’s the solution that puts the most burden on you.

Each of these options presents a more scalable solution to our problem of turning inputs into transformed output as quickly as possible. But none of them are entirely satisfying yet. They all come with some cost—due to unnecessary computation, extra computation, or brittle configuration. At this point, you might be wondering, why isn’t there some tool that’s fully solved this problem for us?

The best solution: Incremental view maintenance

We want to speed up the time it takes to turn raw data into generated results. We also want our solution to scale as our amount of data grows and our transformations get more complex. What we really want is for something to transform our input data as it changes – for data freshness – while only transforming that input data once – for scalability.

By describing our goal and our constraints, we’ve found a new solution to our problem: incremental view maintenance.

Incremental view maintenance is a technique for maintaining up-to-date results of a function as its underlying inputs change. This technique transforms data on write: it continually transforms input data as it arrives, which in turn continually updates the result.

Incremental view maintenance is one strategy for keeping materialized views up to date. Earlier, I mentioned that materialized views come close to that sweet spot we’re looking for – they scale better than views, but we do have to do some extra work to ensure they’re returning fresh results. But that’s because most materialized views aren’t maintained incrementally. Most materialized views transform data more like micro-batches: they transform data that’s arrived since their last refresh on some interval or due to some trigger.

Benefits of incremental view maintenance

But when we create incrementally maintained materialized views—or materialized views whose results are updated incrementally as each new row of input data arrives—things look a little different. These materialized views have the main benefit of transforming data on write.

In other words, they’re scalable. They only transform incoming data once, regardless of how often it’s read. The different refresh strategy gives these materialized views an unexpected benefit: they improve our end-to-end latency, too.

Just like the views we examined at the beginning of this post, incrementally-maintained materialized views always return fresh, up-to-date results when queried. So if you’re building a model on top of an incrementally-materialized view, each time you SELECT from that model you will see an up-to-date result.

Materializing your dbt models as incrementally-maintained materialized views is the most scalable way to improve the end-to-end latency of your dbt projects. And, they come with an operational perk.

For other materialization types that transform your data on write, you have to continually dbt run those models to generate fresh results. With incrementally-maintained materialized views, you only have to dbt run your model once. Incrementally-maintained materialized views continue to keep your results up to date as new input data arrives, without any scheduled or manual refreshes.

Updating your dbt projects

In this post, we’ve explored a few different ways you could improve the data freshness of your dbt projects. Depending on the use case you’re working on and the constraints you’re working within, some of these solutions may have stood out more than others. Some of these swaps are simple – like changing your models’ materialization types, or updating their dbt run schedules to run more frequently.

If, however, you’re interested in materialized views that are always up-to-date, you’ll have to bring Materialize into your tech stack. The good news is that we’ve done our best to make this easy. Materialize is a SQL database – that speaks PostgreSQL – and comes with a ready-to-use dbt-materialize dbt adapter. Plugging Materialize into your dbt project should feel like plugging in any other database, with the added benefit of incrementally-maintained materialized views.

If you’re interested in trying out Materialize, we’d love to connect! And if you’d like to chat through any of these ideas more, we’d be happy to connect in our Slack or on Twitter.

Towards Real-Time dbt

Transform, Deliver, and Act

Related Posts You’ll Love