dbt is a great tool for managing your data pipelines. But it’s not built for real-time out of the box. The good news: you can unlock real-time analytics with dbt by pairing it with Materialize. Here’s how.
The challenge with dbt real-time analytics
Managing data is hard. Managing data pipelines is even harder. The meaning of individual tables or values in your data warehouse gets lost in translation across organizations. Another team’s refactor breaks your team’s pipeline. And, it’s normally very difficult to tell who made what change and when.
dbt alleviates these frustrations by taking over the transformation step in your ETL pipelines. dbt is not itself a data processor; instead, it sits on top of the data warehouse that already contains your extracted and loaded data. dbt allows teams to easily test, document, and version-control their data transformations.
While dbt is a great tool for transforming batch data, it can’t currently transform streaming data in real-time. (The dbt team explicitly warns users about this in a few places.) Here at Materialize, we want to help the world stop batching and start streaming. So we* built a dbt adapter that will allow you to transform your streaming data in real-time using Materialize as your data warehouse.
The rest of this post explores why dbt works best with batch data and how using Materialize unlocks streaming transformations. If you’re eager to get started, the dbt-materialize adapter is here and our sample streaming project is here.
*The dbt-materialize adapter was originally created by Josh Wills and actively shaped by Jeremy Cohen. Thank you for all of your work and support!
dbt and batch data vs. real-time analytics
dbt is great at transforming batch data. However, it can’t transform streaming data efficiently in real-time. To understand why, let’s take a look at how dbt transforms data under the hood.
dbt users define their desired transformations using dbt models. These are SQL files that contain two things (see the sketch after the list):
- A SELECT statement that performs the desired transformation
- A “materialization” parameter
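For instance, a minimal model pairing the two might look like this (the model and source names are invented for illustration):

```sql
-- models/orders_summary.sql (hypothetical model; `orders` is an invented upstream model)
{{ config(materialized='view') }}

select
    order_date,
    count(*) as order_count,
    sum(amount) as total_revenue
from {{ ref('orders') }}
group by order_date
```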
dbt transforms your data each time you run a model. On each run, dbt queries the underlying data warehouse using that model’s SELECT statement. Depending on the model’s materialization parameter, dbt then either persists the result set of the query (the transformed data) into your data warehouse or leaves it to be computed on demand when the model is referenced.
Currently, dbt supports four types of materializations: table, view, incremental, and ephemeral. The table and incremental materializations persist a table, while the view materialization creates a view. The ephemeral materialization persists nothing; instead, dbt interpolates the model’s query into downstream models as a common table expression (CTE).
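As a quick reference, here is the per-model config line for each materialization (a given model uses exactly one):

```sql
{{ config(materialized='table') }}        -- persist a table on each run
{{ config(materialized='view') }}         -- create a view
{{ config(materialized='incremental') }}  -- persist a table; later runs transform only new rows
{{ config(materialized='ephemeral') }}    -- persist nothing; inlined into downstream models as a CTE
```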
The good news is that these database objects are totally sufficient to transform batch data. The bad news is that none of these database objects transform streaming data efficiently.
First, what do I mean by batch and streaming data? Batch data, as the name suggests, is any data that arrives in discrete batches. This can be once a minute, once an hour, or once a day. The important thing is that no new data arrives between batches. Streaming data, however, arrives continually and at no particular schedule.
So, why are these database objects sufficient to transform batch data but not able to efficiently transform streaming data?
Views and CTEs do not physically persist data in your data warehouse. This means that each time you query a model that uses a view or CTE, your data warehouse must re-transform the underlying source data. Each time you transform your source data, you’re paying some cost. While views and CTEs always return up-to-date transformations of your batch and streaming data, they do not do so efficiently.
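To make that cost concrete, a view materialization compiles to roughly the following (schema, table, and column names are hypothetical):

```sql
-- Roughly what dbt emits for a view materialization.
create view analytics.orders_summary as
select
    order_date,
    count(*) as order_count
from raw.orders
group by order_date;

-- Every read re-runs the underlying SELECT, so you pay the full
-- transformation cost on every query:
select * from analytics.orders_summary;
```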
Tables, on the other hand, do physically persist data. More specifically, tables persist the result set from the last time someone ran `dbt run` on their model.
Unlike with views and CTEs, this means that you won’t pay the price of transforming data each time your table is queried. The flip side is that your transformed data can quickly become stale as new data arrives.
This isn’t an issue with batch data because you can simply call `dbt run` for your table each time a new batch arrives. Unfortunately, things aren’t so simple with streaming data.
Because streaming data doesn’t arrive on a schedule, there isn’t a right time to re-run your models to keep them up-to-date and enable real-time analytics. Instead, you’re forced to choose between maximizing data freshness and minimizing transformation costs.
You can minimize your costs by limiting how often you recreate your tables, effectively turning your streaming data into batch data. Or, you can maximize your data freshness by continually recreating your tables. But, this approach will cost you time and money, leave you vulnerable to bugs, and still won’t enable true real-time analytics.
So, what should you do if you want to achieve dbt real-time analytics?
dbt and real-time analytics with streaming data
dbt currently has one official and one unofficial way to approximate transforming streaming data. Neither of these methods truly achieves dbt real-time analytics. Both come at a cost.
The first method to approximate transforming streaming data is to create models with an incremental materialization. The first time you run an incremental model, dbt persists your transformation’s result set into a table in your data warehouse. For subsequent runs, dbt only transforms the subset of source data indicated by your model’s filter predicate. (For example, you might have a filter predicate that only transforms data with a timestamp greater than your model’s last run.)
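A sketch of such a model, using dbt’s is_incremental() macro and {{ this }} to filter down to newly arrived rows (the events source and event_ts column are invented for illustration):

```sql
-- models/events_rollup.sql (hypothetical)
{{ config(materialized='incremental') }}

select
    event_id,
    user_id,
    event_ts
from {{ source('app', 'events') }}

{% if is_incremental() %}
  -- Only transform rows newer than the latest row already persisted.
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```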
Incremental models reduce the severity of the tradeoff that users face when persisting their transformations in tables (data freshness vs. cost). They don’t eliminate the tradeoff entirely. By design, you will probably be paying a lesser cost each time you `dbt run` an incremental model.
(I say “probably” here because even though you’re only transforming a few rows of data with each run, unless you’re filtering cleverly, your model’s SELECT statement will still have to scan the entire underlying source table or view to discover these rows.)
These lesser costs may free you up to run your incremental models more frequently. However, you still can’t run them continuously. By definition, you are still transforming your streaming data with a batch process, and still not unlocking real-time analytics.
The second way to approximate transforming streaming data is the unofficial “lambda view” approach. This method simulates transformations over “near real-time models” by querying a combined historical table and a current view.
This approach incurs the cost of querying both of the underlying database objects using some filter, similar to the incremental materialization. The current view of your data returns up-to-date results but must re-transform the recent data each time.
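A lambda view might be stitched together like this (all names hypothetical; the historical leg is a table maintained by a scheduled model, while the current leg re-transforms recent rows on every query):

```sql
-- models/events_lambda.sql (hypothetical)
{{ config(materialized='view') }}

-- Historical leg: rows already transformed and persisted; cheap to read.
select event_id, user_id, event_ts
from {{ ref('events_historical') }}

union all

-- Current leg: recent rows, re-transformed from the source on every query.
select event_id, user_id, event_ts
from {{ source('app', 'events') }}
where event_ts > (select max(event_ts) from {{ ref('events_historical') }})
```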
Neither of these methods can efficiently transform data in real-time. (And they come with hairy problems if, say, you have streaming data that might arrive late.) To efficiently perform worry-free, real-time transformations of streaming data, dbt would need to persist a database object that updates as new data arrives upstream.
Luckily, there’s a database object that can do this for us: materialized views.
dbt real-time analytics with Materialize
Materialized views in traditional databases behave a bit like dbt’s incremental materialization. When you first create a materialized view, the database physically persists the result set of its query. Then, at some interval or when manually triggered, the database updates the stored result set with recent data. Like the incremental materialization, maintaining these materialized views incurs a variety of costs.
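In PostgreSQL, for instance, the stored result set only changes when someone explicitly refreshes it:

```sql
-- A traditional materialized view persists its results at creation time...
create materialized view orders_summary as
select order_date, count(*) as order_count
from orders
group by order_date;

-- ...then goes stale until you pay to recompute it.
refresh materialized view orders_summary;
```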
This is the exact problem Materialize was created to solve. Unlike traditional materialized views, our materialized views continually update as new data arrives—no refreshes needed. Better yet, we provide up-to-date results with millisecond latency. (For more information about Materialize and our materialized views, check out our documentation.)
So, what does this mean for dbt and streaming data? It means that the first time you run a dbt model on top of Materialize, dbt persists a materialized view. After that, you never have to run your model again.
No matter how much or how frequently your data arrives, your model will stay current. No matter when you query your view, it will return a fresh answer. Just by creating your model with our materialized views, you can confidently and efficiently transform streaming data to achieve real-time analytics.
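With the dbt-materialize adapter, that looks like any other model; the adapter’s custom materialization does the rest. (The model and source names below are invented, and the materialization name reflects the beta adapter at the time of writing.)

```sql
-- models/orders_summary.sql (hypothetical); run `dbt run` once.
{{ config(materialized='materializedview') }}

select
    order_date,
    count(*) as order_count
from {{ ref('orders') }}
group by order_date
```

From then on, querying the model returns up-to-date results, with no re-runs and no refreshes required.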
Try it out!
Excited? Skeptical? Cautiously optimistic? Try it out for yourself! As mentioned before, we have a beta dbt-materialize adapter, a demo streaming project, and a walkthrough of the beloved dbt jaffle_shop example with Materialize. If you have any thoughts, questions, or concerns, please feel free to contact us in our community Slack or in our dbt repos. (Or, when you’re up and running, tell us what you’re transforming in real-time!)