Managing data is hard. Managing data pipelines is even harder. The meaning of individual tables or values in your data warehouse gets lost in translation across organizations. Another team’s refactor breaks your team’s pipeline. And, it’s normally very difficult to tell who made what change and when.
dbt, data build tool, alleviates these frustrations by taking over the transformation step in your ETL pipelines. dbt is not itself a data processor, but instead sits on top of your data warehouse that contains your already extracted and loaded data. dbt allows teams to easily test, document, and version-control their data transformations.
While dbt is a great tool for transforming batch data, it cannot currently transform streaming data in real time. (The dbt team explicitly warns users about this in a few places.) Here at Materialize, we want to help the world stop batching and start streaming. So we* built a dbt adapter that will allow you to transform your streaming data in real time using Materialize as your data warehouse.
The rest of this post will explore why dbt works best with batch data and how using Materialize unlocks streaming transformations. If you’re eager to get started, the dbt-materialize adapter is here, and our sample streaming project is here. Note: The dbt-materialize adapter is an active work in progress and not yet suitable for production use-cases. Please file issues or submit PRs as you see fit, we love feedback!
*The dbt-materialize adapter was originally created by Josh Wills and actively shaped by Jeremy Cohen. Thank you for all of your work and support!
dbt and batch data
dbt is great at transforming batch data, but it cannot transform streaming data efficiently in real time. To understand why, let’s take a look at how dbt transforms data under the hood.
dbt users define their desired transformations using dbt “models”. dbt models are SQL files that contain:
- A SELECT statement that performs the desired transformation
- A “materialization” parameter
dbt transforms your data each time you “run” a model. Each time a model is run, dbt queries the underlying data warehouse using that model’s SELECT statement. The result set of the query (the transformed data) is then either returned directly to the user or persisted into your data warehouse, depending on the model’s materialization parameter.
Currently, dbt supports four types of materializations: table, view, incremental, and ephemeral. The table and incremental materializations persist a table, the view materialization creates a view, and the ephemeral materialization, instead of persisting anything, returns results directly using a common table expression (CTE). The good news is that these database objects are totally sufficient to transform batch data. The bad news is that none of these database objects transform streaming data efficiently.
First, what do I mean by batch and streaming data? Batch data, as the name suggests, is any type of data that arrives in discrete batches. This can be once a minute, once an hour, or once a day. The important thing is that no new data arrives between batches. Streaming data, on the other hand, arrives continually and at no particular schedule.
So, why are these database objects sufficient to transform batch data, but not able to efficiently transform streaming data?
Views and CTEs do not physically persist data to your data warehouse. This means that each time you query a model that uses a view or CTE, your data warehouse must re-transform the underlying source data. And, each time you transform your source data, you are paying some cost. While views and CTEs always return up-to-date transformations of your batch and streaming data, they do not do so efficiently.
Tables, on the other hand, do physically persist data. More specifically, tables persist the result set of the last time their model “dbt run.” Unlike views and CTEs, this means that you won’t pay the price of transforming data each time your table is queried. But, this means that your transformed data can quickly become stale as new data arrives. This is not an issue with batch data because you can simply “dbt run” your table each time a new batch arrives. Unfortunately, things aren’t so simple with streaming data.
Because streaming data does not arrive on a schedule, there is no longer a right time to re-run your models to keep them up-to-date. Instead, you’re forced to choose between maximizing data freshness and minimizing transformation costs. You can minimize your costs by limiting how often you recreate your tables, effectively turning your streaming data into batch data. Or, you can maximize your data freshness by continually recreating your tables. But, this approach will cost you time and money, leave you vulnerable to bugs, and still won’t maintain truly up-to-date results.
So, what should you do if you want to transform streaming data with dbt?
dbt and streaming data
dbt currently has one official and one unofficial way to approximate transforming streaming data. Neither of these methods truly transforms streaming data in real-time, and both come at a cost.
The first method to approximate transforming streaming data is to create models with an incremental materialization. The first time you run an incremental model, dbt persists your transformation’s result set into a table in your data warehouse. For subsequent runs, dbt only transforms the subset of source data indicated by your model’s filter predicate. (For example, you might have a filter predicate that will only transform data with a timestamp greater than your last model’s run.)
While incremental models reduce the severity of the tradeoff that users face when persisting their transformations in tables (data freshness vs cost), they do not eliminate the tradeoff entirely. By design, you will probably be paying a lesser cost each time you “dbt run” an incremental model. (I say “probably” here because even though you’re only transforming a few rows of data with each run, unless you’re filtering cleverly, your model’s SELECT statement will still have to scan the entire underlying source table or view to discover these rows). While these lesser costs may free you up to run your incremental models more frequently, you still will not be able to run them continuously. By definition, you are still transforming your streaming data with a batch process.
The second way to approximate transforming streaming data is the unofficial “lambda view” approach. This method simulates transformations over “near real-time models” by querying a combined historical table and a current view. This approach incurs the cost of querying both of the underlying database objects using some filter, similar to the incremental materialization. The current view of your data returns up-to-date results, but must re-transform the recent data each time.
Neither of these methods can efficiently transform data in real time. (And, they come with hairy problems if, say, you have streaming data that might arrive late.) In order to efficiently perform worry-free, real-time transformations of streaming data, dbt would need to persist a database object that updates as new data arrives upstream. Luckily, there is a database object that can do this for us: materialized views.
dbt and Materialize
Materialized views in traditional databases behave a bit like dbt’s incremental materialization. When a materialized view is first created, the result set of its query is physically persisted in the database. Then, at some interval or when manually triggered, the stored result set is updated with recent data. Like the incremental materialization, maintaining these materialized views incurs a variety of costs.
This is the exact problem Materialize was created to solve. Unlike traditional materialized views, our materialized views continually update as new data arrives–no refreshes needed. Better yet, we provide up-to-date results with millisecond latency. (For more information about Materialize and our materialized views, check out our documentation.)
So, what does this mean for dbt and streaming data? This means that the first time you run a dbt model on top of Materialize, dbt persists a materialized view. Then, you never have to run your model again. No matter how much or how frequently your data arrives, your model will stay up to date. No matter when you query your view, it will return a fresh answer. Just by creating your model with our materialized views, you can confidently and efficiently transform streaming data in real time.
Try it out!
Excited? Skeptical? Cautiously optimistic? Try it out for yourself! As mentioned before, we have a beta dbt-adapter, a demo streaming project and a walkthrough of the beloved dbt
jaffle_shop example with Materialize. If you have any thoughts, questions, or concerns, please feel free to contact us in our community Slack or in our dbt repos. (Or, when you’re up and running, tell us what you’re transforming in real time!)