Managing streaming analytics pipelines with dbt

Marta Paes

Managing streaming analytics pipelines with dbt image

Metabase community stories

This blogpost was originally contributed to the Metabase community blog.

In a previous post, we broke down how to build a streaming analytics pipeline with Materialize and Metabase. The TL;DR is that you can write some SQL against streaming data sources, let Materialize efficiently maintain your results up-to-date as new data arrives, and keep your dashboards light and fresh.

This time around, we’re going to explore how to use dbt to manage and document this workflow end-to-end.

Transforming streaming data with dbt

As much as you want to run your analytics non-stop, one thing you probably don’t want is to let go of the tools that make you productive in batch. What if you could manage your streaming analytics pipelines using the exact same tooling?

Although dbt was built with batch in mind, it’s flexible enough as a framework to serve as a unified transformation layer on top of both batch and streaming backends; as long as that backend is SQL-based, what’s running under the covers becomes an implementation detail.

We’ve built the dbt-materialize adapter to bring streaming transformations (and beyond) to dbt. If this looks familiar…

{{ config(
    materialized ='materializedview'
) }}
SELECT fi.icao24,
FROM {{ ref('stg_flight_information') }} fi
JOIN {{ ref('stg_icao_mapping') }} icao ON fi.icao24 = icao.icao24

…that’s all there is: you define your business logic as dbt models using SQL and some Jinja, deploy the pipeline (once), and Materialize keeps things up and running for you. For use cases that would otherwise require you to redeploy your models multiple times a day ($$), maintain complex incremental logic and make some serious trade-offs to optimize for speed (like, say, correctness), using a dedicated streaming database like Materialize can take you further, faster.

Documenting streaming analytics pipelines

More than standardizing how we reason about and manage analytics workflows, dbt made documentation cool again (well, was it ever really cool before dbt?). With a few YAML files, you can bring data governance to your streaming pipelines, speeding up life-saving processes like data discovery and lineage, and even ensure that your metrics stay put. If the tendency is to treat your dbt projects as the source of truth for business logic and documentation, shouldn’t your BI tool just…get in sync?

How to sync dbt and Metabase

The dbt-metabase plugin is a great starting point if you’re looking to bring dbt and Metabase closer together. For example, you can use it to:

Document Metabase items as dbt exposures

Getting a good understanding of the end-to-end dependencies of a dbt project requires also being able to track dependencies that are external to it, like Metabase questions and dashboards. One way to track these dependencies in the DAG is to declare any use of models downstream as exposures (notice the orange node!), which you can generate automatically using the plugin:

Propagate dbt metadata to the Metabase data model

If you’re already documenting your data model in dbt, the plugin also allows you to derive the Metabase data model from existing model properties and configurations, like table and column descriptions, semantic type definitions and other useful metadata that helps create a shared context between data producers and consumers:

There’s a lot more to explore when it comes to managing streaming analytics pipelines with dbt, so if you’re curious to get your hands down, check out the sample demo used to create the examples above and the Materialize+dbt integration guide.