# How to use dbt to manage Materialize
**Note:** The `dbt-materialize` adapter can only be used with dbt Core. We are working with the dbt community to bring native Materialize support to dbt Cloud!
dbt has become the standard for data transformation (“the T in ELT”). It combines the accessibility of SQL with software engineering best practices, allowing you to not only build reliable data pipelines, but also document, test and version-control them.
In this guide, we’ll cover how to use dbt and Materialize to transform streaming data in real time — from model building to continuous testing.
## Setup
**Minimum requirements:** dbt v0.18.1+
Setting up a dbt project with Materialize is similar to setting it up with any other database that requires a non-native adapter. To get up and running, you need to:
1. Install the `dbt-materialize` plugin (optionally using a virtual environment):

    ```bash
    python3 -m venv dbt-venv         # create the virtual environment
    source dbt-venv/bin/activate     # activate the virtual environment
    pip install dbt-materialize      # install the adapter
    ```

    The installation will include `dbt-core` and the `dbt-postgres` dependency. To check that the plugin was successfully installed, run:

    ```bash
    dbt --version
    ```

    `materialize` should be listed under “Plugins”. If this is not the case, double-check that the virtual environment is activated!

2. To get started, make sure you have Materialize installed and running.
## Create and configure a dbt project
A dbt project is a directory that contains all dbt needs to run and keep track of your transformations. At a minimum, it must have a project file (`dbt_project.yml`) and at least one model (`.sql`).
To create a new project, run:
```bash
dbt init <project_name>
```
This command will bootstrap a starter project with default configurations and create a `profiles.yml` file, if it doesn’t exist.
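For reference, the scaffolded project has a layout along these lines (exact folders may vary by dbt version; `my_project` stands in for your project name):

```
my_project/
├── dbt_project.yml    # project configuration
├── models/            # your .sql models live here
├── tests/
├── macros/
└── ...
```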
## Connect to Materialize
dbt manages all your connection configurations (or, profiles) in a file called `profiles.yml`. By default, this file is located under `~/.dbt/`.
1. Locate the `profiles.yml` file in your machine:

    ```bash
    dbt debug --config-dir
    ```

    **Note:** If you started from an existing project but it’s your first time setting up dbt, it’s possible that this file doesn’t exist yet. You can manually create it in the suggested location.
2. Open `profiles.yml` and adapt it to connect to your Materialize instance using the reference profile configuration.

    As an example, the following profile would allow you to connect to a Materialize instance running locally (`dev`):

    ```yaml
    default:
      outputs:
        dev:
          type: materialize
          threads: 1
          host: localhost
          port: 6875
          user: materialize
          pass: password
          dbname: materialize
          schema: public
      target: dev
    ```

    The `target` parameter allows you to configure the target environment that dbt will use to run your models.
3. To test the connection to Materialize, run:

    ```bash
    dbt debug
    ```

    If the output reads `All checks passed!`, you’re good to go! The dbt documentation has some helpful pointers in case you run into errors.
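If you manage more than one environment, the same profile can hold additional outputs, and you can pick one at runtime with `--target`. As a sketch (the `prod` host and credentials below are placeholders, not a real deployment):

```yaml
default:
  outputs:
    dev:
      type: materialize
      threads: 1
      host: localhost
      port: 6875
      user: materialize
      pass: password
      dbname: materialize
      schema: public
    prod:
      type: materialize
      threads: 1
      host: prod.example.com                      # placeholder hostname
      port: 6875
      user: materialize
      pass: "{{ env_var('MZ_PROD_PASSWORD') }}"   # keep secrets out of the file
      dbname: materialize
      schema: public
  target: dev   # default target; override with: dbt run --target prod
```

Running `dbt run --target prod` would then execute your models against the `prod` output instead of the default `dev`.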
## Build and run dbt models
In dbt, a model is a `SELECT` statement that encapsulates a data transformation you want to run on top of your database. For dbt to know how to persist (or not) a transformation, the model needs to be associated with a materialization strategy.
When you use dbt with Materialize, your models stay up-to-date without manual or configured refreshes. This allows you to efficiently transform streaming data using the same thought process you’d use for batch transformations on top of any other database.
1. Create a model for each SQL statement you’re planning to deploy. Each individual model should be stored as a `.sql` file under the directory defined by `source-paths` in `dbt_project.yml`.

    As an example, we’ll use the SQL statements in our getting started guide and re-write them as dbt models.
    **Creating a source**

    You can instruct dbt to create a source in Materialize using the custom `source` materialization:

    ```sql
    {{ config(materialized='source') }}

    {% set source_name %}
        {{ mz_generate_name('market_orders_raw') }}
    {% endset %}

    CREATE SOURCE {{ source_name }}
    FROM PUBNUB
    SUBSCRIBE KEY 'sub-c-4377ab04-f100-11e3-bffd-02ee2ddab7fe'
    CHANNEL 'pubnub-market-orders'
    ```
    The `mz_generate_name` macro allows you to generate a fully-qualified name from a base object name. Here, `source_name` would be compiled to `materialize.public.market_orders_raw`.

    **Creating a view**
    dbt models are materialized as views by default, so to create a view in Materialize you can simply provide the SQL statement in the model (and skip the `materialized` configuration parameter):

    ```sql
    SELECT
        ((text::jsonb)->>'bid_price')::float AS bid_price,
        (text::jsonb)->>'order_quantity' AS order_quantity,
        (text::jsonb)->>'symbol' AS symbol,
        (text::jsonb)->>'trade_type' AS trade_type,
        to_timestamp(((text::jsonb)->'timestamp')::bigint) AS ts
    FROM {{ ref('market_orders_raw') }}
    ```
    One thing to note here is that the model depends on the source defined in the previous step. To express this dependency and track the lineage of your project, you can use the dbt `ref()` function (#8744).
    **Creating a materialized view**

    This is where Materialize goes beyond dbt’s incremental models (and traditional databases), with materialized views that continuously update as the underlying data changes:

    ```sql
    {{ config(materialized='materializedview') }}

    SELECT
        symbol,
        AVG(bid_price) AS avg
    FROM {{ ref('market_orders') }}
    GROUP BY symbol
    ```
    When should you use what? We recommend using `materializedview` models exclusively for your core business logic to ensure that you’re not consuming more memory than needed in Materialize. Intermediate or staging views should use the `view` materialization type instead.
2. Run the dbt models:

    ```bash
    dbt run
    ```

    This command generates executable SQL code from any model files under the specified directory and runs it in the target environment. You can find the compiled statements under `/target/run` and `/target/compiled` in the dbt project folder.
3. Using a new terminal window, connect to Materialize to double-check that all objects have been created:

    ```bash
    psql -U materialize -h localhost -p 6875 materialize
    ```

    ```
    materialize=> SHOW SOURCES;
           name
    -------------------
     market_orders_raw

    materialize=> SHOW VIEWS;
         name
    ---------------
     avg_bid
     market_orders
    ```
That’s it! From here on, Materialize makes sure that your models are incrementally updated as new data streams in, and that you get fresh and correct results with millisecond latency whenever you query your views.
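Since `avg_bid` is a materialized view, you can also query it directly at any time and get up-to-date results with no refresh step. For example (the output depends on the live PubNub feed, so none is shown):

```sql
SELECT * FROM avg_bid ORDER BY symbol;
```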
## Materializations
dbt models are materialized as `views` by default, but can be configured to use a different materialization type through the `materialized` configuration parameter. This parameter can be set directly in the model file using:

```sql
{{ config(materialized='materializedview') }}
```
Because Materialize is optimized for real-time transformations of streaming data and the core of dbt is built around batch, the `dbt-materialize` adapter implements a few custom materialization types:
| Type | Details |
| --- | --- |
| `source` | Creates a source. |
| `view` | Creates a view. |
| `materializedview` | Creates a materialized view. |
| `table` | Creates a materialized view (actual table support pending #5266). |
| `index` | (Deprecated) Creates an index. |
| `sink` | Creates a sink. |
| `ephemeral` | Executes queries using CTEs. |
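Besides the in-model `config()` block, the materialization can also be set for whole folders of models in `dbt_project.yml`. A sketch, assuming a project named `mz_get_started` with `staging` and `marts` model subdirectories (both names are illustrative):

```yaml
models:
  mz_get_started:
    staging:
      +materialized: view              # lightweight intermediate views
    marts:
      +materialized: materializedview  # core logic, kept continuously up-to-date
```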
## Test and document a dbt project
### Continuous testing
Using dbt in a streaming context means that you’re able to run data quality and integrity tests non-stop, and monitor failures as soon as they happen. This is useful for unit testing during the development of your dbt models, and later in production to trigger real-time alerts downstream.
1. To configure your project for continuous testing, add a `tests` property to `dbt_project.yml` with the `store_failures` configuration:

    ```yaml
    tests:
      mz_get_started:
        marts:
          +store_failures: true
          +schema: 'etl_failure'
    ```
    This will instruct dbt to create a materialized view for each configured test that can keep track of failures over time. By default, test views are created in a schema suffixed with `dbt_test__audit`. To specify a custom suffix, use the `schema` config.

    **Note:** As an alternative, you can specify the `--store-failures` flag when running `dbt test`.
2. Add tests to your models using the `tests` property in the model configuration `.yml` files:

    ```yaml
    models:
      - name: avg_bid
        description: 'Computes the average bid price'
        columns:
          - name: symbol
            description: 'The stock ticker'
            tests:
              - not_null
              - unique
    ```
    The type of test and the columns being tested are used as a base for naming the test materialized views. For example, the configuration above would create views named `not_null_avg_bid_symbol` and `unique_avg_bid_symbol`.
3. Run the tests:

    ```bash
    dbt test
    ```

    When configured to `store_failures`, this command will create a materialized view for each test using the respective `SELECT` statements, instead of doing a one-off check for failures as part of its execution.

    This guarantees that your tests keep running in the background as views that are automatically updated as soon as an assertion fails.
4. Using a new terminal window, connect to Materialize to double-check that the schema storing the tests has been created, as well as the test materialized views:

    ```bash
    psql -U materialize -h localhost -p 6875 materialize
    ```

    ```
    materialize=> SHOW SCHEMAS;
            name
    --------------------
     public
     public_etl_failure

    materialize=> SHOW VIEWS FROM public_etl_failure;
                name
    -------------------------
     not_null_avg_bid_symbol
     unique_avg_bid_symbol
    ```
With continuous testing in place, you can then build alerts off of the test materialized views using any common PostgreSQL-compatible client library and `TAIL` (see the Python cheatsheet for a reference implementation).
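As a rough sketch of what such an alert could look like, the following uses `psycopg2` (an assumption; any PostgreSQL-compatible driver works) to `TAIL` one of the test views and react to new failures. The connection parameters mirror the local setup above, and the `print`-based handler is a hypothetical stand-in for a real alerting hook:

```python
def tail_statements(view):
    """SQL to declare and drain a cursor over a TAIL of the given view.

    SNAPSHOT = false skips the view's current contents, so only *new*
    failures are reported.
    """
    return [
        f"DECLARE c CURSOR FOR TAIL {view} WITH (SNAPSHOT = false)",
        "FETCH ALL c",
    ]

def stream_failures(view="not_null_avg_bid_symbol", on_failure=print):
    """Block forever, invoking on_failure for each row the TAIL emits."""
    import psycopg2  # assumption: any PostgreSQL-compatible driver works

    declare, fetch = tail_statements(view)
    conn = psycopg2.connect(
        host="localhost", port=6875, dbname="materialize", user="materialize"
    )
    with conn.cursor() as cur:
        cur.execute(declare)
        while True:
            cur.execute(fetch)   # blocks until new rows arrive
            for row in cur:
                on_failure(row)  # e.g. page someone, post to Slack
```

Because each emitted row represents a failing assertion, wiring `on_failure` into your alerting system gives you failure notifications with no polling loop on your side.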
### Documentation
dbt can automatically generate documentation for your project as a shareable website. This brings data governance to your streaming pipelines, speeding up life-saving processes like data discovery (where to find what data) and lineage (the path data takes from source(s) to sink(s), as well as the transformations that happen along the way).
1. Optionally, create a `.yml` file with helpful properties about your project resources (like model and column descriptions, or tests) and add it to the directory where your models live:

    ```yaml
    version: 2

    sources:
      - name: public
        description: "Public schema"
        tables:
          - name: market_orders_raw
            description: "Market order data source (PubNub)"

    models:
      - name: market_orders
        description: "Converts market order data to proper data types"
      - name: avg_bid
        description: "Computes the average bid price"
        columns:
          - name: symbol
            description: "The stock ticker"
            tests:
              - not_null
          - name: avg
            description: "The average bid price"
    ```
2. To generate documentation for your project, run:

    ```bash
    dbt docs generate
    ```

    dbt will grab any additional project information and Materialize catalog metadata, then compile it into `.json` files (`manifest.json` and `catalog.json`, respectively) that can be used to feed the documentation website. You can find the compiled files under `/target`, in the dbt project folder.
3. Launch the documentation website. By default, this command starts a web server on port 8000:

    ```bash
    dbt docs serve #--port <port>
    ```
4. In a browser, navigate to `localhost:8000`. There, you can find an overview of your dbt project, browse existing models and metadata, and in general keep track of what’s going on.

    If you click **View Lineage Graph** in the lower right corner, you can even inspect the lineage of your streaming pipelines!