Source Mapping and Introspection: Debugging Materialize with Materialize

Michael Greenberg

January 30, 2025

We have a new way to understand the performance of views, indexes, and materialized views in Materialize. By mapping runtime data about low-level dataflow operators up to a sensible intermediate language, you'll be better able to identify and refine computationally expensive parts of your queries.

Databases have a lot of `EXPLAIN`ing to do

Databases typically offer some way to understand how queries run. Postgres, for example, has the EXPLAIN statement; running EXPLAIN ... query ... presents the user with a summary of how the plan will be run (what kind of joins, etc.) along with an estimate of the cost.1 Postgres's EXPLAIN ANALYZE statement does one better: it actually runs the query, collecting information as it goes; when the query terminates, it displays the plan, annotated with runtime data like memory used and time spent in each part of the plan.

Materialize has been able to EXPLAIN queries for a long time, but adding in runtime feedback is harder. What would it mean to EXPLAIN ANALYZE CREATE INDEX ...? When should Materialize stop reporting information? Indexes in Materialize don't have an 'end time'! What if you want information about an index that's already running?

We've implemented a new way to glean insights into how your indexes and materialized views are running. To understand how to use it, let's take a quick detour through how Materialize compiles your SQL queries down to dataflows.

Materialize's compilation pipeline

Materialize compiles SQL through a series of intermediate languages: a high-level intermediate language (HIR), a mid-level intermediate language (MIR), and a low-level intermediate language (LIR). A SQL query is translated to an HIR query, which is then translated into one or more MIR queries. Our optimizer does the bulk of its decision making in MIR: planning joins, removing redundancies, and identifying patterns Materialize can run particularly effectively. The compiler then lowers MIR into LIR, our final intermediate representation. LIR is abstract enough to still be a 'high-level' plan, but LIR is low-level enough to explicitly map out all of the details of the plan: how to aggregate, which indexes to use, etc. Having fixed the plan in LIR, translating the LIR to dataflows---the actual runtime engine of Materialize---is straightforward enough.

Materialize already tracks a variety of runtime information in the mz_instrospection schema. But this runtime information is attributed to the dataflow operators that Materialize actually runs. After running our compiler, these dataflow operators don't look anything like the original SQL query! A SQL query might have hundreds of dataflow operators, and it takes real expertise to know which operators correspond to which parts of the query.

To bridge that expertise gap, we've created a source map, called mz_introspection.mz_lir_mapping. We map ranges of dataflow operators up to LIR operators---the fixed plans that are the last intermediate representation in our compiler. Using mz_lir_mapping, you can relate performance statistics---like total computation time and memory usage---to a high-level representation, as seen in our EXPLAIN or Postgres's EXPLAIN ANALYZE.

Mapping dataflow metrics up to LIR

It's easiest to get a feel for what mz_lir_mapping does for you by example. Let's start by generating a sample database tracking which customers bought which products.

CREATE TABLE sales (product_id BIGINT NOT NULL, customer_id BIGINT NOT NULL); 
CREATE INDEX idx_sales_by_product ON sales(product_id); 

 
-- generates 100 products and 10k customers, each buying 16 products 
INSERT INTO sales (product_id, customer_id) 
    WITH 
      product_seed(salt) AS 
        (VALUES ('abc'), ('def'), ('ghi'), ('jkl'), ('mno'), ('qrs'), ('tuv'), ('wxy'), 
                ('zAB'), ('CDE'), ('FGH'), ('IJK'), ('LMN'), ('OPQ'), ('RST'), ('UVW')) 
    SELECT seahash(product_seed.salt || customer_id::text) % 100 AS product_id, 
           customer_id 
      FROM            generate_series(1, 10000) AS customer_id 
           CROSS JOIN product_seed; 
 sql

Since Materialize is deterministic, we'll generate random-feeling data deterministically, using a hash with salt.2

The resulting distribution of purchases across products is fairly uniform:

SELECT MIN(count), AVG(count), ROUND(STDDEV(count), 1) AS stddev, MAX(count) 
FROM (SELECT COUNT(product_id) AS count 
       FROM sales 
   GROUP BY product_id 
   ORDER BY count DESC); 
 sql

min

avg

stddev

max

1485

1600

41.2

1698

With our toy database populated, let's explore how an analytics query performs: who are the top 5 buyers of the top 5 products? First, let's create some views: one for the top 5 products, one to count purchases of those popular products, and one for the top 5 buyers of each of those top 5 products.

-- top 5 most popular products 
CREATE VIEW popular_products(product_id) AS 
 SELECT product_id 
   FROM sales 
GROUP BY product_id 
ORDER BY COUNT(product_id) DESC 
  LIMIT 5; 

 
-- number of purchases of each popular product, per customer 
CREATE VIEW popular_purchases(product_id, customer_id, count) AS 
 SELECT product_id, customer_id, COUNT(customer_id) 
   FROM      popular_products 
        JOIN sales 
        USING (product_id) 
GROUP BY product_id, customer_id; 

 
-- top 5 buyers of each of the top five popular products 
CREATE VIEW top_buyers(customer_id, product_id, count) AS 
  SELECT customer_id, product_id, count 
   FROM (SELECT DISTINCT product_id FROM popular_purchases) products, 
        LATERAL (  SELECT customer_id, count FROM popular_purchases 
                    WHERE product_id = products.product_id 
                 ORDER BY count DESC 
                    LIMIT 5) 
ORDER BY product_id, count DESC; 
 sql

Having defined these views, let's index top_buyers by product_id, so our analytics dashboard can quickly look up who the top buyers of our top products are.

1 CREATE INDEX idx_top_buyers ON top_buyers(product_id); 
 sql

Now that we have the idx_top_buyers index, let's understand its performance using the mz_lir_mapping source mapping.

Attributing memory usage

Materialize's incremental view maintenance trades space for time: we're able to give consistent, up-to-the-minute second (!) answers by caching appropriately. Since caching uses memory, if we're "optimizing a query", then managing memory is the name of the game. Let's combine the new source mapping mz_introspection.mz_lir_mapping (how is our query implemented?) with the metrics data mz_introspection.mz_arrangement_sizes (how much memory are we using?) and the catalog data in mz_catalog.mz_objects (what's defined?) to see how much memory each operator is consuming:

-- attribute memory usage 
 SELECT mo.name AS name, global_id, lir_id, parent_lir_id, 
        REPEAT(' ', nesting * 2) || operator AS operator, 
        pg_size_pretty(SUM(size)) AS size 
   FROM           mz_introspection.mz_lir_mapping mlm 
        LEFT JOIN mz_introspection.mz_arrangement_sizes mas 
               ON (    mlm.operator_id_start <= mas.operator_id 
                   AND mas.operator_id < mlm.operator_id_end) 
             JOIN mz_catalog.mz_objects mo 
               ON (mlm.global_id = mo.id) 
  WHERE mo.name LIKE '%top_buyers%' 
GROUP BY mo.name, global_id, lir_id, operator, parent_lir_id, nesting 
ORDER BY global_id, lir_id DESC; 
 sql

name

global_id

lir_id

parent_lir_id

operator

size

top_buyers

u195

null

TopK::Basic 10

1011 kB

top_buyers

u195

MapFilterProject 9

null

top_buyers

u195

Reduce::Accumulable 8

316 kB

top_buyers

u195

Join::Differential 6 » 7

null

top_buyers

u195

Get::PassArrangements u191

null

top_buyers

u195

Arrange 5

1738 bytes

top_buyers

u195

MapFilterProject 4

null

top_buyers

u195

TopK::Basic 3

36 kB

top_buyers

u195

Arrange 2

null

top_buyers

u195

Reduce::Accumulable 1

20 kB

top_buyers

u195

Get::Arrangement u191

null

idx_top_buyers

u196

null

Arrange 12

1882 bytes

idx_top_buyers

u196

Get::PassArrangements u195

null

The results show information for two high-level objects: the top_buyers view and the idx_top_buyers index. Each of these has a global_id---an internal identifier for that object. Each object has several LIR operators, with most of the operators living in the view.

Each LIR operator represents a high-level part of the plan; you can read more about it in our EXPLAIN docs. Operators form a tree, like expressions in any programming language. We've used mz_lir_mapping's nesting field to indent the operators nicely; ordering by lir_id descending puts the operators in the correct order.3

In mz_lir_mapping, these operators always are a single line, of the form OperatorName c1 c2 ..., where each ci is a "child ID"---the lir_id of one of the inputs of that operator. So TopK::Basic 10 indicates a TopK operator that reads from the operator with lir_id 10 (which, in our example, is a MapFilterProject). The Join::Differential 6 » 7 line indicates a differential join of the inputs in LIR ids 6 and 7.

The size column sums up the sizes in mz_arrangement_sizes for every dataflow operator used to implement a given LIR operator. Looking at the size column above, it seems that the outermost TopK is the expensive one.

It's worth getting a sense of how much runtime data we've just aggregated together when analyzing this relatively simple query. Most LIR operators correspond to quite a few dataflow operators; many have arrangements. Let's adapt our query to count dataflows and arrangements in the top_buyers view:

SELECT REPEAT(' ', nesting * 2) || operator AS operator, 
        operator_id_end - operator_id_start AS dataflow_ops, 
        COUNT(mas.size) AS arrangements 
   FROM           mz_introspection.mz_lir_mapping mlm 
        LEFT JOIN mz_introspection.mz_arrangement_sizes mas 
               ON (    mlm.operator_id_start <= mas.operator_id 
                   AND mas.operator_id < mlm.operator_id_end) 
  WHERE global_id = 'u195' 
GROUP BY lir_id, operator, nesting, operator_id_end, operator_id_start 
ORDER BY lir_id desc; 
 sql

operator

dataflows_ops

arrangements

TopK::Basic 10

162

MapFilterProject 9

Reduce::Accumulable 8

Join::Differential 6 » 7

Get::PassArrangements u191

Arrange 5

MapFilterProject 4

TopK::Basic 3

162

Arrange 2

Reduce::Accumulable 1

Get::Arrangement u191

If we tried to attribute memory at the dataflow level to top_buyers, we would have to poke through these hundreds of dataflow operators and dozens of arrangements. But with our source map above, it's easy to get a structured summary of dataflow metrics: the outermost TopK::Basic 10 operator uses the lion's share of memory.

Setting hints for `TopK` queries

The TopK operator works by building a tower of arrangements of decreasing size: even if you've only asked for the top 5 elements, Materialize can only incrementally maintain the view if somewhere it maintains a complete ranking. (The tower helps us keep latency low and incremental maintenance cheap.) By default, Materialize will allocate eight generously sized levels for the arrangements in a TopK. Our toy example is so small, we're surely wasting a lot of that space. Let's use the LIMIT INPUT GROUP SIZE hint to tell Materialize the expected group size on the input---which informs how tall to make the tower. But what hint should we give?

Materialize already uses runtime data to offer hints on group sizing for existing dataflows, via mz_introspection.mz_expected_group_size_advice. But if we simply take a peek, we'll see that there are two TopK operators (corresponding to TopK::Basic 10 and TopK::Basic 3 in mz_lir_mapping):

1 SELECT * FROM mz_introspection.mz_expected_group_size_advice; 
 sql

dataflow_id

dataflow_name

region_id

region_name

levels

to_cut

savings

hint

Dataflow: materialize.public.idx_top_buyers

33066

TopK

20582

255.0

Dataflow: materialize.public.idx_top_buyers

33329

TopK

715344

4095.0

Which TopK corresponds to which LIMIT clause in our query? An astute observer might have a guess from having attributed memory usage; an experienced field engineer might have a guess from the region_id. But with mz_lir_mapping, we don't have to guess:

-- topk hints 
 SELECT mo.name AS name, mlm.global_id AS global_id, lir_id, parent_lir_id, 
        REPEAT(' ', nesting * 2) || operator AS operator, 
        levels, to_cut, pg_size_pretty(savings) AS savings, hint 
   FROM           mz_introspection.mz_lir_mapping mlm 
             JOIN mz_introspection.mz_dataflow_global_ids mdgi 
               ON (mlm.global_id = mdgi.global_id) 
        LEFT JOIN mz_introspection.mz_expected_group_size_advice megsa 
               ON (    megsa.dataflow_id = mdgi.id 
                   AND mlm.operator_id_start <= megsa.region_id 
                   AND megsa.region_id < mlm.operator_id_end) 
             JOIN mz_catalog.mz_objects mo 
               ON (mlm.global_id = mo.id) 
  WHERE mo.name LIKE '%top_buyers%' 
ORDER BY mlm.global_id, lir_id DESC; 
 sql

name

global_id

lir_id

parent_lir_id

operator

levels

to_cut

savings

hint

top_buyers

u195

null

TopK::Basic 10

699 kB

4095.0

top_buyers

u195

MapFilterProject 9

null

top_buyers

u195

Reduce::Accumulable 8

null

top_buyers

u195

Join::Differential 6 » 7

null

top_buyers

u195

Get::PassArrangements u191

null

top_buyers

u195

Arrange 5

null

top_buyers

u195

MapFilterProject 4

null

top_buyers

u195

TopK::Basic 3

20 kB

255.0

top_buyers

u195

Arrange 2

null

top_buyers

u195

Reduce::Accumulable 1

null

top_buyers

u195

Get::Arrangement u191

null

idx_top_buyers

u196

null

Arrange 12

null

idx_top_buyers

u196

Get::PassArrangements u195

null

The outermost TopK---the one responsible for so much memory---should be sized a little larger than the inner one. Making the fix is not so hard: DROP the old definitions and recreate them with the limits in place:

DROP VIEW popular_products CASCADE; 

 
CREATE VIEW popular_products(product_id) AS 
 SELECT product_id 
   FROM sales 
GROUP BY product_id 
OPTIONS (LIMIT INPUT GROUP SIZE = 255) 
ORDER BY COUNT(product_id) DESC 
  LIMIT 5; 

 
CREATE VIEW popular_purchases(product_id, customer_id, count) AS 
 SELECT product_id, customer_id, COUNT(customer_id) 
   FROM      popular_products 
        JOIN sales 
        USING (product_id) 
GROUP BY product_id, customer_id; 

 
CREATE VIEW top_buyers(customer_id, product_id, count) AS 
 SELECT customer_id, product_id, count 
 FROM (SELECT DISTINCT product_id FROM popular_purchases) products, 
      LATERAL (  SELECT customer_id, count FROM popular_purchases 
                  WHERE product_id = products.product_id 
                OPTIONS (LIMIT INPUT GROUP SIZE = 4095) 
               ORDER BY count DESC 
                  LIMIT 5) 
ORDER BY product_id, count DESC; 

 
CREATE INDEX idx_top_buyers ON top_buyers(product_id); 
 sql

Let's rerun our memory attribution query. We'll see a roughly 70% reduction in memory usage for both of the TopK operators:

name

global_id

lir_id

parent_lir_id

operator

size

top_buyers

u199

null

TopK::Basic 10

305 kB

top_buyers

u199

MapFilterProject 9

null

top_buyers

u199

Reduce::Accumulable 8

316 kB

top_buyers

u199

Join::Differential 6 » 7

null

top_buyers

u199

Get::PassArrangements u191

null

top_buyers

u199

Arrange 5

1738 bytes

top_buyers

u199

MapFilterProject 4

null

top_buyers

u199

TopK::Basic 3

9272 bytes

top_buyers

u199

Arrange 2

null

top_buyers

u199

Reduce::Accumulable 1

20 kB

top_buyers

u199

Get::Arrangement u191

null

idx_top_buyers

u200

null

Arrange 12

1882 bytes

idx_top_buyers

u200

Get::PassArrangements u199

null

(Notice that the the global_ids have changed, because we DROPped and recreated the VIEWs and index.)

What's next?

Eventually, we'll build syntax like EXPLAIN ANALYZE INDEX ... around queries like these---once we know which information helps the most. For now, we've documented these and some other common debugging queries; we expect our field engineering team and users to extend these queries and adapt them to their own ends.

There's a separate language design problem, too: what's the right level of abstraction for EXPLAIN? If you run EXPLAIN PLAN on a query today, we give a very detailed static plan, with many lines for each LIR operator. If you query mz_lir_mapping, you'll see a terse, one-line description for each LIR operator. What's the right level of detail? As users get more experience debugging their live queries, we'll get a better sense of what to show and what to hide.

These source maps have already turned arduous, manual tasks that took hours into quick glances that take minutes. Live debugging information makes it much easier to write better queries... so fire up the Materialize emulator and play around with these new features!

Footnotes

These cost estimates are used in Postgres's query planning, but they should be taken with a grain of salt. ↩

The values here should be stable across versions of Materialize---seahash values should only change at their major version bumps. ↩

lir_ids number the nodes of the LIR abstract syntax tree in a left-to-right, post-order traversal. ↩

Michael Greenberg

Technical Consultant, Materialize

Michael is an assistant professor of computer science at Stevens Institute of Technology, doing research in systems and programming languages. He worked previously at Pomona College, after a postdoc at Princeton and PhD from the University of Pennsylvania.

`1`	`CREATE TABLE sales (product_id BIGINT NOT NULL, customer_id BIGINT NOT NULL);`
`2`	`CREATE INDEX idx_sales_by_product ON sales(product_id);`
`3`
`4`	`-- generates 100 products and 10k customers, each buying 16 products`
`5`	`INSERT INTO sales (product_id, customer_id)`
`6`	`WITH`
`7`	`product_seed(salt) AS`
`8`	`(VALUES ('abc'), ('def'), ('ghi'), ('jkl'), ('mno'), ('qrs'), ('tuv'), ('wxy'),`
`9`	`('zAB'), ('CDE'), ('FGH'), ('IJK'), ('LMN'), ('OPQ'), ('RST'), ('UVW'))`
`10`	`SELECT seahash(product_seed.salt \|\| customer_id::text) % 100 AS product_id,`
`11`	`customer_id`
`12`	`FROM generate_series(1, 10000) AS customer_id`
`13`	`CROSS JOIN product_seed;`

`1`	`SELECT MIN(count), AVG(count), ROUND(STDDEV(count), 1) AS stddev, MAX(count)`
`2`	`FROM (SELECT COUNT(product_id) AS count`
`3`	`FROM sales`
`4`	`GROUP BY product_id`
`5`	`ORDER BY count DESC);`

`1`	`-- top 5 most popular products`
`2`	`CREATE VIEW popular_products(product_id) AS`
`3`	`SELECT product_id`
`4`	`FROM sales`
`5`	`GROUP BY product_id`
`6`	`ORDER BY COUNT(product_id) DESC`
`7`	`LIMIT 5;`
`8`
`9`	`-- number of purchases of each popular product, per customer`
`10`	`CREATE VIEW popular_purchases(product_id, customer_id, count) AS`
`11`	`SELECT product_id, customer_id, COUNT(customer_id)`
`12`	`FROM popular_products`
`13`	`JOIN sales`
`14`	`USING (product_id)`
`15`	`GROUP BY product_id, customer_id;`
`16`
`17`	`-- top 5 buyers of each of the top five popular products`
`18`	`CREATE VIEW top_buyers(customer_id, product_id, count) AS`
`19`	`SELECT customer_id, product_id, count`
`20`	`FROM (SELECT DISTINCT product_id FROM popular_purchases) products,`
`21`	`LATERAL ( SELECT customer_id, count FROM popular_purchases`
`22`	`WHERE product_id = products.product_id`
`23`	`ORDER BY count DESC`
`24`	`LIMIT 5)`
`25`	`ORDER BY product_id, count DESC;`

`1`	`-- attribute memory usage`
`2`	`SELECT mo.name AS name, global_id, lir_id, parent_lir_id,`
`3`	`REPEAT(' ', nesting * 2) \|\| operator AS operator,`
`4`	`pg_size_pretty(SUM(size)) AS size`
`5`	`FROM mz_introspection.mz_lir_mapping mlm`
`6`	`LEFT JOIN mz_introspection.mz_arrangement_sizes mas`
`7`	`ON ( mlm.operator_id_start <= mas.operator_id`
`8`	`AND mas.operator_id < mlm.operator_id_end)`
`9`	`JOIN mz_catalog.mz_objects mo`
`10`	`ON (mlm.global_id = mo.id)`
`11`	`WHERE mo.name LIKE '%top_buyers%'`
`12`	`GROUP BY mo.name, global_id, lir_id, operator, parent_lir_id, nesting`
`13`	`ORDER BY global_id, lir_id DESC;`

`1`	`SELECT REPEAT(' ', nesting * 2) \|\| operator AS operator,`
`2`	`operator_id_end - operator_id_start AS dataflow_ops,`
`3`	`COUNT(mas.size) AS arrangements`
`4`	`FROM mz_introspection.mz_lir_mapping mlm`
`5`	`LEFT JOIN mz_introspection.mz_arrangement_sizes mas`
`6`	`ON ( mlm.operator_id_start <= mas.operator_id`
`7`	`AND mas.operator_id < mlm.operator_id_end)`
`8`	`WHERE global_id = 'u195'`
`9`	`GROUP BY lir_id, operator, nesting, operator_id_end, operator_id_start`
`10`	`ORDER BY lir_id desc;`

`1`	`-- topk hints`
`2`	`SELECT mo.name AS name, mlm.global_id AS global_id, lir_id, parent_lir_id,`
`3`	`REPEAT(' ', nesting * 2) \|\| operator AS operator,`
`4`	`levels, to_cut, pg_size_pretty(savings) AS savings, hint`
`5`	`FROM mz_introspection.mz_lir_mapping mlm`
`6`	`JOIN mz_introspection.mz_dataflow_global_ids mdgi`
`7`	`ON (mlm.global_id = mdgi.global_id)`
`8`	`LEFT JOIN mz_introspection.mz_expected_group_size_advice megsa`
`9`	`ON ( megsa.dataflow_id = mdgi.id`
`10`	`AND mlm.operator_id_start <= megsa.region_id`
`11`	`AND megsa.region_id < mlm.operator_id_end)`
`12`	`JOIN mz_catalog.mz_objects mo`
`13`	`ON (mlm.global_id = mo.id)`
`14`	`WHERE mo.name LIKE '%top_buyers%'`
`15`	`ORDER BY mlm.global_id, lir_id DESC;`

`1`	`DROP VIEW popular_products CASCADE;`
`2`
`3`	`CREATE VIEW popular_products(product_id) AS`
`4`	`SELECT product_id`
`5`	`FROM sales`
`6`	`GROUP BY product_id`
`7`	`OPTIONS (LIMIT INPUT GROUP SIZE = 255)`
`8`	`ORDER BY COUNT(product_id) DESC`
`9`	`LIMIT 5;`
`10`
`11`	`CREATE VIEW popular_purchases(product_id, customer_id, count) AS`
`12`	`SELECT product_id, customer_id, COUNT(customer_id)`
`13`	`FROM popular_products`
`14`	`JOIN sales`
`15`	`USING (product_id)`
`16`	`GROUP BY product_id, customer_id;`
`17`
`18`	`CREATE VIEW top_buyers(customer_id, product_id, count) AS`
`19`	`SELECT customer_id, product_id, count`
`20`	`FROM (SELECT DISTINCT product_id FROM popular_purchases) products,`
`21`	`LATERAL ( SELECT customer_id, count FROM popular_purchases`
`22`	`WHERE product_id = products.product_id`
`23`	`OPTIONS (LIMIT INPUT GROUP SIZE = 4095)`
`24`	`ORDER BY count DESC`
`25`	`LIMIT 5)`
`26`	`ORDER BY product_id, count DESC;`
`27`
`28`	`CREATE INDEX idx_top_buyers ON top_buyers(product_id);`

Databases have a lot of EXPLAINing to do

Materialize's compilation pipeline

Mapping dataflow metrics up to LIR

Attributing memory usage

Setting hints for TopK queries

What's next?

Footnotes

Related Resources

Databases have a lot of `EXPLAIN`ing to do

Setting hints for `TopK` queries