Recursion in Materialize

Differential Dataflow is capable of incrementally updated iterative computation (recursion) but we haven't yet wired it up to SQL. Let's talk about what recursion could look like in Materialize, and why it's important.

Frank McSherry
Frank McSherryCHIEF SCIENTIST

This post originally published on my personal blog here.

Materialize is a SQL database that uses Differential Dataflow for its computational layer. When Differential Dataflow got invented, it introduced one fundamental novelty: incrementally updated iterative computation. You haven't been able to use this in Materialize yet though, for various reasons not the least of which is that SQL's WITH RECURSIVE clause is a bit of a mess.

The good news is that as of quite recently, Materialize has preliminary (behind the --unsafe-mode flag) support for a tentatively named WITH MUTUALLY RECURSIVE clause. This clause differs from SQL's WITH RECURSIVE in some important ways, and I'll explain what those are and why I'm excited about them.

Recursion in SQL

SQL99 introduced the very useful common table expressions (CTEs), and with them the RECURSIVE modifier that allowed recursive common table expressions. A common table expression allows you to use the WITH clause to name some expressions and then use them multiple times in your query, without resorting to copy/paste:

-- Form the triangles (a, b, c) in a graph.
WITH
    -- symmetrize directed edges
    symm (a, b) AS (
        SELECT a, b FROM edges
        UNION
        SELECT b, a FROM edges
    ),
    -- use `symm` to find length-two paths.
    path2 (a, b, c) AS (
        SELECT DISTINCT e1.a, e1.b, e2.b as c
        FROM symm e1, symm e2
        WHERE e1.b = e2.a
    )
-- Produce triples (a, b, c) where symm(a, c) and path2(a, b, c) exist.
SELECT DISTINCT path2.a, path2.b, path2.c
FROM path2, symm
WHERE path2.a = symm.a
  AND path2.c = symm.b;

You can even use the bindings in subsequent expressions, as we did with symm in path2.

Excitingly, the SQL folks realized that something really neat happens if you allow a binding to refer to itself. Hearts full of excitement (one imagines) they introduced the RECURSIVE modifier that allows this.

WITH RECURSIVE
    -- symmetrize directed edges
    symm (a, b) AS (
        SELECT a, b FROM edges
        UNION
        SELECT b, a FROM edges
    ),
    -- LOOK THIS IS RECURSIVE!!!
    reach (a, b) AS (
        SELECT * FROM symm
        UNION
        SELECT symm.a, reach.b
        FROM symm, reach
        WHERE symm.b = reach.a
    )
SELECT * FROM reach;

This is the classic example of recursion that you see in languages like Datalog, and StackOverflow pages discussing WITH RECURSIVE, but relatively rarely in actual SQL queries. Why is that?

As it turns out, WTIH RECURSIVE has a bevy of limitations and mysterious semantics (four pages of limitations in the version of the standard I have, and I still haven't found the semantics yet). I certainly cannot enumerate, or even understand the full list, and will defer to the likes of @teggy to expound upon the issues. Fortunately, @teggy does provide a worked example that encapsulates my confusion, that (in PostgreSQL at least)

mcsherry=# WITH RECURSIVE
    t(n) AS (
        VALUES (1)
        UNION ALL
        (
            WITH t AS (SELECT * FROM t)
            SELECT t1.n + t2.n AS n
            FROM t AS t1, t AS t2
            WHERE t1.n < 256
        )
    )
SELECT * FROM t;
  n
-----
   1
   2
   4
   8
  16
  32
  64
 128
 256
(9 rows)

mcsherry=#

There are so many things I don't understand here. Why only powers of two rather than any of their sums? Why no requirement that t2.n be bounded? Why isn't the result a fixed-point of the query that defines t?

The above is an example of "non-linear recursion" (t is used twice in the recursive term), which is both defined and forbidden in the SQL standard. Except that the SQL standard defines linear recursion to be a query that references the recursive term only once, which is a syntactic rather than semantic constraint. They seemed to forget that this was in the part of the standard (WITH clauses) used to rebind names. So according to the SQL standard the above query should be accepted as "linear recursion", and just has the crazy-pants semantics of "evaluate as if linearly recursive".

Recursion in Materialize

Materialize doesn't support SQL's WITH RECURSIVE and based on the complexity of the spec may never support it. Instead, it supports what I (naively?) think is a simpler, and yet more expressive fragment. I'm a bit worried that I don't understand thet rationale behind the complexity of WITH RECURSIVE, and I both expect and will be delighted to have holes poked in what Materialize does instead.

Materialize's WITH MUTUALLY RECURSIVE clause allows a sequence of bindings, each of which can reference any binding in their body, followed by a body that can also reference any binding.

WITH MUTUALLY RECURSIVE
    -- A sequence of bindings, all of which are in scope for all.
    name1 (col1a type1a, col1b type1b, ..) AS ( select_clause1 ),
    name2 (col2a type2a, col2b type2b, ..) AS ( select_clause2 ),
    ...
body_select_clause

The results of the clause are as if you start each binding from an empty collection, then update the definition of each binding in sequence, repeating the list of all bindings until no changes remain, and then evaluate the body with these final bindings. The computation may never stop, in which case .. there is no result and your computer will probably be busy for a while determining that. But if it does stop, the configuration of bindings will be a fixed point, and the clause returns some query over that fixed point.

The mystifying-to-me WITH RECURSIVE query above can also be expressed using MUTUALLY, as

materialize=> WITH MUTUALLY RECURSIVE
    t (n int) AS (
        VALUES (1)
        UNION ALL
        (
            WITH t AS (SELECT * FROM t)
            SELECT DISTINCT t1.n + t2.n
            FROM t AS t1, t AS t2
            WHERE t1.n < 256 AND t2.n < 256
        )
    )
SELECT * FROM t ORDER by n;
  n
-----
   1
   2
   3
   4
[...]
 507
 508
 509
 510
(510 rows)

materialize=>

This produces what is in my opinion the expected fixed point of the query above: all values from 1 through 510. Rather than just the powers of two strictly less than 512. Which isn't even a fixed point of the update rule.

Let's discuss a few differences from SQL's WITH RECURSIVE:

  1. We had to specify the type of the column of numbers. We require this to make the SQL type resolution substantially easier, and not involve a recursive fixed-point problem when coercable types are used. I can imagine we could relax this in the future, bit it isn't meant to be the most important difference.
  2. We had to add the constraint t2.n < 256. The absence of this constraint from the SQL version, and its termination nonetheless, still blows my mind. Of course you have to bound this, otherwise we would continue increasing numbers through the contributions of t2 even with a bounded t1.
  3. We had to type MUTUALLY. We aren't implementing WITH RECURSIVE correctly, so we have to call it something else. MySQL has a flag you can set to step away from SQL's semantics, but adding a new keyword seems easier for us at the moment.

The main other difference is in the limitations. Whereas SQL has some four pages of restrictions, Materialize has none. Put whatever query you want in the definition of a recursive thing. Don't want to use a UNION or UNION ALL? Don't. Don't want to use linear recursion? Me neither! Want to put another WITH MUTUALLY RECURSIVE clause in definition of a binding? Go right ahead, you devious villain!

Materialize having no restrictions has the comic potential to be a massive dumpster fire once we learn the very important reasons why SQL introduced the constraints. However, it seems the best way to elicit that information is with this sort of post.

Is Recursion Really that Important?

Yes.

Maybe not to you, maybe not to people you work with, or whose work you follow, and that is fine. But yes.

Recursion or iteration are fundamental to programming languages. Languages without them are hobbled in their expressive power. Languages with restricted implementations of them can prevent the description of efficient computation. Languages either without, or with only limited forms, prevent their users from applying the full force of computer science.

I spent a fair few years needling folks in the Big Data and Dabatases spaces, pitting my laptop against their large and powerful computers. The secret (shhh!) was that I had access to more computer science than they did. Differential dataflow could express algorithms that they could not (or did not, because of pain). Perhaps their systems could, with human effort, effect the same computation, but why use a system or language that makes computer science hard?

Example 1: Undirected Connectivity

Let's take a first example from the recent and readable A Fix for the Fixation on Fixpoints: undirected connectivity. The algorithm they use is "label propagation": each graph node tracks the smallest identifier it knows of, starting with its own identifier and repeatedly consulting with its neighbors. You can write this in SQL using WITH RECURSIVE the same way we did reach above, followed by a MIN over the reachable nodes.

WITH RECURSIVE
    -- symmetrize directed edges
    symm (a, b) AS (
        SELECT a, b FROM edges
        UNION
        SELECT b, a FROM edges
    ),
    -- LOOK THIS IS RECURSIVE!!!
    reach (a, b) AS (
        SELECT * FROM symm
        UNION
        SELECT symm.a, reach.b
        FROM symm, reach
        WHERE symm.b = reach.a
    )
-- Report the smallest reachable node.
SELECT a, MIN(b) FROM reach GROUP BY a

The paper observes that this query is frustrating because you cannot clearly communicate that as you develop reach you can discard all but the smallest b for each a. You could rely on a sophisticated query optimizer to determine that it can push the MIN into the recursive definition. However, if you and that optimizer disagree on what passes for "sophisticated", you are out of luck. The paper proposes a WITH ITERATIVE construct that makes some different choices than we did, but it also allows you to communicate what data are not required.

In Materialize we can write label propagation as

WITH MUTUALLY RECURSIVE
    -- symmetrize edges
    symm (a int, b int) AS (
        SELECT a, b FROM edges
        UNION
        SELECT b, a FROM edges
    ),
    -- iteratively improve all labels
    label (a int, comp int) AS (
        SELECT a, MIN(comp)
        FROM (
            SELECT a, a AS comp FROM symm
            UNION ALL
            SELECT symm.a, label.comp
            FROM symm, label
            WHERE symm.b = label.a
        )
        GROUP BY a
    )
SELECT * FROM label;

You just describe how you should update label each iteration, in this case by grouped by a keeping the smallest comp. You don't need to end the definition with a UNION especially if that isn't what you want cc to have each iteration. And indeed, in Materialize the memory footprint of this query will stay bounded as the iterations proceed.

A proponent of declarative languages might prefer the WITH RECURSIVE version as "more declarative": you say what you want rather than how to get it. A proponent of imperative languages might counter that at the end of the day someone has to implement this efficiently, and if you won't do it at least don't prevent me. Fortunately, you can just write whichever you prefer.

Example 2: Dynamic Programming

A second example from A Fix for the Fixation on Fixpoints is the CYK algorithm for parsing context-free grammars. There they make the point that it (like other dynamic programming algorithms) are great examples where non-linear recursion is crucial.

-- Symbols, and literals each produces.
CREATE TABLE grammar_terms (lhs int, lit int);
-- Symbols, and two symbols each produces.
CREATE TABLE grammar_nonts (lhs int, rhs1 int, rhs2 int);
-- An input string with literals at positions.
CREATE TABLE input (pos int, lit int);

WITH MUTUALLY RECURSIVE
    -- Ranges `[lower, upper)` that can be produced by `symbol`.
    parses (lower int, upper int, symbol int) AS (
        -- Base case: each literal is produced by some symbols.
        SELECT pos, pos+1, lhs
        FROM input, grammar_terms
        WHERE input.lit = grammar_terms.lit
        UNION
        -- Recursive case: two adjacent parses that follow the grammar.
        SELECT p1.lower, p2.upper, lhs
        FROM parses p1, parses p2, grammar_nonts
        WHERE p1.upper = p2.lower
          AND p1.symbol = grammar_nonts.rhs1
          AND p2.symbol = grammar_nonts.rhs2
    )
SELECT * FROM parses;

We use parses twice in the recursive branch, and it is important for correctness that we do so. It sounds like the "Fix" authors think you can get SQL's WITH RECURSIVE to implement this with some head-balancing, but neither they nor I think that is a good idea.

For bonus points, imagine you want to know how to parse the input, rather than only if it parses. You'd have to tweak the query to add to parses breadcrumb columns about how to find derivations for each parses row, for example columns via, rhs1, and rhs2 for the columns equated in the join. However, you don't need to keep all the derivations for each row in parses; one will suffice. This is again a data reduction we could explain in the language, as with undirected connectivity, without which we risk a much less efficient implementation.

Example 3: Turing completeness

Turing completeness is the property of a language, framework, or system that it can simulate a Turing machine, the standard for "things a computer could possibly do". If your platform is Turing complete you can do all the things a computer can do, and if it is not Turing complete there is some class of things your platform just cannot do. This is usually worrying because if you end up needing to do any of those things, you are just out of luck.

Datalog, for example, is a recursion-friendly language that is not Turing complete. SQL is Turing complete via WITH RECURSIVE, but woe betide the casual person who needs to understand this (start reading here about cyclic tag systems). Materialize is Turing complete via WITH MUTUALLY RECURSIVE because you can just implement a Turing machine.

Let's implement a Turing machine!

We'll start with the configuration of the machine, its tape, and its transitions.

-- The head will hold the read position and machine state.
CREATE TABLE initial_head (pos int, state int);
CREATE TABLE initial_tape (pos int, symb int);
-- Halting states are encoded by setting `motion` to zero and `new_symb` to `old_symb`.
CREATE TABLE transitions (old_symb int, old_state int, new_symb int, new_state int, motion int);

If you want to try things, or see an example for the above, here are some inputs that accept input strings indicating the parity of their length.

-- Optionally, initial values that check parity of the input.
INSERT INTO initial_head VALUES (0, 0);
INSERT INTO initial_tape VALUES (0, 1), (1, 1), (2, 1), (3, 1), (4, 1);
-- We are checking even or oddness of the input tape.
INSERT INTO transitions VALUES
    (0, 0, 0, 0, 0),    -- on a blank, halt
    (0, 1, 0, 1, 0),    -- on a blank, halt
    (1, 0, 1, 1, 1),    -- on a symbol, toggle state
    (1, 1, 1, 0, 1);    -- on a symbel, toggle state

With these input tables, we can get the final head position and state with the following query:

WITH MUTUALLY RECURSIVE
    -- Track the machine's head and state.
    head (pos int, state int) AS (
        -- In the first round use `initial_head`; in later rounds use `head`.
        SELECT * FROM head
        UNION  ALL SELECT * FROM initial_head
        EXCEPT ALL SELECT * FROM initial_head_delay
        -- Apply the movement of the machine
        UNION  ALL SELECT new_pos, new_state FROM action
        EXCEPT ALL SELECT old_pos, old_state FROM action
    ),
    -- Track the tape's contents; absent positions are read as blank.
    tape (pos int, symb int) AS (
        -- In the first round use `initial_tape`; in later rounds use `tape`.
        SELECT * FROM tape
        UNION  ALL SELECT * FROM initial_tape
        EXCEPT ALL SELECT * FROM initial_tape_delay
        -- Apply the modification the head makes
        UNION  ALL SELECT old_pos, new_symb FROM action
        EXCEPT ALL SELECT old_pos, old_symb FROM action
    ),
    -- Determine what sort of transition to take.
    action (
        old_pos int, old_state int, old_symb int,
        new_pos int, new_state int, new_symb int
    ) AS (
        WITH
            -- Read the symbol under the head from the tape.
            -- Rewrite absent tape locations as blanks (`0`).
            read (pos, state, symb) AS (
                SELECT
                    head.pos,
                    head.state,
                    CASE
                        WHEN tape.symb IS NULL THEN 0
                        ELSE tape.symb
                    END
                FROM
                    head LEFT JOIN tape ON (head.pos = tape.pos)
            )
        SELECT
            read.pos, read.state, read.symb,
            read.pos + t.motion, t.new_state, t.new_symb
        FROM read, transitions t
        WHERE read.symb = t.old_symb
          AND read.state = t.old_state
    ),
    -- Delayed versions of the input, to retract in the second iteration.
    initial_head_delay(pos int, state int) AS (SELECT * FROM initial_head),
    initial_tape_delay(pos int, symb int) AS (SELECT * FROM initial_tape)
SELECT * FROM head;

There is an awkward _delay idiom used to present input only in the first round, but otherwise the update rules are probably just what you'd write with WITH RECURSIVE if you were allowed to. It even keeps tape indexed by pos and takes time linear in the number of machine actions taken before it halts. How cool is that?

Conclusion

Recursion is important, and doing recursion well is important. If recursion is too complicated or too confusing, you miss out on the opportunity to express valuable things about the intent of your query. That's a pity, because many useful tasks require artful use of recursion to work effectively.

Fortunately, we are well-positioned to make recursion delightful. You don't need to take thing away from SQL other than the restrictions on recursion.

Also go read A Fix for the Fixation on FixPoints.

Caveats

Materialize's WITH MUTUALLY RECURSIVE is currently experimental, and may never land. It certainly isn't optimized yet (the optimizer literally nopes out of any optimizations other than some normalization). If we learn amazing things about why WITH RECURSIVE is the way that it is, we may need to change what we have done!

We also have a late-arriving proposal from Jamie Brandon to use ACTUALLY instead of MUTUALLY.

If you have questions or ideas on recursion in Materialize, and how you could use it, please let us know in the community Slack! We'll be sure to share any updates on this idea there as well.

More Articles

Technical Article

Delta Joins and Late Materialization

Understand how to optimize joins with indexes and late materialization.

Jan 18, 2023

Conceptual Article

Our experience with Rust!

Materialize is written in Rust. Why did we make that decision and how has it turned out for the project?

Gus Wynn

Dec 6, 2022

Technical Article

Real-Time Customer Data Platform Views on Materialize

Let's demonstrate the unique features of Materialize by building the core functionality of a customer data platform.

Andy Hattemer

Oct 19, 2022

Join the Materialize Community

Join hundreds of other Materialize users and connect directly with our engineers.

Join the Community

© 2023 Materialize, Inc. Terms of Service