Recursion in Materialize

important
Recursive CTEs are now production-ready, available to all Materialize users, and battle-tested at scale—learn more here.
This post originally published on my personal blog here.
Materialize is a SQL database that uses Differential Dataflow for its computational layer. When Differential Dataflow got invented, it introduced one fundamental novelty: incrementally updated iterative computation. You haven't been able to use this in Materialize yet though, for various reasons not the least of which is that SQL's WITH RECURSIVE clause is a bit of a mess.
The good news is that as of quite recently, Materialize has preliminary (behind the --unsafe-mode flag) support for a tentatively named WITH MUTUALLY RECURSIVE clause. This clause differs from SQL's WITH RECURSIVE in some important ways, and I'll explain what those are and why I'm excited about them.
Recursion in SQL
SQL99 introduced the very useful common table expressions (CTEs), and with them the RECURSIVE modifier that allowed recursive common table expressions. A common table expression allows you to use the WITH clause to name some expressions and then use them multiple times in your query, without resorting to copy/paste:
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
8 | |
9 | |
10 | |
11 | |
12 | |
13 | |
14 | |
15 | |
16 | |
17 | |
18 | |
19 | |
20 | |
You can even use the bindings in subsequent expressions, as we did with symm in path2.
Excitingly, the SQL folks realized that something really neat happens if you allow a binding to refer to itself. Hearts full of excitement (one imagines) they introduced the RECURSIVE modifier that allows this.
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
8 | |
9 | |
10 | |
11 | |
12 | |
13 | |
14 | |
15 | |
16 | |
17 | |
This is the classic example of recursion that you see in languages like Datalog, and StackOverflow pages discussing WITH RECURSIVE, but relatively rarely in actual SQL queries. Why is that?
As it turns out, WTIH RECURSIVE has a bevy of limitations and mysterious semantics (four pages of limitations in the version of the standard I have, and I still haven't found the semantics yet). I certainly cannot enumerate, or even understand the full list, and will defer to the likes of @teggy to expound upon the issues. Fortunately, @teggy does provide a worked example that encapsulates my confusion, that (in PostgreSQL at least)
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
8 | |
9 | |
10 | |
11 | |
12 | |
13 | |
14 | |
15 | |
16 | |
17 | |
18 | |
19 | |
20 | |
21 | |
22 | |
23 | |
24 | |
25 | |
26 | |
27 | |
There are so many things I don't understand here. Why only powers of two rather than any of their sums? Why no requirement that t2.n be bounded? Why isn't the result a fixed-point of the query that defines t?
The above is an example of "non-linear recursion" (t is used twice in the recursive term), which is both defined and forbidden in the SQL standard. Except that the SQL standard defines linear recursion to be a query that references the recursive term only once, which is a syntactic rather than semantic constraint. They seemed to forget that this was in the part of the standard (WITH clauses) used to rebind names. So according to the SQL standard the above query should be accepted as "linear recursion", and just has the crazy-pants semantics of "evaluate as if linearly recursive".
Recursion in Materialize
Materialize doesn't support SQL's WITH RECURSIVE and based on the complexity of the spec may never support it. Instead, it supports what I (naively?) think is a simpler, and yet more expressive fragment. I'm a bit worried that I don't understand the rationale behind the complexity of WITH RECURSIVE, and I both expect and will be delighted to have holes poked in what Materialize does instead.
Materialize's WITH MUTUALLY RECURSIVE clause allows a sequence of bindings, each of which can reference any binding in their body, followed by a body that can also reference any binding.
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
The results of the clause are as if you start each binding from an empty collection, then update the definition of each binding in sequence, repeating the list of all bindings until no changes remain, and then evaluate the body with these final bindings. The computation may never stop, in which case .. there is no result and your computer will probably be busy for a while determining that. But if it does stop, the configuration of bindings will be a fixed point, and the clause returns some query over that fixed point.
The mystifying-to-me WITH RECURSIVE query above can also be expressed using MUTUALLY, as
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
8 | |
9 | |
10 | |
11 | |
12 | |
13 | |
14 | |
15 | |
16 | |
17 | |
18 | |
19 | |
20 | |
21 | |
22 | |
23 | |
24 | |
25 | |
26 | |
27 | |
This produces what is in my opinion the expected fixed point of the query above: all values from 1 through 510. Rather than just the powers of two strictly less than 512. Which isn't even a fixed point of the update rule.
Let's discuss a few differences from SQL's WITH RECURSIVE:
- We had to specify the type of the column of
numbers. We require this to make the SQL type resolution substantially easier, and not involve a recursive fixed-point problem when coercable types are used. I can imagine we could relax this in the future, bit it isn't meant to be the most important difference. - We had to add the constraint
t2.n < 256. The absence of this constraint from the SQL version, and its termination nonetheless, still blows my mind. Of course you have to bound this, otherwise we would continue increasingnumbersthrough the contributions oft2even with a boundedt1. - We had to type
MUTUALLY. We aren't implementingWITH RECURSIVEcorrectly, so we have to call it something else. MySQL has a flag you can set to step away from SQL's semantics, but adding a new keyword seems easier for us at the moment.
The main other difference is in the limitations. Whereas SQL has some four pages of restrictions, Materialize has none. Put whatever query you want in the definition of a recursive thing. Don't want to use a UNION or UNION ALL? Don't. Don't want to use linear recursion? Me neither! Want to put another WITH MUTUALLY RECURSIVE clause in definition of a binding? Go right ahead, you devious villain!
Materialize having no restrictions has the comic potential to be a massive dumpster fire once we learn the very important reasons why SQL introduced the constraints. However, it seems the best way to elicit that information is with this sort of post.
Is Recursion Really that Important?
Yes.
Maybe not to you, maybe not to people you work with, or whose work you follow, and that is fine. But yes.
Recursion or iteration are fundamental to programming languages. Languages without them are hobbled in their expressive power. Languages with restricted implementations of them can prevent the description of efficient computation. Languages either without, or with only limited forms, prevent their users from applying the full force of computer science.
I spent a fair few years needling folks in the Big Data and Databases spaces, pitting my laptop against their large and powerful computers. The secret (shhh!) was that I had access to more computer science than they did. Differential dataflow could express algorithms that they could not (or did not, because of pain). Perhaps their systems could, with human effort, effect the same computation, but why use a system or language that makes computer science hard?
Example 1: Undirected Connectivity
Let's take a first example from the recent and readable A Fix for the Fixation on Fixpoints: undirected connectivity. The algorithm they use is "label propagation": each graph node tracks the smallest identifier it knows of, starting with its own identifier and repeatedly consulting with its neighbors. You can write this in SQL using WITH RECURSIVE the same way we did reach above, followed by a MIN over the reachable nodes.
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
8 | |
9 | |
10 | |
11 | |
12 | |
13 | |
14 | |
15 | |
16 | |
17 | |
18 | |
The paper observes that this query is frustrating because you cannot clearly communicate that as you develop reach you can discard all but the smallest b for each a. You could rely on a sophisticated query optimizer to determine that it can push the MIN into the recursive definition. However, if you and that optimizer disagree on what passes for "sophisticated", you are out of luck. The paper proposes a WITH ITERATIVE construct that makes some different choices than we did, but it also allows you to communicate what data are not required.
In Materialize we can write label propagation as
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
8 | |
9 | |
10 | |
11 | |
12 | |
13 | |
14 | |
15 | |
16 | |
17 | |
18 | |
19 | |
20 | |
21 | |
You just describe how you should update label each iteration, in this case by grouped by a keeping the smallest comp. You don't need to end the definition with a UNION especially if that isn't what you want cc to have each iteration. And indeed, in Materialize the memory footprint of this query will stay bounded as the iterations proceed.
A proponent of declarative languages might prefer the WITH RECURSIVE version as "more declarative": you say what you want rather than how to get it. A proponent of imperative languages might counter that at the end of the day someone has to implement this efficiently, and if you won't do it at least don't prevent me. Fortunately, you can just write whichever you prefer.
Example 2: Dynamic Programming
A second example from A Fix for the Fixation on Fixpoints is the CYK algorithm for parsing context-free grammars. There they make the point that it (like other dynamic programming algorithms) are great examples where non-linear recursion is crucial.
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
8 | |
9 | |
10 | |
11 | |
12 | |
13 | |
14 | |
15 | |
16 | |
17 | |
18 | |
19 | |
20 | |
21 | |
22 | |
23 | |
24 | |
We use parses twice in the recursive branch, and it is important for correctness that we do so. It sounds like the "Fix" authors think you can get SQL's WITH RECURSIVE to implement this with some head-balancing, but neither they nor I think that is a good idea.
For bonus points, imagine you want to know how to parse the input, rather than only if it parses. You'd have to tweak the query to add to parses breadcrumb columns about how to find derivations for each parses row, for example columns via, rhs1, and rhs2 for the columns equated in the join. However, you don't need to keep all the derivations for each row in parses; one will suffice. This is again a data reduction we could explain in the language, as with undirected connectivity, without which we risk a much less efficient implementation.
Example 3: Turing completeness
Turing completeness is the property of a language, framework, or system that it can simulate a Turing machine, the standard for "things a computer could possibly do". If your platform is Turing complete you can do all the things a computer can do, and if it is not Turing complete there is some class of things your platform just cannot do. This is usually worrying because if you end up needing to do any of those things, you are just out of luck.
Datalog, for example, is a recursion-friendly language that is not Turing complete. SQL is Turing complete via WITH RECURSIVE, but woe betide the casual person who needs to understand this (start reading here about cyclic tag systems). Materialize is Turing complete via WITH MUTUALLY RECURSIVE because you can just implement a Turing machine.
Let's implement a Turing machine!
We'll start with the configuration of the machine, its tape, and its transitions.
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
If you want to try things, or see an example for the above, here are some inputs that accept input strings indicating the parity of their length.
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
8 | |
9 | |
10 | |
With these input tables, we can get the final head position and state with the following query:
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
8 | |
9 | |
10 | |
11 | |
12 | |
13 | |
14 | |
15 | |
16 | |
17 | |
18 | |
19 | |
20 | |
21 | |
22 | |
23 | |
24 | |
25 | |
26 | |
27 | |
28 | |
29 | |
30 | |
31 | |
32 | |
33 | |
34 | |
35 | |
36 | |
37 | |
38 | |
39 | |
40 | |
41 | |
42 | |
43 | |
44 | |
45 | |
46 | |
47 | |
48 | |
49 | |
50 | |
51 | |
52 | |
There is an awkward _delay idiom used to present input only in the first round, but otherwise the update rules are probably just what you'd write with WITH RECURSIVE if you were allowed to. It even keeps tape indexed by pos and takes time linear in the number of machine actions taken before it halts. How cool is that?
Conclusion
Recursion is important, and doing recursion well is important. If recursion is too complicated or too confusing, you miss out on the opportunity to express valuable things about the intent of your query. That's a pity, because many useful tasks require artful use of recursion to work effectively.
Fortunately, we are well-positioned to make recursion delightful. You don't need to take thing away from SQL other than the restrictions on recursion.
Also go read A Fix for the Fixation on FixPoints.
