Recursion in Materialize
Differential Dataflow is capable of incrementally updated iterative computation (recursion) but we haven't yet wired it up to SQL. Let's talk about what recursion could look like in Materialize, and why it's important.
This post originally published on my personal blog here.
Materialize is a SQL database that uses Differential Dataflow for its computational layer.
When Differential Dataflow got invented, it introduced one fundamental novelty: incrementally updated iterative computation.
You haven't been able to use this in Materialize yet though, for various reasons not the least of which is that SQL's
WITH RECURSIVE clause is a bit of a mess.
The good news is that as of quite recently, Materialize has preliminary (behind the
--unsafe-mode flag) support for a tentatively named
WITH MUTUALLY RECURSIVE clause.
This clause differs from SQL's
WITH RECURSIVE in some important ways, and I'll explain what those are and why I'm excited about them.
Recursion in SQL
SQL99 introduced the very useful common table expressions (CTEs), and with them the
RECURSIVE modifier that allowed recursive common table expressions.
A common table expression allows you to use the
WITH clause to name some expressions and then use them multiple times in your query, without resorting to copy/paste:
-- Form the triangles (a, b, c) in a graph. WITH -- symmetrize directed edges symm (a, b) AS ( SELECT a, b FROM edges UNION SELECT b, a FROM edges ), -- use `symm` to find length-two paths. path2 (a, b, c) AS ( SELECT DISTINCT e1.a, e1.b, e2.b as c FROM symm e1, symm e2 WHERE e1.b = e2.a ) -- Produce triples (a, b, c) where symm(a, c) and path2(a, b, c) exist. SELECT DISTINCT path2.a, path2.b, path2.c FROM path2, symm WHERE path2.a = symm.a AND path2.c = symm.b;
You can even use the bindings in subsequent expressions, as we did with
Excitingly, the SQL folks realized that something really neat happens if you allow a binding to refer to itself.
Hearts full of excitement (one imagines) they introduced the
RECURSIVE modifier that allows this.
WITH RECURSIVE -- symmetrize directed edges symm (a, b) AS ( SELECT a, b FROM edges UNION SELECT b, a FROM edges ), -- LOOK THIS IS RECURSIVE!!! reach (a, b) AS ( SELECT * FROM symm UNION SELECT symm.a, reach.b FROM symm, reach WHERE symm.b = reach.a ) SELECT * FROM reach;
This is the classic example of recursion that you see in languages like Datalog, and StackOverflow pages discussing
WITH RECURSIVE, but relatively rarely in actual SQL queries.
Why is that?
As it turns out,
WTIH RECURSIVE has a bevy of limitations and mysterious semantics (four pages of limitations in the version of the standard I have, and I still haven't found the semantics yet).
I certainly cannot enumerate, or even understand the full list, and will defer to the likes of @teggy to expound upon the issues.
@teggy does provide a worked example that encapsulates my confusion, that (in PostgreSQL at least)
mcsherry=# WITH RECURSIVE t(n) AS ( VALUES (1) UNION ALL ( WITH t AS (SELECT * FROM t) SELECT t1.n + t2.n AS n FROM t AS t1, t AS t2 WHERE t1.n < 256 ) ) SELECT * FROM t; n ----- 1 2 4 8 16 32 64 128 256 (9 rows) mcsherry=#
There are so many things I don't understand here.
Why only powers of two rather than any of their sums?
Why no requirement that
t2.n be bounded?
Why isn't the result a fixed-point of the query that defines
The above is an example of "non-linear recursion" (
t is used twice in the recursive term), which is both defined and forbidden in the SQL standard.
Except that the SQL standard defines linear recursion to be a query that references the recursive term only once, which is a syntactic rather than semantic constraint.
They seemed to forget that this was in the part of the standard (
WITH clauses) used to rebind names.
So according to the SQL standard the above query should be accepted as "linear recursion", and just has the crazy-pants semantics of "evaluate as if linearly recursive".
Recursion in Materialize
Materialize doesn't support SQL's
WITH RECURSIVE and based on the complexity of the spec may never support it.
Instead, it supports what I (naively?) think is a simpler, and yet more expressive fragment.
I'm a bit worried that I don't understand thet rationale behind the complexity of
WITH RECURSIVE, and I both expect and will be delighted to have holes poked in what Materialize does instead.
WITH MUTUALLY RECURSIVE clause allows a sequence of bindings, each of which can reference any binding in their body, followed by a body that can also reference any binding.
WITH MUTUALLY RECURSIVE -- A sequence of bindings, all of which are in scope for all. name1 (col1a type1a, col1b type1b, ..) AS ( select_clause1 ), name2 (col2a type2a, col2b type2b, ..) AS ( select_clause2 ), ... body_select_clause
The results of the clause are as if you start each binding from an empty collection, then update the definition of each binding in sequence, repeating the list of all bindings until no changes remain, and then evaluate the body with these final bindings. The computation may never stop, in which case .. there is no result and your computer will probably be busy for a while determining that. But if it does stop, the configuration of bindings will be a fixed point, and the clause returns some query over that fixed point.
WITH RECURSIVE query above can also be expressed using
materialize=> WITH MUTUALLY RECURSIVE t (n int) AS ( VALUES (1) UNION ALL ( WITH t AS (SELECT * FROM t) SELECT DISTINCT t1.n + t2.n FROM t AS t1, t AS t2 WHERE t1.n < 256 AND t2.n < 256 ) ) SELECT * FROM t ORDER by n; n ----- 1 2 3 4 [...] 507 508 509 510 (510 rows) materialize=>
This produces what is in my opinion the expected fixed point of the query above: all values from 1 through 510. Rather than just the powers of two strictly less than 512. Which isn't even a fixed point of the update rule.
Let's discuss a few differences from SQL's
- We had to specify the type of the column of
numbers. We require this to make the SQL type resolution substantially easier, and not involve a recursive fixed-point problem when coercable types are used. I can imagine we could relax this in the future, bit it isn't meant to be the most important difference.
- We had to add the constraint
t2.n < 256. The absence of this constraint from the SQL version, and its termination nonetheless, still blows my mind. Of course you have to bound this, otherwise we would continue increasing
numbersthrough the contributions of
t2even with a bounded
- We had to type
MUTUALLY. We aren't implementing
WITH RECURSIVEcorrectly, so we have to call it something else. MySQL has a flag you can set to step away from SQL's semantics, but adding a new keyword seems easier for us at the moment.
The main other difference is in the limitations.
Whereas SQL has some four pages of restrictions, Materialize has none.
Put whatever query you want in the definition of a recursive thing.
Don't want to use a
UNION ALL? Don't.
Don't want to use linear recursion? Me neither!
Want to put another
WITH MUTUALLY RECURSIVE clause in definition of a binding? Go right ahead, you devious villain!
Materialize having no restrictions has the comic potential to be a massive dumpster fire once we learn the very important reasons why SQL introduced the constraints. However, it seems the best way to elicit that information is with this sort of post.
Is Recursion Really that Important?
Maybe not to you, maybe not to people you work with, or whose work you follow, and that is fine. But yes.
Recursion or iteration are fundamental to programming languages. Languages without them are hobbled in their expressive power. Languages with restricted implementations of them can prevent the description of efficient computation. Languages either without, or with only limited forms, prevent their users from applying the full force of computer science.
I spent a fair few years needling folks in the Big Data and Dabatases spaces, pitting my laptop against their large and powerful computers. The secret (shhh!) was that I had access to more computer science than they did. Differential dataflow could express algorithms that they could not (or did not, because of pain). Perhaps their systems could, with human effort, effect the same computation, but why use a system or language that makes computer science hard?
Example 1: Undirected Connectivity
Let's take a first example from the recent and readable A Fix for the Fixation on Fixpoints: undirected connectivity.
The algorithm they use is "label propagation": each graph node tracks the smallest identifier it knows of, starting with its own identifier and repeatedly consulting with its neighbors.
You can write this in SQL using
WITH RECURSIVE the same way we did
reach above, followed by a
MIN over the reachable nodes.
WITH RECURSIVE -- symmetrize directed edges symm (a, b) AS ( SELECT a, b FROM edges UNION SELECT b, a FROM edges ), -- LOOK THIS IS RECURSIVE!!! reach (a, b) AS ( SELECT * FROM symm UNION SELECT symm.a, reach.b FROM symm, reach WHERE symm.b = reach.a ) -- Report the smallest reachable node. SELECT a, MIN(b) FROM reach GROUP BY a
The paper observes that this query is frustrating because you cannot clearly communicate that as you develop
reach you can discard all but the smallest
b for each
You could rely on a sophisticated query optimizer to determine that it can push the
MIN into the recursive definition.
However, if you and that optimizer disagree on what passes for "sophisticated", you are out of luck.
The paper proposes a
WITH ITERATIVE construct that makes some different choices than we did, but it also allows you to communicate what data are not required.
In Materialize we can write label propagation as
WITH MUTUALLY RECURSIVE -- symmetrize edges symm (a int, b int) AS ( SELECT a, b FROM edges UNION SELECT b, a FROM edges ), -- iteratively improve all labels label (a int, comp int) AS ( SELECT a, MIN(comp) FROM ( SELECT a, a AS comp FROM symm UNION ALL SELECT symm.a, label.comp FROM symm, label WHERE symm.b = label.a ) GROUP BY a ) SELECT * FROM label;
You just describe how you should update
label each iteration, in this case by grouped by
a keeping the smallest
You don't need to end the definition with a
UNION especially if that isn't what you want
cc to have each iteration.
And indeed, in Materialize the memory footprint of this query will stay bounded as the iterations proceed.
A proponent of declarative languages might prefer the
WITH RECURSIVE version as "more declarative": you say what you want rather than how to get it.
A proponent of imperative languages might counter that at the end of the day someone has to implement this efficiently, and if you won't do it at least don't prevent me.
Fortunately, you can just write whichever you prefer.
Example 2: Dynamic Programming
A second example from A Fix for the Fixation on Fixpoints is the CYK algorithm for parsing context-free grammars. There they make the point that it (like other dynamic programming algorithms) are great examples where non-linear recursion is crucial.
-- Symbols, and literals each produces. CREATE TABLE grammar_terms (lhs int, lit int); -- Symbols, and two symbols each produces. CREATE TABLE grammar_nonts (lhs int, rhs1 int, rhs2 int); -- An input string with literals at positions. CREATE TABLE input (pos int, lit int); WITH MUTUALLY RECURSIVE -- Ranges `[lower, upper)` that can be produced by `symbol`. parses (lower int, upper int, symbol int) AS ( -- Base case: each literal is produced by some symbols. SELECT pos, pos+1, lhs FROM input, grammar_terms WHERE input.lit = grammar_terms.lit UNION -- Recursive case: two adjacent parses that follow the grammar. SELECT p1.lower, p2.upper, lhs FROM parses p1, parses p2, grammar_nonts WHERE p1.upper = p2.lower AND p1.symbol = grammar_nonts.rhs1 AND p2.symbol = grammar_nonts.rhs2 ) SELECT * FROM parses;
parses twice in the recursive branch, and it is important for correctness that we do so.
It sounds like the "Fix" authors think you can get SQL's
WITH RECURSIVE to implement this with some head-balancing, but neither they nor I think that is a good idea.
For bonus points, imagine you want to know how to parse the input, rather than only if it parses.
You'd have to tweak the query to add to
parses breadcrumb columns about how to find derivations for each
parses row, for example columns
rhs2 for the columns equated in the join.
However, you don't need to keep all the derivations for each row in
parses; one will suffice.
This is again a data reduction we could explain in the language, as with undirected connectivity, without which we risk a much less efficient implementation.
Example 3: Turing completeness
Turing completeness is the property of a language, framework, or system that it can simulate a Turing machine, the standard for "things a computer could possibly do". If your platform is Turing complete you can do all the things a computer can do, and if it is not Turing complete there is some class of things your platform just cannot do. This is usually worrying because if you end up needing to do any of those things, you are just out of luck.
Datalog, for example, is a recursion-friendly language that is not Turing complete.
SQL is Turing complete via
WITH RECURSIVE, but woe betide the casual person who needs to understand this (start reading here about cyclic tag systems).
Materialize is Turing complete via
WITH MUTUALLY RECURSIVE because you can just implement a Turing machine.
Let's implement a Turing machine!
We'll start with the configuration of the machine, its tape, and its transitions.
-- The head will hold the read position and machine state. CREATE TABLE initial_head (pos int, state int); CREATE TABLE initial_tape (pos int, symb int); -- Halting states are encoded by setting `motion` to zero and `new_symb` to `old_symb`. CREATE TABLE transitions (old_symb int, old_state int, new_symb int, new_state int, motion int);
If you want to try things, or see an example for the above, here are some inputs that accept input strings indicating the parity of their length.
-- Optionally, initial values that check parity of the input. INSERT INTO initial_head VALUES (0, 0); INSERT INTO initial_tape VALUES (0, 1), (1, 1), (2, 1), (3, 1), (4, 1); -- We are checking even or oddness of the input tape. INSERT INTO transitions VALUES (0, 0, 0, 0, 0), -- on a blank, halt (0, 1, 0, 1, 0), -- on a blank, halt (1, 0, 1, 1, 1), -- on a symbol, toggle state (1, 1, 1, 0, 1); -- on a symbel, toggle state
With these input tables, we can get the final
head position and state with the following query:
WITH MUTUALLY RECURSIVE -- Track the machine's head and state. head (pos int, state int) AS ( -- In the first round use `initial_head`; in later rounds use `head`. SELECT * FROM head UNION ALL SELECT * FROM initial_head EXCEPT ALL SELECT * FROM initial_head_delay -- Apply the movement of the machine UNION ALL SELECT new_pos, new_state FROM action EXCEPT ALL SELECT old_pos, old_state FROM action ), -- Track the tape's contents; absent positions are read as blank. tape (pos int, symb int) AS ( -- In the first round use `initial_tape`; in later rounds use `tape`. SELECT * FROM tape UNION ALL SELECT * FROM initial_tape EXCEPT ALL SELECT * FROM initial_tape_delay -- Apply the modification the head makes UNION ALL SELECT old_pos, new_symb FROM action EXCEPT ALL SELECT old_pos, old_symb FROM action ), -- Determine what sort of transition to take. action ( old_pos int, old_state int, old_symb int, new_pos int, new_state int, new_symb int ) AS ( WITH -- Read the symbol under the head from the tape. -- Rewrite absent tape locations as blanks (`0`). read (pos, state, symb) AS ( SELECT head.pos, head.state, CASE WHEN tape.symb IS NULL THEN 0 ELSE tape.symb END FROM head LEFT JOIN tape ON (head.pos = tape.pos) ) SELECT read.pos, read.state, read.symb, read.pos + t.motion, t.new_state, t.new_symb FROM read, transitions t WHERE read.symb = t.old_symb AND read.state = t.old_state ), -- Delayed versions of the input, to retract in the second iteration. initial_head_delay(pos int, state int) AS (SELECT * FROM initial_head), initial_tape_delay(pos int, symb int) AS (SELECT * FROM initial_tape) SELECT * FROM head;
There is an awkward
_delay idiom used to present input only in the first round, but otherwise the update rules are probably just what you'd write with
WITH RECURSIVE if you were allowed to.
It even keeps
tape indexed by
pos and takes time linear in the number of machine actions taken before it halts.
How cool is that?
Recursion is important, and doing recursion well is important. If recursion is too complicated or too confusing, you miss out on the opportunity to express valuable things about the intent of your query. That's a pity, because many useful tasks require artful use of recursion to work effectively.
Fortunately, we are well-positioned to make recursion delightful. You don't need to take thing away from SQL other than the restrictions on recursion.
Also go read A Fix for the Fixation on FixPoints.
WITH MUTUALLY RECURSIVE is currently experimental, and may never land.
It certainly isn't optimized yet (the optimizer literally nopes out of any optimizations other than some normalization).
If we learn amazing things about why
WITH RECURSIVE is the way that it is, we may need to change what we have done!
We also have a late-arriving proposal from Jamie Brandon to use
ACTUALLY instead of
If you have questions or ideas on recursion in Materialize, and how you could use it, please let us know in the community Slack! We'll be sure to share any updates on this idea there as well.