For context, Materialize is a streaming database that maintains the results of a query (defined as a materialized view) as the data changes. Whereas a traditional database does the evaluation when a
SELECT statement is issued, Materialize asks for queries upfront and computes the results on the fly as new data arrives.
Despite that fundamental difference, given the same data and SQL, users expect Materialize to produce the exact same results as PostgreSQL. So correctness is of extreme importance to Materialize as it hopes to replace existing traditional databases and enable new processing pipelines. Performance is also front and center as users would hate it if a query that previously took a millisecond starts taking an hour to execute.
SQL engines and optimizers are special beasts to test, with a significant body of academic literature and even a dedicated conference workshop. SQL is a declarative language that allows the user to specify an arbitrary query and the database needs to execute it and return correct results in reasonable time. At the same time, the space of all possible SQL statements is infinite and there is no hope ever to cover it all.
In tackling such a testing task, we can not constrain ourselves to the requirements, the SQL standard, or user use cases, as none of those starting points will allow us to cover the complete picture of the interactions between individual SQL features and optimizations and the rest of the optimizations and elements of the language. Even if an optimization is being created to address the needs of a particular customer, it still has the potential to affect many more queries than just those that the customer would like to see improved. A little creativity and some tamed combinatorial explosion is in order.
So, how do they do it?
Given that the infinite problem space, we want to reduce the problem somehow in order to focus on the task at hand. To make the endeavour tractable, we break it down into the following three components:
- Pick a query or fragment thereof that has the potential to exercise the code that we are testing;
- Decide on the inputs or data we are going to feed in;
- Select the various contexts in which the query fragment can legitimately appear;
Making good choices about #1 and #2 will decide how well we will exercise the code of the feature itself, including any possible corner cases. Selecting appropriate values for #2 and #3 will cause the feature to be exercised in conjunction with other optimizations.
A practical example
Let’s look at a hypothetical situation where we are testing an optimization of the SQL construct
GROUP BY ... ORDER BY ... LIMIT. We would tackle the problem as follows:
1. The choice of a query fragment
In this task, we have the general shape of the query cut out for us -- we are interested in the following pattern:
SELECT ... FROM data_source GROUP BY grouping_clause ORDER BY order_clause LIMIT limit_value
Within this template, we need to flesh out all the lowercase parts in order to get to all-uppercase SQL queries that are runnable.
2. The data source
In this case, it is the relation that we will be grouping on. For brevity, we will show only outer joins and derived tables as sources, as evaluating and optimizing those is a complex and frequently error-prone business. We will separate the alternatives we are exploring with the
| character, in the style of YACC grammars:
source: t1 LEFT JOIN t2 ON (col1) | (SELECT * FROM t1) AS a1
t2 themselves can be base tables, materialized views, or other types of sources, such as Kafka topics and S3 buckets.
3. The rest of the clauses in the query
We would also want to pick representative values for the
GROUP BY and
ORDER BY clauses and for the
GROUP BY side, we assume it matters if we are grouping by a single column, multiple columns, or some sort of an expression. We can formalize that as:
grouping_clause: col1 | col1, col2 | col1 + col2
ORDER BY, we can use column numbers to avoid repeating ourselves:
order_clause: 1 | 2 | 1 , 2
LIMIT side, we should make sure we are testing the important special-cases of
LIMIT 0 and
LIMIT 1 as well as some larger
LIMIT values (both smaller and larger than the expected size of the source relation):
limit_value: 0 | 1 | 10 | 1000000000
4. The context
Now that we have defined alternatives for all the constituent parts of the query fragment, we need to pick the contexts in which the entire fragment would appear. There are quite a few places where SQL allows you to put a complete SELECT statement, so for brevity we will only list a couple (a
UNION and an uncorrelated subquery):
query: select UNION ALL select | SELECT * FROM t1 WHERE f1 IN ( select )
This way, little by little, we have formalized the space we would like to test.
So now let's dive into the ocean head-first. Here is what we need to do to get to a properly tested feature:
We manually create interesting and representative queries from the space, to be added to our regression tests. For each query, we record the query plan and the expected result. We visually examine the query plans to make sure that the desired optimization is kicking in when expected and the query plan is sound.
Here is a typical query we could manually create using our systematic approach that can be added to a functional test suite:
EXPLAIN SELECT * FROM ( SELECT * FROM t1 WHERE col1 IN ( SELECT col1 + col2 FROM t2 GROUP BY col1 + col2 ORDER BY 1 LIMIT 1 ) ) AS a1 WHERE col1 > 1;
Thankfully, we do not need to calculate the expected result of each query using a pen and a pencil! We obtain it from a trusted oracle, which in our case is Postgres.
We feed the SQL grammar we have produced into a tool that generates thousands of pseudo-random queries from the desired space. Those queries are then executed against two revisions of the product, or against two database products, to detect any regressions in correctness or performance. The tool reports any queries where the results or the query plans are different, applies automatic test case simplification, and presents a report.
Here is a typical query that would be generated by a tool-based approach:
SELECT a2.f1 + a1.f2 AS c1, a1.f2 AS c2, 6 AS c3, avg(a1.f2) AS agg1, avg(a1.f1) AS agg2 FROM t2 AS a1 LEFT JOIN ( SELECT count(a2.f1 + 9) AS f1, max(a2.f2 + a2.f1) AS f2 FROM t2 AS a1 JOIN t1 AS a2 ON (a1.f2 < a1.f2) WHERE NOT EXISTS ( SELECT a1.f1 AS c1, a1.f2 + a2.f1 AS c2, a1.f2 AS c3, count(a1.f1 + a1.f1) AS agg1, count(a2.f2 + a2.f1) AS agg2 FROM ( SELECT avg(a1.f1) AS f1, count(a1.f1 + a2.f1) AS f2 FROM t1 AS a1 LEFT JOIN t1 AS a2 ON (NOT (NOT (a1.f1 + a1.f2 IS NULL))) WHERE 6 = a2.f1 AND 6 IN ( 4, 3 ) AND a2.f2 IS NULL AND 5 = a1.f1 ) AS a1 RIGHT JOIN t2 AS a2 ON (a2.f2 + a1.f2 IS NULL) WHERE 3 > a2.f2 + 6 AND a2.f2 IS NULL AND 1 NOT IN ( 4, 8 ) GROUP BY 1, 2, 3 ORDER BY 1 LIMIT 1 ) AND a1.f2 + a2.f2 > a2.f1 + a2.f2 + 6 ) AS a2 ON (a2.f2 + 9 IS NOT NULL) WHERE a1.f2 IS NOT NULL AND a2.f2 + a1.f1 = 7 GROUP BY 1, 2, 3;
(pretty-printing courtesy of Matt Jibson's mzfmt tool)
To cover all bases, we need both types of queries in our testing. The simpler, hand-crafted queries can be reasoned about and checked for correctness visually. The more complex ones may not comprehensible to a human, but probe for corner-cases that we can not hope to think of ourselves.
Ideally we would want the developers to be comfortable working with and maintaining any tests that we add, so it is an ongoing effort to put the test cases in a familiar framework, such as the SQL Logic Test and keep the total test case count and execution time to an acceptable level.
To achieve that, we intend to apply various machine learning algorithms to distill the multitudes of random queries one could quickly generate to a much smaller representative sample that is quicker to run and maintain and yet provides sufficient test coverage. More on that in a future episode.
The big picture
Those examples barely scratch the surface of the bottomless ocean that is the testing of query optimization. And yet, query optimization is just one component of what a complete database is, especially in the case of a product such as Materialize that deals with concurrency, streaming, and multiple sources of data.
For all of its soul-crushing complexity, testing and finding bugs in databases brings such primal joy to the practitioner that I assume was last experienced by the participants of the piano smashing competitions of old. If you think this sounds fun, our piano is particularly ornate and we are hiring in QA/Automation and across the company.
[Messrs. Lawrence Eades, Reg Clegg and Enoch Hinchliffe. the Highgate Colliery team and the ‘champs,’ surrounded by children collecting bits of piano for their bonfires after the piano smashing contest on Saturday at the Halfway Hotel, Highgate. South Yorkshire Times, November 11, 1967]