Like in any standard relational database, you can use indexes to optimize query performance in Materialize. Improvements can be significant, reducing some query times down to single-digit milliseconds.
Building an efficient index depends on the clauses used in your queries, as well as your expected access patterns. Use the following as a guide:
ORDER BY and
LIMIT clauses currently do not benefit from an index.
WHERE point lookups
Speed up a query involving a
WHERE clause with equality comparisons to literals (e.g.,
You can verify that Materialize is accessing the input by an index lookup using
EXPLAIN. Check for
lookup_value after the index name to confirm that an index lookup is happening, i.e., that Materialize is only reading the matching records from the index instead of scanning the entire index:
EXPLAIN SELECT * FROM foo WHERE x = 42 AND y = 'hello';
Optimized Plan ----------------------------------------------------------------------------- Explained Query (fast path): + Project (#0, #1) + ReadExistingIndex materialize.public.foo_x_y lookup_value=(42, "hello")+ + Used Indexes: + - materialize.public.foo_x_y (lookup) +
Matching multi-column indexes to multi-column
In general, your index key should exactly match the columns that are constrained in the
WHERE clause. In more detail:
WHEREclause constrains fewer fields than your index key includes, then the index will not be used for a lookup, but will be fully scanned. For example, an index on
(x, y)cannot be used to execute
WHERE x = 7as a point lookup.
- If the
WHEREclause constrains more fields than your index key includes, then the index might still provide some speedup, but it won’t necessarily be optimal: In this case, the index lookup is performed using only those constraints that are included in the index key, and the rest of the constraints will be used to subsequently filter the result of the index lookup.
ORis used and not all arguments constrain the same fields, create an index for the intersection of the constrained fields. For example, if you have
WHERE (x = 51 AND y = 'bbb') OR (x = 76 AND z = 9), create an index just on
ORis used and its arguments constrain completely disjoint sets of fields (e.g.
WHERE x = 5 OR y = 'aaa'), try to rewrite your query using a
UNION ALL), where each argument of the
UNIONhas one of the original
In general, you can improve the performance of your joins by creating indexes on the columns occurring in join keys. This comes at the cost of additional memory usage. Materialize’s in-memory arrangements (the internal data structure of indexes) allow the system to share indexes across queries: for multiple queries, an index is a fixed upfront cost with memory savings for each new query that uses it.
Let’s create a few tables to work through examples.
CREATE TABLE teachers (id INT, name TEXT); CREATE TABLE sections (id INT, teacher_id INT, course_id INT, schedule TEXT); CREATE TABLE courses (id INT, name TEXT);
Multiple Queries Join On the Same Collection
Let’s consider two queries that join on a common collection. The idea is to create an index that can be shared across the two queries to save memory.
Here is a query where we join a collection
teachers to a collection
sections to see the name of the teacher, schedule, and course ID for a specific section of a course.
SELECT t.name, s.schedule, s.course_id FROM teachers t INNER JOIN sections s ON t.id = s.teacher_id;
Here is another query that also joins on
teachers.id. This one counts the number of sections each teacher teaches.
SELECT t.id, t.name, count(*) FROM teachers t INNER JOIN sections s ON t.id = s.teacher_id GROUP BY t.id, t.name;
We can eliminate redundant memory usage for these two queries by creating an index on the common column being joined,
CREATE INDEX pk_teachers ON teachers (id);
Joins with Filters
If your query filters one or more of the join inputs by a literal equality (e.g.,
WHERE t.name = 'Escalante'), place one of those input collections first in the
FROM clause. In particular, this can speed up ad hoc
SELECT queries by accessing collections using index lookups rather than full scans.
Note that when the same input is being used in a join as well as being constrained by equalities to literals, either the join or the literal equalities can be sped up by an index (possibly the same index, but usually different indexes). Which of these will perform better depends on the characteristics of your data. For example, the following query can make use of either of the following two indexes, but not both at the same time:
teachers(name)to perform the
t.name = 'Escalante'point lookup before the join,
teachers(id)to speed up the join and then perform the
WHERE t.name = 'Escalante'.
SELECT t.name, s.schedule, s.course_id FROM teachers t INNER JOIN sections s ON t.id = s.teacher_id WHERE t.name = 'Escalante';
In this case, the index on
teachers(name) might work better, as the
WHERE t.name = 'Escalante' can filter out a very large percentage of the
teachers table before the table is fed to the join. You can see an example
EXPLAIN command output for the above query here.
Optimize Multi-Way Joins with Delta Joins
Materialize has access to a join execution strategy we call
DeltaQuery, a.k.a. delta joins, that aggressively re-uses indexes and maintains no intermediate results. Materialize considers this plan only if all the necessary indexes already exist, in which case the additional memory cost of the join is zero. This is typically possible when you index all the join keys.
From the previous example, add the name of the course rather than just the course ID.
CREATE VIEW course_schedule AS SELECT t.name AS teacher_name, s.schedule, c.name AS course_name FROM teachers t INNER JOIN sections s ON t.id = s.teacher_id INNER JOIN courses c ON c.id = s.course_id;
In this case, we create indexes on the join keys to optimize the query:
CREATE INDEX pk_teachers ON teachers (id); CREATE INDEX sections_fk_teachers ON sections (teacher_id); CREATE INDEX pk_courses ON courses (id); CREATE INDEX sections_fk_courses ON sections (course_id);
EXPLAIN SELECT * FROM course_schedule;
Optimized Plan Explained Query: Project (#1, #5, #7) Filter (#0) IS NOT NULL AND (#4) IS NOT NULL Join on=(#0 = #3 AND #4 = #6) type=delta <---------- Delta join ArrangeBy keys=[[#0]] ReadIndex on=teachers pk_teachers=[delta join 1st input (full scan)] ArrangeBy keys=[[#1], [#2]] ReadIndex on=sections sections_fk_teachers=[delta join lookup] sections_fk_courses=[delta join lookup] ArrangeBy keys=[[#0]] ReadIndex on=courses pk_courses=[delta join lookup] Used Indexes: - materialize.public.pk_teachers (delta join 1st input (full scan)) - materialize.public.sections_fk_teachers (delta join lookup) - materialize.public.pk_courses (delta join lookup) - materialize.public.sections_fk_courses (delta join lookup)
For ad hoc
SELECT queries with a delta join, place the smallest input (taking into account predicates that filter from it) first in the
Further Optimize with Late Materialization
Materialize can further optimize memory usage when joining collections with primary and foreign key constraints using a pattern known as late materialization.
To understand late materialization, you need to know about primary and foreign keys. In our example, the
teachers.id column uniquely identifies all teachers. When a column or set of columns uniquely identifies each record, it is called a primary key. We also have
sections.teacher_id, which is not the primary key of
sections, but it does correspond to the primary key of
teachers. Whenever we have a column that is a primary key of another collection, it is called a foreign key.
In many relational databases, indexes don’t replicate the entire collection of data. Rather, they maintain just a mapping from the indexed columns back to a primary key. These few columns can take substantially less space than the whole collection, and may also change less as various unrelated attributes are updated. This is called late materialization, and it is possible to achieve in Materialize as well. Here are the steps to implementing late materialization along with examples.
Create indexes on the primary key column(s) for your input collections.
CREATE INDEX pk_teachers ON teachers (id); CREATE INDEX pk_sections ON sections (id); CREATE INDEX pk_courses ON courses (id);
For each foreign key in the join, create a “narrow” view with just two columns: foreign key and primary key. Then create two indexes: one for the foreign key and one for the primary key. In our example, the two foreign keys are
sections.course_id, so we do the following:
-- Create a "narrow" view containing primary key sections.id -- and foreign key sections.teacher_id CREATE VIEW sections_narrow_teachers AS SELECT id, teacher_id FROM sections; -- Create indexes on those columns CREATE INDEX sections_narrow_teachers_0 ON sections_narrow_teachers (id); CREATE INDEX sections_narrow_teachers_1 ON sections_narrow_teachers (teacher_id);
-- Create a "narrow" view containing primary key sections.id -- and foreign key sections.course_id CREATE VIEW sections_narrow_courses AS SELECT id, course_id FROM sections; -- Create indexes on those columns CREATE INDEX sections_narrow_courses_0 ON sections_narrow_courses (id); CREATE INDEX sections_narrow_courses_1 ON sections_narrow_courses (course_id);NOTE: In this case, because both foreign keys are in
sections, we could have gotten away with one narrow collection
sections_narrow_teachers_and_courseswith indexes on
course_id. In general, we won’t be so lucky to have all the foreign keys in the same collection, so we’ve shown the more general pattern of creating a narrow view and two indexes for each foreign key.
Rewrite your query to use your narrow collections in the join conditions. Example:
SELECT t.name AS teacher_name, s.schedule, c.name AS course_name FROM sections_narrow_teachers s_t INNER JOIN sections s ON s_t.id = s.id INNER JOIN teachers t ON s_t.teacher_id = t.id INNER JOIN sections_narrow_courses s_c ON s_c.id = s.id INNER JOIN courses c ON s_c.course_id = c.id;
Create a default index when there is no particular
JOIN clause that would fit the above cases. This can still speed up your query by reading the input from memory.
EXPLAIN to verify index usage
EXPLAIN to verify that indexes are used as you expect. For example:
CREATE TABLE teachers (id INT, name TEXT); CREATE TABLE sections (id INT, teacher_id INT, course_id INT, schedule TEXT); CREATE TABLE courses (id INT, name TEXT); CREATE INDEX pk_teachers ON teachers (id); CREATE INDEX teachers_name ON teachers (name); CREATE INDEX sections_fk_teachers ON sections (teacher_id); CREATE INDEX pk_courses ON courses (id); CREATE INDEX sections_fk_courses ON sections (course_id); EXPLAIN SELECT t.name AS teacher_name, s.schedule, c.name AS course_name FROM teachers t INNER JOIN sections s ON t.id = s.teacher_id INNER JOIN courses c ON c.id = s.course_id WHERE t.name = 'Escalante';
Optimized Plan ------------------------------------------------------------------------------------------------------------------ Explained Query: + Project (#1, #6, #8) + Filter (#0) IS NOT NULL AND (#5) IS NOT NULL + Join on=(#0 = #4 AND #5 = #7) type=delta + ArrangeBy keys=[[#0]] + ReadIndex on=materialize.public.teachers teachers_name=[lookup value=("Escalante")] + ArrangeBy keys=[[#1], [#2]] + ReadIndex on=sections sections_fk_teachers=[delta join lookup] sections_fk_courses=[delta join lookup]+ ArrangeBy keys=[[#0]] + ReadIndex on=courses pk_courses=[delta join lookup] + + Used Indexes: + - materialize.public.teachers_name (lookup) + - materialize.public.sections_fk_teachers (delta join lookup) + - materialize.public.pk_courses (delta join lookup) + - materialize.public.sections_fk_courses (delta join lookup) +
You can see in the above
EXPLAIN printout that the system will use
teachers_name for a point lookup, and use three other indexes for the execution of the delta join. Note that the
pk_teachers index is not used, as explained above.
The following are the possible index usage types:
*** full scan ***: Materialize will read the entire index.
lookup: Materialize will look up only specific keys in the index.
differential join: Materialize will use the index to perform a differential join. For a differential join between two relations, the amount of memory required is proportional to the sum of the sizes of each of the input relations that are not indexed. In other words, if an input is already indexed, then the size of that input won’t affect the memory usage of a differential join between two relations. For a join between more than two relations, we recommend aiming for a delta join instead of a differential join, as explained above. A differential join between more than two relations will perform a series of binary differential joins on top of each other, and each of these binary joins (except the first one) will use memory proportional to the size of the intermediate data that is fed into the join.
delta join 1st input (full scan): Materialize will use the index for the first input of a delta join. Note that the first input of a delta join is always fully scanned. However, executing the join won’t require additional memory if the input is indexed.
delta join lookup: Materialize will use the index for a non-first input of a delta join. This means that, in an ad hoc query, the join will perform only lookups into the index.
fast path limit: When a fast path query has a
LIMITclause but no
ORDER BYclause, then Materialize will read from the index only as many records as required to satisfy the
Materialize has at present three important query hints:
AGGREGATE INPUT GROUP SIZE,
DISTINCT ON INPUT GROUP SIZE, and
LIMIT INPUT GROUP SIZE. These hints apply to indexed or materialized views that need to incrementally maintain
MAX, or Top K queries, as specified by SQL aggregations,
DISTINCT ON, or
LIMIT clauses. Maintaining these queries while delivering low latency result updates is demanding in terms of main memory. This is because Materialize builds a hierarchy of aggregations so that data can be physically partitioned into small groups. By having only small groups at each level of the hierarchy, we can make sure that recomputing aggregations is not slowed down by skew in the sizes of the original query groups.
The number of levels needed in the hierarchical scheme is by default set assuming that there may be large query groups in the input data. By specifying the query hints, it is possible to refine this assumption, allowing Materialize to build a hierarchy with fewer levels and lower memory consumption without sacrificing update latency.
Consider the previous example with the collection
sections. Maintenance of the maximum
teacher can be achieved with a materialized view:
CREATE MATERIALIZED VIEW max_course_id_per_teacher AS SELECT teacher_id, MAX(course_id) FROM sections GROUP BY teacher_id;
If the largest number of
course_id values that are allocated to a single
teacher_id is known, then this number can be provided as the
AGGREGATE INPUT GROUP SIZE. For the query above, it is possible to get an estimate for this number by:
SELECT MAX(course_count) FROM ( SELECT teacher_id, COUNT(*) course_count FROM sections GROUP BY teacher_id );
However, the estimate is based only on data that is already present in the system. So taking into account how much this largest number could expand is critical to avoid issues with update latency after tuning the query hint.
For our example, let’s suppose that we determined the largest number of courses per teacher to be
1000. Then, the original definition of
max_course_id_per_teacher can be revised to include the
AGGREGATE INPUT GROUP SIZE query hint as follows:
CREATE MATERIALIZED VIEW max_course_id_per_teacher AS SELECT teacher_id, MAX(course_id) FROM sections GROUP BY teacher_id OPTIONS (AGGREGATE INPUT GROUP SIZE = 1000)
The other two hints can be provided in Top K query patterns specified by
DISTINCT ON or
LIMIT. As examples, consider that we wish not to compute the maximum
course_id, but rather the
id of the section of this top course. This computation can be incrementally maintained by the following materialized view:
CREATE MATERIALIZED VIEW section_of_top_course_per_teacher AS SELECT DISTINCT ON(teacher_id) teacher_id, id AS section_id FROM sections OPTIONS (DISTINCT ON INPUT GROUP SIZE = 1000) ORDER BY teacher_id ASC, course_id DESC;
In the above examples, we see that the query hints are always positioned in an
OPTIONS clause after a
GROUP BY clause, but before an
ORDER BY, as captured by the
SELECT syntax. However, in the case of Top K using a
LATERAL subquery and
LIMIT, it is important to note that the hint is specified in the subquery. For instance, the following materialized view illustrates how to incrementally maintain the top-3 section
ids ranked by
course_id for each teacher:
CREATE MATERIALIZED VIEW sections_of_top_3_courses_per_teacher AS SELECT id AS teacher_id, section_id FROM teachers grp, LATERAL (SELECT id AS section_id FROM sections WHERE teacher_id = grp.id OPTIONS (LIMIT INPUT GROUP SIZE = 1000) ORDER BY course_id DESC LIMIT 3);
For indexed and materialized views that have already been created without specifying query hints, Materialize includes an introspection view,
mz_internal.mz_expected_group_size_advice, that can be used to query, for a given cluster, all incrementally maintained dataflows where tuning of the above query hints could be beneficial. The introspection view also provides an advice value based on an estimate of how many levels could be cut from the hierarchy. The following query illustrates how to access this introspection view:
SELECT dataflow_name, region_name, levels, to_cut, hint FROM mz_internal.mz_expected_group_size_advice ORDER BY dataflow_name, region_name;
hint provides the estimated value to be provided to the
AGGREGATE INPUT GROUP SIZE in the case of a
MAX aggregation or to the
DISTINCT ON INPUT GROUP SIZE or
LIMIT INPUT GROUP SIZE in the case of a Top K pattern.
Check out the blog post Delta Joins and Late Materialization to go deeper on join optimization in Materialize.