Top K by group
Top K using a LATERAL
subquery and LIMIT
Suppose you want to group rows in a table by some key, then filter out all but
the first K elements within each group according to some ordering. In other
databases, you might use window functions. In Materialize, we recommend using a
LATERAL
subquery. The general form of the
query looks like this:
SELECT * FROM
(SELECT DISTINCT key_col FROM tbl) grp,
LATERAL (
SELECT col1, col2... FROM tbl
WHERE key_col = grp.key_col
ORDER BY order_col LIMIT k
)
For example, suppose you have a relation containing the population of various U.S. cities.
CREATE TABLE cities (
name text NOT NULL,
state text NOT NULL,
pop int NOT NULL
);
INSERT INTO cities VALUES
('Los_Angeles', 'CA', 3979576),
('Phoenix', 'AZ', 1680992),
('Houston', 'TX', 2320268),
('San_Diego', 'CA', 1423851),
('San_Francisco', 'CA', 881549),
('New_York', 'NY', 8336817),
('Dallas', 'TX', 1343573),
('San_Antonio', 'TX', 1547253),
('San_Jose', 'CA', 1021795),
('Chicago', 'IL', 2695598),
('Austin', 'TX', 978908);
To fetch the three most populous cities in each state:
SELECT state, name FROM
(SELECT DISTINCT state FROM cities) grp,
LATERAL (
SELECT name FROM cities
WHERE state = grp.state
ORDER BY pop DESC LIMIT 3
);
AZ Phoenix
CA Los_Angeles
CA San_Diego
CA San_Jose
IL Chicago
NY New_York
TX Houston
TX San_Antonio
TX Dallas
Despite the verbosity of the above query, Materialize produces a straightforward plan:
EXPLAIN SELECT state, name FROM ...
Explained Query:
Project (#1, #0)
TopK group_by=[#1] order_by=[#2 desc nulls_first] limit=3
ReadStorage materialize.public.cities
Top 1 using DISTINCT ON
If K = 1, i.e., you would like to see only the most populous city in each state, another approach is to use DISTINCT ON
:
SELECT DISTINCT ON(state) state, name
FROM cities
ORDER BY state, pop DESC;
Note that the ORDER BY
clause should start with the expressions that are in the DISTINCT ON
.
Group size hints
When using either the above LATERAL
subquery pattern or DISTINCT ON
, we recommend
specifying query hints to improve memory usage. For example:
SELECT state, name FROM
(SELECT DISTINCT state FROM cities) grp,
LATERAL (
SELECT name FROM cities
WHERE state = grp.state
OPTIONS (LIMIT INPUT GROUP SIZE = 1000)
ORDER BY pop DESC LIMIT 3
);
or
SELECT DISTINCT ON(state) state, name
FROM cities
OPTIONS (DISTINCT ON INPUT GROUP SIZE = 1000)
ORDER BY state, pop DESC;
To learn more about how to set the values for the query hints above, see the Optimization page.