Top-K in group

Overview

The “Top-K in group” query pattern groups by some key and return the first K elements within each group according to some ordering.

Materialize and window functions

For window functions, when an input record in a partition (as determined by the PARTITION BY clause of your window function) is added/removed/changed, Materialize recomputes the results for the entire window partition. This means that when a new batch of input data arrives (that is, every second), the amount of computation performed is proportional to the total size of the touched partitions.

For example, assume that in a given second, 20 input records change, and these records belong to 10 different partitions, where the average size of each partition is 100. Then, amount of work to perform is proportional to computing the window function results for 10*100=1000 rows.

As a rule of thumb, if the total size of all touched window partitions is at most 1000000 rows per second, then the system should be able to keep up with the input data as it arrives. However, if your use case has higher performance requirements, consider rewriting your query to not use window functions. If your query cannot be rewritten without the window functions and the performance of window functions is insufficient for your use case, please contact our team.

Idiomatic Materialize SQL

For K >= 1

Idiomatic Materialize SQL: For Top-K queries where K >= 1, use a subquery to SELECT DISTINCT on the grouping key and perform a LATERAL join (by the grouping key) with another subquery that specifies the ordering and the limit K.

Idiomatic Materialize SQL

Use a subquery to SELECT DISTINCT on the grouping key (e.g., fieldA), and perform a LATERAL join (by the grouping key fieldA) with another subquery that specifies the ordering (e.g., fieldZ [ASC|DESC]) and the limit K.

SELECT fieldA, fieldB, ...
FROM (SELECT DISTINCT fieldA FROM tableA) grp,
     LATERAL (SELECT fieldB, ... , fieldZ FROM tableA
        WHERE fieldA = grp.fieldA
        ORDER BY fieldZ ... LIMIT K)   -- K is a number >= 1
ORDER BY fieldA, fieldZ ... ;
Anti-pattern

Avoid the use of ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) for Top-K queries.


-- Anti-pattern. Avoid. --
SELECT fieldA, fieldB, ...
FROM (
   SELECT fieldA, fieldB, ... , fieldZ,
      ROW_NUMBER() OVER (PARTITION BY fieldA
      ORDER BY fieldZ ... ) as rn
   FROM tableA)
WHERE rn <= K     -- K is a number >= 1
ORDER BY fieldA, fieldZ ...;

Query hints

To further improve the memory usage of the idiomatic Materialize SQL, you can specify a LIMIT INPUT GROUP SIZE query hint in the idiomatic Materialize SQL.

SELECT fieldA, fieldB, ...
FROM (SELECT DISTINCT fieldA FROM tableA) grp,
     LATERAL (SELECT fieldB, ... , fieldZ FROM tableA
        WHERE fieldA = grp.fieldA
        OPTIONS (LIMIT INPUT GROUP SIZE = ...)
        ORDER BY fieldZ ... LIMIT K)   -- K is a number >= 1
ORDER BY fieldA, fieldZ ... ;

For more information on setting LIMIT INPUT GROUP SIZE, see Optimization.

For K = 1

Idiomatic Materialize SQL: For K = 1, use a SELECT DISTINCT ON() on the grouping key (e.g., fieldA) and order the results first by the DISTINCT ON key and then the Top-K ordering key the (e.g., fieldA, fieldZ [ASC|DESC]).

Alternatively, you can also use the more general Top-K where K >= 1 pattern, specifying 1 as the limit.

Idiomatic Materialize SQL
SELECT DISTINCT ON(fieldA) fieldA, fieldB, ...
FROM tableA
ORDER BY fieldA, fieldZ ... ;
Anti-pattern

Avoid the use of ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) for Top-K queries.


-- Anti-pattern. Avoid. --
SELECT fieldA, fieldB, ...
FROM (
   SELECT fieldA, fieldB, ... , fieldZ,
      ROW_NUMBER() OVER (PARTITION BY fieldA
      ORDER BY fieldZ ... ) as rn
   FROM tableA)
WHERE rn = 1
ORDER BY fieldA, fieldZ ...;

Query hints

To further improve the memory usage of the idiomatic Materialize SQL, you can specify a DISTINCT ON INPUT GROUP SIZE query hint in the idiomatic Materialize SQL.

SELECT DISTINCT ON(fieldA) fieldA, fieldB, ...
FROM tableA
OPTIONS (DISTINCT ON INPUT GROUP SIZE = ...)
ORDER BY fieldA, fieldZ ... ;

For more information on setting DISTINCT ON INPUT GROUP SIZE, see Optimization.

Examples

NOTE: The example data can be found in the Appendix.

Select Top-3 items

Using idiomatic Materialize SQL, the following example finds the top 3 items (by descending subtotal) in each order. The example uses a subquery to SELECT DISTINCT on the grouping key (order_id), and performs a LATERAL join (by the grouping key) with another subquery that specifies the ordering (ORDER BY subtotal DESC) and limits its results to 3 (LIMIT 3).

Idiomatic Materialize SQL
SELECT order_id, item, subtotal
FROM (SELECT DISTINCT order_id FROM orders_view) grp,
     LATERAL (SELECT item, subtotal FROM orders_view
        WHERE order_id = grp.order_id
        ORDER BY subtotal DESC LIMIT 3)
ORDER BY order_id, subtotal DESC;
Anti-pattern

Avoid the use of ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) for Top-K queries.


-- Anti-pattern --
SELECT order_id, item, subtotal
FROM (
   SELECT order_id, item, subtotal,
      ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY subtotal DESC) as rn
   FROM orders_view)
WHERE rn <= 3
ORDER BY order_id, subtotal DESC;

Select Top-1 item

Using idiomatic Materialize SQL, the following example finds the top 1 item (by descending subtotal) in each order. The example uses a query to SELECT DISTINCT ON() on the grouping key (order_id) with an ORDER BY order_id, subtotal DESC (i.e., ordering first by the DISTINCT ON/grouping key, then the descending subtotal). 1

Idiomatic Materialize SQL
SELECT DISTINCT ON(order_id) order_id, item, subtotal
FROM orders_view
ORDER BY order_id, subtotal DESC;
Anti-pattern

Avoid the use of ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) for Top-K queries.


-- Anti-pattern --
SELECT order_id, item, subtotal
FROM (
   SELECT order_id, item, subtotal,
      ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY subtotal DESC) as rn
   FROM orders_view)
WHERE rn = 1
ORDER BY order_id, subtotal DESC;

See also


  1. Alternatively, you can also use the idiomatic Materialize SQL for the more general Top K query, specifying 1 as the limit. ↩︎

Back to top ↑