Percentile calculation
Percentiles are a useful statistic for understanding and interpreting the distribution of data. This pattern covers how to use histograms to efficiently calculate percentiles in Materialize.
One way to compute percentiles is to order all values and pick the value at the position of the corresponding percentile. However, this approach requires storing all values, causing memory to grow linearly with the number of tracked values.
Instead, two more memory-efficient alternatives are histograms and High Dynamic Range (HDR) histograms:
- Histograms have a memory footprint that is linear in the number of unique values and can compute precise percentiles. However, for domains with high cardinality, calculating precise percentiles may be computationally expensive.
- HDR histograms further reduce the memory footprint but compute approximate percentiles. Depending on the precision needed for the percentiles, HDR histograms may be preferred for domains with a high cardinality and dynamic range of values.
Using histograms to compute exact percentiles
Histograms summarize data sets by grouping values into ranges and counting how many elements fall into each range. From this summary, you can derive percentile information by identifying the range where the cumulative count crosses the desired percentile threshold. By grouping each distinct value into its own range, you can get exact percentiles; however, this can be computationally expensive if there is a large number of distinct values. Alternatively, you can get approximate percentiles by using HDR histograms.
To use histograms to compute exact percentiles:
- First, create a histogram view that groups each distinct value into its own bucket and counts the occurrences of each distinct value.
- Then, using a cross join on the histogram view, create a distribution view that calculates the cumulative density for a bucket by dividing the cumulative count (the sum of the counts for all buckets up to and including that bucket) by the total count.
NOTE: The cross join produces a number of intermediate results that is quadratic in the size of the input. While the final results are only linear in size, the view may take a disproportionate amount of time to produce and maintain.
Example
- Create a table `input`:

  ```sql
  CREATE TABLE input (value BIGINT);
  ```
- Insert the values `1` to `10` into the `input` table:

  ```sql
  INSERT INTO input SELECT n FROM generate_series(1,10) AS n;
  ```
- Create a `histogram` view that tracks the unique values from the `input` table and their counts:

  ```sql
  CREATE VIEW histogram AS
  SELECT value AS bucket, count(*) AS count_of_bucket_values
  FROM input
  GROUP BY value;
  ```
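  As a quick check, you can inspect the view contents. With the ten values above, each value should land in its own bucket with a count of 1:

  ```sql
  SELECT * FROM histogram ORDER BY bucket;
  ```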
- Create a `distribution` view that calculates the cumulative count and the cumulative density for each bucket. The cumulative density is calculated by dividing the cumulative count for a bucket (i.e., the count for all bucket values up to and including that bucket) by the total count:

  ```sql
  CREATE VIEW distribution AS
  SELECT
    h.bucket,
    h.count_of_bucket_values,
    sum(g.count_of_bucket_values) AS cumulative_count,
    sum(g.count_of_bucket_values) / (SELECT sum(count_of_bucket_values) FROM histogram) AS cumulative_density
  FROM histogram g, histogram h
  WHERE g.bucket <= h.bucket
  GROUP BY h.bucket, h.count_of_bucket_values
  ORDER BY cumulative_density;
  ```
- You can then query `distribution` on the `cumulative_density` field to return specific percentiles. For example, the following query returns the 90th percentile:

  ```sql
  SELECT bucket AS percentile90
  FROM distribution
  WHERE cumulative_density >= 0.9
  ORDER BY cumulative_density
  LIMIT 1;
  ```
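  With the ten values inserted above, the first bucket whose cumulative density reaches 0.9 is 9 (cumulative count 9 of 10), so the query should return:

  ```
   percentile90
  --------------
              9
  (1 row)
  ```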
Using HDR histograms to compute approximate percentiles
HDR histograms can be used to approximate percentiles in a space-efficient manner that scales well even for large domains with many distinct values. HDR histograms reduce the precision of the values they track and use buckets of variable width: buckets close to 0 are narrow, whereas buckets far away from 0 are wider. This works particularly well for data that exhibits a long tail of large values, e.g., latency measurements.
HDR histograms are related to how floating-point numbers are represented as integers. The underlying assumption is that smaller numbers require higher precision to be distinguishable (e.g., 5 ms and 6 ms are different and should be in different buckets), whereas larger numbers can be rounded more aggressively because their relative error becomes less relevant (e.g., 10000 ms and 10001 ms are almost the same and can reside in the same bucket).
In the example below, to reduce the number of buckets, each value is first decomposed into `significand * 2^exponent` and then, with the precision of the significand lowered, reconstructed into the respective bucket value (a worked sketch follows the list below).
- With higher precision, fewer items are kept in the same bucket and thus more memory is required, but the approximate percentile becomes more precise.
- With lower precision, more items are kept in the same bucket and thus less memory is required, but the approximate percentile becomes less precise.
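To make the bucket calculation concrete, here is a small sketch that decomposes an arbitrarily chosen value, 1000, with the 4 bits of precision used in the example below. The value 1000 has exponent 9 (since 2^9 = 512 ≤ 1000 < 1024) and significand 1.953125; truncating the significand to multiples of 1/16 yields 1.9375, which reconstructs to the bucket 992:

```sql
-- Decompose 1000 into significand * 2^exponent, truncate the significand
-- to 4 fractional bits (multiples of 1/16), and reconstruct the bucket.
SELECT
  trunc(log(2, 1000))::int AS exponent,                                          -- 9
  1000 / pow(2.0, 9) AS significand,                                             -- 1.953125
  trunc(1000 / pow(2.0, 9) * pow(2.0, 4)) / pow(2.0, 4) * pow(2.0, 9) AS bucket; -- 992
```

All values in the range [992, 1024) map to this same bucket, since at exponent 9 the bucket width is 2^(9-4) = 32.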
Except for the bucket calculation, the basic ideas of using histograms to compute percentiles remain the same for HDR histograms.
Example
This example reuses the `input` table from the Using histograms to compute exact percentiles example. If you have already created and populated the table, skip the corresponding steps.
- Create a table `input`:

  ```sql
  CREATE TABLE input (value BIGINT);
  ```
- Insert the values `1` to `10` into the `input` table:

  ```sql
  INSERT INTO input SELECT n FROM generate_series(1,10) AS n;
  ```
- Create an `hdr_histogram` view. To reduce the number of buckets, each value is first decomposed into `significand * 2^exponent`; the precision of the significand is then reduced to 4 bits (rounding it down to the nearest multiple of 1/16), and the value is reconstructed as an approximated bucket value:

  ```sql
  CREATE VIEW hdr_histogram AS
  WITH input_parts AS (
    SELECT
      CASE WHEN value = 0 THEN NULL
           ELSE trunc(log(2, abs(value)))::int
      END AS exponent,
      CASE WHEN value = 0 THEN NULL
           ELSE value / pow(2.0, trunc(log(2, abs(value)))::int)
      END AS significand
    FROM input
  ),
  buckets AS (
    -- reduce precision by 4 bits to round down the value to the nearest multiple of 1/16
    SELECT trunc(significand * pow(2.0, 4)) / pow(2.0, 4) * pow(2.0, exponent) AS bucket
    FROM input_parts
  )
  SELECT COALESCE(bucket, 0) AS bucket, count(*) AS count_of_bucket_values
  FROM buckets
  GROUP BY bucket;
  ```
  Alternatively, if you are running these statements through `psql`, you can set the precision as a variable:

  ```sql
  -- precision for the representation of the significand in bits
  \set precision 4

  CREATE VIEW hdr_histogram AS
  WITH input_parts AS (
    SELECT
      CASE WHEN value = 0 THEN NULL
           ELSE trunc(log(2, abs(value)))::int
      END AS exponent,
      CASE WHEN value = 0 THEN NULL
           ELSE value / pow(2.0, trunc(log(2, abs(value)))::int)
      END AS significand
    FROM input
  ),
  buckets AS (
    -- reduce precision by :precision bits to round down the value
    SELECT trunc(significand * pow(2.0, :precision)) / pow(2.0, :precision) * pow(2.0, exponent) AS bucket
    FROM input_parts
  )
  SELECT COALESCE(bucket, 0) AS bucket, count(*) AS count_of_bucket_values
  FROM buckets
  GROUP BY bucket;
  ```
- Create an `hdr_distribution` view that calculates the cumulative count and the cumulative density for each bucket. The cumulative density is calculated by dividing the cumulative count for a bucket (i.e., the count for all bucket values up to and including that bucket) by the total count:

  ```sql
  CREATE VIEW hdr_distribution AS
  SELECT
    h.bucket,
    h.count_of_bucket_values,
    sum(g.count_of_bucket_values) AS cumulative_count,
    sum(g.count_of_bucket_values) / (SELECT sum(count_of_bucket_values) FROM hdr_histogram) AS cumulative_density
  FROM hdr_histogram g, hdr_histogram h
  WHERE g.bucket <= h.bucket
  GROUP BY h.bucket, h.count_of_bucket_values;
  ```
- You can then query `hdr_distribution` on the `cumulative_density` field to return approximate percentiles. More precisely, the query returns the lower bound for the percentile (the next larger bucket represents the upper bound). For example, the following query returns the lower bound for the 90th percentile:

  ```sql
  SELECT bucket AS approximate_percentile
  FROM hdr_distribution
  WHERE cumulative_density >= 0.9
  ORDER BY cumulative_density
  LIMIT 1;
  ```
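  With only the ten values currently in `input`, every value sits in its own bucket (as shown in the next section), so this query should return the same result as the exact histogram:

  ```
   approximate_percentile
  ------------------------
                        9
  (1 row)
  ```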
HDR histograms and approximate values
For small numbers, `distribution` and `hdr_distribution` are identical. Even in `hdr_distribution`, all numbers from 1 to 10 are stored in their own buckets. To verify, query `hdr_distribution`:

```sql
SELECT * FROM hdr_distribution;
```
The query returns the following:
```
 bucket | count_of_bucket_values | cumulative_count | cumulative_density
--------+------------------------+------------------+--------------------
      1 |                      1 |                1 | 0.1
      2 |                      1 |                2 | 0.2
      3 |                      1 |                3 | 0.3
      4 |                      1 |                4 | 0.4
      5 |                      1 |                5 | 0.5
      6 |                      1 |                6 | 0.6
      7 |                      1 |                7 | 0.7
      8 |                      1 |                8 | 0.8
      9 |                      1 |                9 | 0.9
     10 |                      1 |               10 | 1
(10 rows)
```
But if values grow larger, buckets can contain more than one value. Let's see what happens when more values are added to the `input` table:

```sql
INSERT INTO input SELECT n FROM generate_series(11,10001) AS n;
```
Unlike the `distribution` view (used in the exact histogram approach), where each bucket contains only a single value and the view has 10001 rows, a single bucket in `hdr_distribution` can represent up to 512 distinct values, and the view has only 163 rows:

```sql
SELECT * FROM hdr_distribution ORDER BY cumulative_density;
```
The query returns the following:
```
 bucket | count_of_bucket_values | cumulative_count | cumulative_density
--------+------------------------+------------------+-------------------------------------------
      1 |                      1 |                1 | 0.00000999990000099999000009999900001
      2 |                      1 |                2 | 0.00001999980000199998000019999800002
      3 |                      1 |                3 | 0.00002999970000299997000029999700003
      4 |                      1 |                4 | 0.00003999960000399996000039999600004
      5 |                      1 |                5 | 0.00004999950000499995000049999500005
...skipping...
   7424 |                    256 |             7679 | 0.767823217678232176782321767823217678232
   7680 |                    256 |             7935 | 0.793420657934206579342065793420657934207
   7936 |                    256 |             8191 | 0.819018098190180981901809819018098190181
   8192 |                    512 |             8703 | 0.87021297870212978702129787021297870213
   8704 |                    512 |             9215 | 0.921407859214078592140785921407859214079
   9216 |                    512 |             9727 | 0.972602739726027397260273972602739726027
   9728 |                    274 |            10001 | 1
(163 rows)
```
When querying `hdr_distribution` for the 90th percentile value:

```sql
SELECT bucket AS approximate_percentile
FROM hdr_distribution
WHERE cumulative_density >= 0.9
ORDER BY cumulative_density
LIMIT 1;
```
The query returns an approximate percentile of 8704 (or, more precisely, between 8704 and 9216), whereas the precise percentile is 9001:

```
 approximate_percentile
------------------------
                   8704
(1 row)
```
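To confirm the exact value, you can run the analogous query against the `distribution` view from the first example (assuming it is defined over the same `input` table). Since 9001/10001 is the first cumulative density to reach 0.9, it should return `9001`:

```sql
SELECT bucket AS percentile90
FROM distribution
WHERE cumulative_density >= 0.9
ORDER BY cumulative_density
LIMIT 1;
```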
The precision of the approximation can be adapted by changing the precision in the definition of `hdr_histogram`. The higher the precision, the fewer items are kept in the same bucket and therefore the more precise the approximate percentile becomes. The lower the precision, the more items are kept in the same bucket and therefore the less memory is required.
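For example, here is a hypothetical variant of the view (named `hdr_histogram_7bit` for illustration) that uses 7 bits of precision instead of 4, rounding significands down to the nearest multiple of 1/128. For values around 9001 (exponent 13), each bucket then spans 2^(13-7) = 64 values instead of 512, so the approximate 90th percentile lands much closer to the exact value, at the cost of proportionally more buckets:

```sql
-- Same construction as hdr_histogram, but with 7 bits of precision.
CREATE VIEW hdr_histogram_7bit AS
WITH input_parts AS (
  SELECT
    CASE WHEN value = 0 THEN NULL
         ELSE trunc(log(2, abs(value)))::int
    END AS exponent,
    CASE WHEN value = 0 THEN NULL
         ELSE value / pow(2.0, trunc(log(2, abs(value)))::int)
    END AS significand
  FROM input
),
buckets AS (
  -- reduce precision by 7 bits: round down to the nearest multiple of 1/128
  SELECT trunc(significand * pow(2.0, 7)) / pow(2.0, 7) * pow(2.0, exponent) AS bucket
  FROM input_parts
)
SELECT COALESCE(bucket, 0) AS bucket, count(*) AS count_of_bucket_values
FROM buckets
GROUP BY bucket;
```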