# Percentile calculation

Percentiles are a useful statistic to understand and interpret data distribution. This pattern covers how to use histograms to efficiently calculate percentiles in Materialize.

A naive way to compute percentiles is to order all values and pick the value at the position of the corresponding percentile. This way of computing percentiles keeps all values around and therefore demands memory to grow linearly with the number of tracked values. Two better approaches to reduce the memory footprint are: histograms and High Dynamic Range (HDR) histograms.

Histograms have a lower memory footprint, linear to the number of unique values, and computes precise percentiles. HDR histograms further reduce the memory footprint but at the expense of computing approximate percentiles. They are particularly interesting if there is a long tail of large values that you want to track, which is often the case for latency measurements.

## Using histograms to compute percentiles

Histograms reduce the memory footprint by tracking a count for each unique value, in a `bucket`, instead of tracking all values. Given an `input`, define the histogram for `values` as follows:

``````CREATE VIEW histogram AS
SELECT
value AS bucket,
count(*) AS frequency
FROM input
GROUP BY value;
``````

To query percentiles from the view `histogram`, it’s no longer possible to just order the `values` and pick a value from the right spot. Instead, the distribution of values needs to be reconstructed. It is done by determining the cumulative count (the sum of counts up through each bucket that came before) for each `bucket`. This is accomplished through a cross-join in the following view:

``````CREATE VIEW distribution AS
SELECT
h.bucket,
h.frequency,
sum(g.frequency) AS cumulative_frequency,
sum(g.frequency) / (SELECT sum(frequency) FROM histogram) AS cumulative_distribution
FROM histogram g, histogram h
WHERE g.bucket <= h.bucket
GROUP BY h.bucket, h.frequency
ORDER BY cumulative_distribution;
``````

The cumulative count and the cumulative distribution can then be used to query for arbitrary percentiles. The following query returns the 90-th percentile.

``````SELECT bucket AS percentile
FROM distribution
WHERE cumulative_distribution >= 0.9
ORDER BY cumulative_distribution
LIMIT 1;
``````

To increase query performance, it can make sense to keep the `distribution` always up to date by creating an index on the view:

``````CREATE INDEX distribution_idx ON distribution (cumulative_distribution);
``````

Histograms work well for a domain with low cardinality. But note the cross join in the definition of `distribution`. As a consequence, the required memory grows quadratic in the number of unique values and may therefore grow large if the domain has many unique values.

## Using HDR histograms to compute approximate percentiles

HDR histograms can be used to approximate percentiles in a space efficient manner that scales well even for large domains with many distinct values. HDR histograms reduce the precision of values that are tracked and use buckets with variable width. Buckets that are closer to 0 are smaller whereas buckets far away from 0 are wider. This works particularly well for data that exhibits a long tail of large values, e.g., latency measurements.

HDR histograms are related to how floating point numbers are represented as integers. The underlying assumption is that smaller numbers require a higher precision to be distinguishable (e.g. 5 ms and 6 ms are different and should be in different buckets) whereas larger numbers can be rounded more aggressively as their relative error becomes less relevant (e.g. 10000 ms and 10001 ms are almost the same and can reside in the same bucket).

The following snippets reduce the precision of numbers to limit the amount of required buckets. Numbers are decomposed into their floating point representation

``````n = sign * mantissa * 2**exponent
``````

and the precision of the mantissa is then reduced to compute the respective bucket.

The basic ideas of using histograms to compute percentiles remain the same; but determining the bucket becomes more involved because itâ€™s now composed of the triple (sign, mantissa, exponent).

``````-- precision for the representation of the mantissa in bits
\set precision 4

CREATE VIEW hdr_histogram AS
SELECT
CASE WHEN value<0 THEN -1 ELSE 1 END AS sign,
trunc(log(2.0, abs(value)))::int AS exponent,
trunc(pow(2.0, :precision) * (value / pow(2.0, trunc(log(2.0, abs(value)))::int) - 1.0))::int AS mantissa,
count(*) AS frequency
FROM input
GROUP BY sign, exponent, mantissa;
``````

The `hdr_distribution` view below reconstructs the `bucket` (with reduced precision), and determines the cumulative count and cumulative distribution.

``````CREATE VIEW hdr_distribution AS
SELECT
h.sign*(1.0+h.mantissa/pow(2.0, :precision))*pow(2.0,h.exponent) AS bucket,
h.frequency,
sum(g.frequency) AS cumulative_frequency,
sum(g.frequency) / (SELECT sum(frequency) FROM hdr_histogram) AS cumulative_distribution
FROM hdr_histogram g, hdr_histogram h
WHERE (g.sign,g.exponent,g.mantissa) <= (h.sign,h.exponent,h.mantissa)
GROUP BY h.sign, h.exponent, h.mantissa, h.frequency
ORDER BY cumulative_distribution;
``````

This view can then be used to query approximate percentiles. More precisely, the query returns the lower bound for the percentile (the next larger bucket represents the upper bound).

``````SELECT bucket AS approximate_percentile
FROM hdr_distribution
WHERE cumulative_distribution >= 0.9
ORDER BY cumulative_distribution
LIMIT 1;
``````

As with histograms, increase query performance by creating an index on the `cumulative_distribution` column.

``````CREATE INDEX hdr_distribution_idx ON hdr_distribution (cumulative_distribution);
``````

## Examples

``````CREATE TABLE input (value BIGINT);
``````

Let’s add the values 1 to 10 into the `input` table.

``````INSERT INTO input SELECT n FROM generate_series(1,10) AS n;
``````

For small numbers, `distribution` and `hdr_distribution` are identical. Even in `hdr_distribution`, all numbers from 1 to 10 are stored in their own buckets.

``````SELECT * FROM hdr_distribution;

bucket | frequency | cumulative_frequency | cumulative_distribution
--------+-----------+----------------------+-------------------------
1 |         1 |                    1 |                     0.1
2 |         1 |                    2 |                     0.2
3 |         1 |                    3 |                     0.3
4 |         1 |                    4 |                     0.4
5 |         1 |                    5 |                     0.5
6 |         1 |                    6 |                     0.6
7 |         1 |                    7 |                     0.7
8 |         1 |                    8 |                     0.8
9 |         1 |                    9 |                     0.9
10 |         1 |                   10 |                       1
(10 rows)
``````

But if values grow larger, buckets can contain more than one value. Let’s see what happens if more values are added to the `input` table.

``````INSERT INTO input SELECT n FROM generate_series(11,10001) AS n;
``````

In the case of the `hdr_distribution`, a single bucket represents up to 512 distinct values, whereas each bucket of the `distribution` contains only a single value.

``````SELECT * FROM hdr_distribution ORDER BY cumulative_distribution;

bucket | frequency | cumulative_frequency |          cumulative_distribution
--------+-----------+----------------------+-------------------------------------------
1 |         1 |                    1 |     0.00000999990000099999000009999900001
2 |         1 |                    2 |     0.00001999980000199998000019999800002
3 |         1 |                    3 |     0.00002999970000299997000029999700003
4 |         1 |                    4 |     0.00003999960000399996000039999600004
5 |         1 |                    5 |     0.00004999950000499995000049999500005
...skipping...
7424 |       256 |                 7679 | 0.767823217678232176782321767823217678232
7680 |       256 |                 7935 | 0.793420657934206579342065793420657934207
7936 |       256 |                 8191 | 0.819018098190180981901809819018098190181
8192 |       512 |                 8703 |  0.87021297870212978702129787021297870213
8704 |       512 |                 9215 | 0.921407859214078592140785921407859214079
9216 |       512 |                 9727 | 0.972602739726027397260273972602739726027
9728 |       274 |                10001 |                                         1
(163 rows)
``````

Note that `hdr_distribution` only contains 163 rows as opposed to the 10001 rows of `distribution`, which is used in the histogram approach. However, when querying for the 90-th percentile, the query returns an approximate percentile of `8704` (or more precisely between `8704`and `9216`) whereas the precise percentile is `9001`.

``````SELECT bucket AS approximate_percentile
FROM hdr_distribution
WHERE cumulative_distribution >= 0.9
ORDER BY cumulative_distribution
LIMIT 1;

approximate_percentile
------------------------
8704
(1 row)
``````

The precision of the approximation can be adapted by changing the `precision` in the definition of `hdr_histogram`. The higher the `precision`, the fewer items are kept in the same bucket and therefore the more precise the approximate percentile becomes. The lower the `precision`, the more items are kept in the same bucket and therefore the less memory is required.