If you want to learn more about Citus on Microsoft Azure, read this post about Hyperscale (Citus) on Azure Database for PostgreSQL.

*This diary-of-an-engineer post from our Citus open source engineering team was originally published on the Azure Postgres Blog.*

When working on the internals of Citus, an open source extension to Postgres that transforms Postgres into a distributed database, we often get to talk with customers that have interesting challenges you won’t find everywhere. Just a few months back, I encountered an analytics workload that was a really good fit for Citus.

But we had one problem: the percentile calculations on their data (over 300 TB of data) could not meet their SLA of 30 seconds.

To make things worse, the query performance was not even close to the target: the percentile calculations were taking about 6 minutes instead of the required 30 second SLA.

Figuring out how to meet the 30 second Postgres query SLA was a challenge because we didn’t have access to our customer’s data—and also because my customer didn’t have the cycles to compare the performance for different approaches I was considering. So we had to find ways to *estimate* which types of percentile calculations would meet their SLA, without having to spend the engineering cycles to implement different approaches.

This post explores how—with the help of the Postgres open source community—I was able to reduce the time to calculate percentiles by 45x by using the t-digest extension to Postgres.

## Importance of calculating percentiles in analytics workloads

My customer operates a multi datacenter web application with a real-time analytics dashboard that displays statistics about a variety of signals—and they store the analytics data in Hyperscale (Citus) on our Azure Database for PostgreSQL managed service. They ingest over 2 TB of data per hour and needed to get < 30 second performance for their queries over a 7-day period This analytics dashboard is used by their engineers to debug and root cause customer-reported issues. So they query metrics like latency, status codes, and error codes based on dimensions such as region, browser, data center, and the like.

Latency is of course an important metric for understanding these types of issues. However, average latency can be very misleading—which is where percentiles come in. If 1% of your users are experiencing super slow response times, the average query response time may not change much, leading you to (incorrectly) think that nothing is wrong. However, you would see a notable difference in P99, allowing you to isolate issues much faster.

Which is why metrics like P99 are so important when monitoring analytics workloads. A P99 query response time of 500ms means that the response time for 99% of your queries are faster than 500ms.

## Native percentile functions in Postgres didn’t do the trick

Postgres provides native support for selecting the value of a column at a certain percentile with the ordered-set aggregate functions:

`percentile_cont`

`percentile_disc`

Having native support for percentile calculations in Postgres is super powerful: all you have to do is specify you want the percentile and then you can let Postgres figure out how to get it.

And yet… percentiles can’t be sped up by indexes in Postgres.

When diving into the implementation of `percentile_cont`

in Postgres, you will find the transition functions for these aggregates collect, store, and sort the data internally. Meaning Postgres cannot rely on a known sort order of an index to significantly speed up the calculation.

Add on top that, when Postgres is sharded across multiple nodes by Citus and we want the percentile over a sharded dataset, we cannot combine the percentiles we get back from the distributed shards and expect a correct result. Instead Citus requires all rows to be sent to one location, sorted locally, and find the value at the requested percentile. The copying of data takes the most time in a distributed Citus cluster and this was in fact the problem for our customer: out of the 6 minutes for the query, 4½ minutes were spent in just pulling data to the coordinator.

To speed up the percentile calculations with Citus, we needed to reduce the amount of data required at the Citus coordinator to get to a reasonable percentile.

Lucky for us this was not the first time we had needed to find an effective way to run complex SQL calculations—at scale. In the past we have written about similar challenges with COUNT(DISTINCT) queries—which we solved with the HyperLogLog approximation algorithm, also known as HLL.

Hence this epiphany: Using a trusted, mathematically-robust approximation algorithm (sometimes called a sketch algorithm) might help us speed up performance.

## Which percentile approximation technique to use with Postgres?

A quick exploration in the space of percentile approximation pinpointed 3 different Postgres extensions we could consider. Two of these Postgres extensions had been created by Citus engineering interns in years past—and while not yet finished and not yet open source, these two prototype extensions deserved consideration:

- High Dynamic Range Histograms, HDR for short
- t-digest (the one created by our Citus intern)

And one open source t-digest extension contributed by Postgres committer Tomas Vondra:

With 3 competing extensions to Postgres that were all suitable for the approximation the question changes from how to, into the question of which one to use? To decide between the 3 different extensions for approximating percentiles, we needed to answer these 2 questions for each option:

- How fast would the execution of our customer be?
- How production-ready is the extension?

And we wanted to be able to answer these questions without having to integrate each of the approximation algorithms with Citus—and without having to ask our customer to measure how fast each performed.

## The challenge: Figuring out if approximation algorithms would meet the SLA, without having to implement & measure all the approaches

Getting to our customer SLA of 30 seconds from over 6 minutes was a hard task. And before doing too much engineering work, we needed to assess if calculating the percentiles using one of these Postgres extensions like t-digest or HDR was going to hit the 30 second SLA.

Given that COUNT DISTINCT had already been solved with HyperLogLog, we realized we could use a query that triggers the HLL code path to establish a ballpark figure on time required for the calculation, omitting some computational details.

To get this information, we asked our customer for the Postgres EXPLAIN plans of the original percentile queries and counting the distinct number of latencies (with HLL). Both were grouped by the desired time interval—minutes of a day—which resulted in 1440 groups.

Type of calculation with customer data (based on our customer’s Postgres EXPLAIN plans) | Execution Time |
---|---|

percentile_cont | 361 seconds (~6 minutes) |

COUNT DISTINCT – via HyperLogLog approximation | 4 seconds |

Even though the COUNT DISTINCT via HLL above gives an answer that tells us nothing, the execution characteristics for approximating the counts is very similar on how we would approximate the percentiles.

The shared execution characteristics between COUNT DISTINCT via HLL and approximating percentiles are:

- to iterate over all values,
- change a summary we keep,
- send the summaries to the Citus coordinator, and finally
- combine the summaries to come to an answer.

From the COUNT DISTINCT via HLL row in the table above, you can see that using an approximation algorithm (and maintaining the summary) seemed to get us in the ballpark of the desired execution times. But it was unclear how the time it would take to do percentile approximations compares to the time it takes to approximate COUNT DISTINCT.

We needed to figure out, would it be 2X or 20X more expensive to approximate percentiles vs. to approximate COUNT DISTINCTs?

The good news: to get a multiplier for the compute required, we did not have to rely on the environment of our customer.

Instead we set a baseline by running roughly the same approximation. In this case, the number of rows and the number of expected groups, could give a good indication. By running both a count distinct approximation and a percentile approximation, we were able to measure how much more work one requires over the other on the same data. This multiplier also shed light on whether we would be able to meet the target execution time of 30 seconds.

We used these queries to estimate the performance:

```
WITH data AS (
SELECT
(random()*60*24)::integer AS minute,
(random()*60000)::integer AS latency
FROM generate_series(1,1000000)
)
SELECT
minute,
aggregate_to_create_summary(latency) AS summary
FROM data
GROUP BY minute;
```

Executing these experiments gave us a workload multiplier which we applied to the runtime as measured at the target infrastructure. We ran these experiments for the 3 extensions we had identified earlier (t-digest, t-digest, and HDR) and the HyperLogLog (HLL) extension used for count approximations. The multipliers in the table below are against the baseline of the HLL extension.

Approximation algorithm | Runtime | Multiplier vs. HLL (lower means faster) |
---|---|---|

HDR – prototype | 2023 ms | ~2x |

t-digest – prototype | 4089 ms | ~4x |

t-digest – open source | 1590 ms | ~1.5x |

HyperLogLog (baseline) | 1026 ms | 1 |

## Data sizes matter

Besides compute time, the size of the data being transferred from the shards to the coordinator has a big influence on the total execution time. That is why we started this investigation in the first place: because transferring all rows to the coordinator had been taking the majority of the 6 minutes. Due to simplicity and portability, Citus uses the text-based protocol for transferring rows from the workers to the coordinator. To get an idea of data sizes involved, we can cast the summaries created above to text and sum their lengths to get an idea of the size of data transferred. This is again not an exact size but more an order of magnitude check.

To get an estimated guess on how long the final execution would take using t-digest, we compared the transfer sizes and then we used a multiplier on the transfer times (measured in the target environment.)

```
WITH data AS (
SELECT
(random()*60*24)::integer AS minute,
(random()*60000)::integer AS latency
FROM generate_series(1,1000000)
),
summary_sizes AS (
SELECT
minute,
octet_length(aggregate_to_create_summary(latency)::text) AS percentile_size
FROM data
GROUP BY minute
)
SELECT sum(percentile_size)*200 FROM summary_sizes;
```

Postgres Extension | Amount of data transferred to Citus coordinator | Multiplier (lower is better) |
---|---|---|

HDR prototype | 3.8 GB | ~6x |

t-digest prototype | 1.7 GB | ~2.7x |

t-digest (open source) | 218 MB | ~0.34x |

HyperLogLog (baseline) | 646 MB | 1 |

Based on both the measurements of compute time as well as the expected data transfer sizes it became clear the open source t-digest extension created by Postgres committer Tomas Vondra would yield the best performance for Citus and specifically for the analytics use case we were targeting.

## If you can’t beat them, join them

At this point, we had a good understanding of the speeds that could be achieved for percentile approximation with the different Postgres extensions we were considering. And the good news was, the projected speeds were well within our customer’s SLA.

**The hard question we had to answer: should we spend time to productize our own t-digest or HDR prototype extensions—or should we adopt (and try to improve) the existing open source t-digest extension?**

A quick chat with the author of the open source github.com/tvondra/tdigest extension, Tomas Vondra, revealed that he uses the extension to great satisfaction. Also, Tomas was open to contributions to the extension to make it work well for Citus. The documented use of the open source t-digest extension by at least 1 user was already infinitely more than our internal prototypes. And fixing a bug we had encountered was also straightforward. One pull request laterand we had an open source t-digest extension that would work with Citus.

The next step was to integrate t-digest with Citus. Given our experience with HLL in the past, the integration of t-digest with the Citus distributed Postgres planner was straightforward. It was time to go back to our customer to validate that we could meet their SLA. And we did meet our customer’s SLA: with t-digest and Hyperscale(Citus), our customer’s Postgres queries to approximate percentiles now ran in 7.5 seconds, which was **45 times faster compared to their initial 6 minute Postgres query**.

## More open source adoption of t-digest (with Citus)

One positive side-effect of embracing the existing open source t-digest and contributing back to the t-digest project: we found more users that were interested in using t-digest with Citus!

Min Wei from the Windows Data & Intelligence team—who uses Citus with Postgres on Azure to support over 6M queries per day—also needs to do percentile approximation at scale. And while working on this project we discovered that Min is looking to use the exact exact same t-digest extension to Postgres, too.

Matt Watson from Stackify recently published a blog about calculating percentiles with t-digest in Citus. Matt even helped improve the extension by documenting an edge case where the calculations were off, making t-digest work better for Stackify and pretty much all users—including the original author and all of our Citus users. Were we to have selected a closed source extension, our efforts would not have helped other customers like Stackify and Min Wei of Microsoft, and we would have run into similar bugs that we would have had to fix ourselves. By adopting an open source solution and improving it collectively, we make it work better for all.