Distributed count distinct vs. HyperLogLog in Postgres

Citus 7.1 shipped just a few weeks back and included a number of great new features. In case you missed the details, check out Ozgun’s blog or read up on what Citus is on our site. Today, though, we want to drill further into an important area in Postgres: counting.

Getting a distinct count of some value out of your database is a common question. We’ve talked about how to count more quickly on our blog before, and followed that up with how you can use probabilistic algorithms like HyperLogLog to do counts faster.

Citus 7.1 now excels with exact COUNT(DISTINCT) clauses

With Citus 7.1 you can now do distributed counts in a number of ways: perfectly accurate distinct counts, even on columns other than your partition column, or the same great probabilistic algorithms you could use before.
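To make that concrete, here is a minimal sketch of what such a query looks like. The table and column names (`page_views`, `site_id`, `user_id`) are our own, made up for illustration; `create_distributed_table` is the standard Citus function for sharding a table on a partition column.

```sql
-- Hypothetical schema: a page_views table distributed by site_id.
CREATE TABLE page_views (
    site_id   bigint,
    user_id   bigint,
    viewed_at timestamptz
);
SELECT create_distributed_table('page_views', 'site_id');

-- With Citus 7.1, an exact distinct count on a column other than the
-- distribution column is parallelized across the worker nodes:
SELECT count(DISTINCT user_id) FROM page_views;
```

The interesting part is the last query: `user_id` is not the partition column, and before 7.1 a query like this would not have been supported in this form.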

First, let’s lay out our environment. In this case we’re going to run a Citus database cluster of 4 nodes with 16 GB of memory each. Our initial test loads 100 million records with varying numbers of unique values. Running this across 10K, 100K, 1M, and 10M unique values, we see the following performance (times are minutes:seconds):

Total Rows | Distinct Rows | Total Time | Worker Time | Data Transfer Time | Coordinator Time
100M       | 10K           |  00:02     |  *          |    *               | *
100M       | 100K          |  00:07     |  *          |    *               | *
100M       | 1M            |  00:51     |  00:07      |    00:24           | 00:20
100M       | 10M           |  02:26     |  00:11      |    01:17           | 00:58
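One way to generate a test table like this is with `generate_series`, using a modulus to control how many distinct values appear. This is a sketch under our own naming (`test_data`, `id`, `val`), not the exact benchmark script; for loads of this size, bulk loading with `COPY` is typically faster than a coordinator-side `INSERT ... SELECT`.

```sql
-- Hypothetical test table, distributed on id.
CREATE TABLE test_data (id bigint, val bigint);
SELECT create_distributed_table('test_data', 'id');

-- 100M rows; the modulus caps val at 1M distinct values.
INSERT INTO test_data
SELECT i, i % 1000000
FROM generate_series(1, 100000000) AS i;

-- The timed query: an exact distinct count on a non-distribution column.
SELECT count(DISTINCT val) FROM test_data;
```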

What we can see in our initial test is that the more data we have, the longer every step in the process takes, and the data transfer time grows the most.

Let’s see what happens as we increase this to 500 million total records, and also increase the number of distinct values we’re counting:

Total Rows | Distinct Rows | Total Time | Worker Time | Data Transfer Time | Coordinator Time
500M       | 100K          |  00:26     |  00:16      |    *               |  *
500M       | 1M            |  01:02     |  00:20      |    00:20           |  00:22
500M       | 10M           |  07:11     |  00:20      |    04:00           |  02:41
500M       | 100M          |  13:44     |  00:25      |    07:30           |  06:14

Here again we see that the bulk of the time is in the network transfer of the data. The upside of all of this is that we get perfectly exact distinct counts across our distributed cluster.

HyperLogLog still reigns supreme in Citus

With Citus 7.1, SELECT COUNT(DISTINCT) is now easy to put in place and parallelizes well. But we shouldn’t forget about HyperLogLog. At larger scale, sketch algorithms like HyperLogLog can be really effective. (Sketches are probabilistic algorithms which can generate approximate results efficiently, within mathematically provable error bounds.) So how does HyperLogLog measure up?

Total Rows | Distinct Rows | Total Time | HLL Time
100M       | 10K           |  00:02     |  00:01
100M       | 100K          |  00:07     |  00:01
100M       | 1M            |  00:51     |  00:01
100M       | 10M           |  02:26     |  00:01

Total Rows | Distinct Rows | Total Time | HLL Time
500M       | 100K          |  00:26     |  00:06
500M       | 1M            |  01:02     |  00:07
500M       | 10M           |  07:11     |  00:07
500M       | 100M          |  13:44     |  00:07
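For reference, here is a sketch of what running an approximate count with the postgresql-hll extension looks like, assuming a distributed table along the lines of the test setup described above (the `test_data`/`val` names are our own). The `hll_hash_bigint`, `hll_add_agg`, and `hll_cardinality` functions come from the hll extension itself.

```sql
-- Enable the postgresql-hll extension (must be available on all nodes).
CREATE EXTENSION IF NOT EXISTS hll;

-- Approximate distinct count: hash each value, fold the hashes into an
-- HLL sketch, then ask the sketch for its cardinality estimate.
SELECT hll_cardinality(hll_add_agg(hll_hash_bigint(val)))
FROM test_data;
```

Citus can also transparently rewrite a plain `count(DISTINCT ...)` through hll when the `citus.count_distinct_error_rate` setting is configured, so you don’t necessarily have to change your queries to benefit.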

As we can see from our smaller dataset of 100M records, when we have a smaller set of unique values we get some performance gains from HyperLogLog, but they’re nowhere near as large as the gains with, say, 500M records and 100M unique values. As the table above shows, at 500 million records, counting 100 million unique values is over 100 times faster with HyperLogLog (13:44 vs. 00:07).

It’s worth noting that HyperLogLog also parallelizes within Citus, just as count distinct does. We’ve been big fans of the Postgres HyperLogLog extension for many years and are excited to take on responsibility for maintaining HyperLogLog going forward.

Count distinct or HyperLogLog?

As with most things, whether you want to use count distinct or HyperLogLog, the answer is: it depends. If you have a dataset of 100 million or even 500 million records, the performance of count distinct may work for you. Further, in certain use cases you may not be able to get by with close-enough counts; you may need to be perfectly accurate. In that case, count distinct in a distributed fashion will still be faster than a single node for very large datasets. But if you can get by with close counts, HyperLogLog is your friend.