In an earlier blog post I wrote about how breaking problems down with a MapReduce-style approach can give you much better performance. We’ve seen that Citus is orders of magnitude faster than single-node databases when we’re able to parallelize the workload across all the cores in a cluster. And while count(*) and avg are easy to break into smaller parts, I immediately got the question: what about count distinct, the top items from a list, or median?
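To make "easy to break into smaller parts" concrete, here is a minimal sketch of an average computed in two phases. It is illustrative Python, not Citus code: the names `partial_avg` and `combine_avg` and the in-memory `shards` list are hypothetical stand-ins for per-node work and the final combine step.

```python
# Illustrative sketch only: each "shard" is a plain list standing in for
# the rows stored on one worker node.

def partial_avg(rows):
    """Map step: each node returns a (sum, count) pair for its own rows."""
    return (sum(rows), len(rows))

def combine_avg(partials):
    """Reduce step: add up the partial sums and counts, then divide once."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None

shards = [[10, 20, 30], [40, 50], [60]]
partials = [partial_avg(rows) for rows in shards]
overall = combine_avg(partials)

# Averaging the per-shard averages gives a different (wrong) answer
# whenever the shards hold different numbers of rows.
naive = sum(s / c for s, c in partials) / len(partials)
```

The point of the design is that each node only ships two numbers to the coordinator, no matter how many rows it holds. Count distinct and median have no such small, exactly combinable partial result.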
Exact distinct count is admittedly harder to tackle in a large distributed setup because it requires a lot of data shuffling between nodes. Count distinct is indeed supported within Citus, but it can be slow at times, especially on larger datasets. Median across any moderate to large dataset can become completely prohibitive for end users. Fortunately, for nearly all of these there are approximation algorithms that provide close-enough answers and do so with impressive performance characteristics.
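As one illustration of such an approximation algorithm, here is a rough sketch of a HyperLogLog-style distinct counter. The text above does not say which algorithm is used, so treat HyperLogLog as my choice of example. The register count, the SHA-1 based hash, and all function names are assumptions made for illustration, not the implementation of any particular extension. What matters is that each node builds a small fixed-size sketch and that sketches merge with an element-wise max, so no raw values need to be shuffled between nodes.

```python
import hashlib
import math

P = 12            # assumed precision: 2**12 registers
M = 1 << P

def _hash64(value):
    """Deterministic 64-bit hash (first 8 bytes of SHA-1, illustrative choice)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def new_sketch():
    return [0] * M

def add(sketch, value):
    h = _hash64(value)
    idx = h >> (64 - P)                    # top P bits select a register
    rest = h & ((1 << (64 - P)) - 1)       # remaining 64 - P bits
    # Rank = 1-based position of the leftmost 1 bit in the remaining bits.
    rank = (64 - P) - rest.bit_length() + 1
    if rank > sketch[idx]:
        sketch[idx] = rank

def merge(a, b):
    """Combine two sketches; the result describes the union of both inputs."""
    return [max(x, y) for x, y in zip(a, b)]

def estimate(sketch):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)     # small-range correction
    return raw

def sketch_of(values):
    s = new_sketch()
    for v in values:
        add(s, v)
    return s

# Two hypothetical nodes whose data overlaps: 50,000 distinct ids in total.
node_a = sketch_of(range(0, 30000))
node_b = sketch_of(range(20000, 50000))
approx = estimate(merge(node_a, node_b))
```

With 4,096 registers the expected relative error is on the order of one to two percent, and the coordinator only needs each node's register array rather than its raw values. Simply adding the two per-node exact distinct counts would instead report 60,000, because the 10,000 shared ids would be counted twice.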