Blog posts by Burak Yucesoy on the Citus Blog

Efficient rollup tables with HyperLogLog in Postgres

Written by Burak Yucesoy | June 30, 2017

HyperLogLog is an awesome approximation algorithm that addresses the distinct count problem. I am a big fan of HyperLogLog (HLL), so much so that I already wrote about the internals and how HLL solves the distributed distinct count problem. But there’s more to talk about, including HLL with rollup tables.

Rollup Tables and Postgres

Rollup tables are commonly used in Postgres when you don’t need to perform detailed analysis, but you still need to answer basic aggregation queries on older data.

With rollup tables, you can pre-aggregate your older data for the queries you still need to answer. You then no longer need to keep all of the older data in place; instead, you can delete it or roll it off to slower storage, saving space and computing power.

Let’s walk through a rollup table example in Postgres without using HLL.


Distributed count(distinct) with HyperLogLog on Postgres

Written by Burak Yucesoy | April 4, 2017

Running SELECT COUNT(DISTINCT) on your database is all too common. In applications, it's typical to have an analytics dashboard highlighting the number of unique items, such as unique users, unique products, or unique visits. While traditional SELECT COUNT(DISTINCT) queries work well in single-machine setups, they are a difficult problem to solve in distributed systems. When you have this type of query, you can't just push the query to the workers and add up the results, because most likely there will be overlapping records on different workers. Instead, you have two options:

Pull all distinct data to one machine and count there. (Doesn't scale)
Do a map/reduce. (Scales but it's very slow)
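
To make the overlap problem concrete, here is a minimal Python sketch. The worker names and user IDs are made up for illustration; the point is only that summing per-worker distinct counts overcounts values that appear on more than one worker, while pulling everything to one place gives the exact answer at the cost of moving all the data.

```python
# Hypothetical per-worker data: user IDs seen on each worker.
worker_rows = {
    "worker_1": ["alice", "bob", "carol", "alice"],
    "worker_2": ["bob", "dave", "carol"],
    "worker_3": ["erin", "alice"],
}


def naive_distinct_count(rows_by_worker):
    """Push COUNT(DISTINCT) to each worker and add up the results (wrong)."""
    return sum(len(set(rows)) for rows in rows_by_worker.values())


def exact_distinct_count(rows_by_worker):
    """Pull all distinct values to one machine and count there (exact, but doesn't scale)."""
    all_values = set()
    for rows in rows_by_worker.values():
        all_values.update(rows)
    return len(all_values)


naive = naive_distinct_count(worker_rows)   # 3 + 3 + 2 = 8
exact = exact_distinct_count(worker_rows)   # alice, bob, carol, dave, erin = 5
print(naive, exact)
```

The naive sum counts alice three times and bob and carol twice each, which is exactly the overlap described above.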

This is where approximation algorithms, or sketches, come in. Sketches are probabilistic algorithms that generate approximate results efficiently within mathematically provable error bounds. There are many of them out there, but today we're going to focus on just one: HyperLogLog, or HLL. HLL is very successful at estimating the number of unique elements in a list. First we'll look at some of the internals of HLL to understand why the algorithm is useful for solving the distinct count problem in a scalable way, then at how it can be applied in a distributed fashion. Finally, we will see some examples of HLL usage.
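
As a rough illustration of why a sketch like HLL suits the distributed case, below is a simplified, textbook-style HyperLogLog in Python. It is a sketch under stated assumptions, not the implementation used by any Postgres extension: the register count (P = 12), the SHA-1 based hash, and all function names are choices made here for illustration. The property to notice is that two sketches merge by taking the element-wise maximum of their registers, so each worker can build its own sketch and the coordinator can combine them without seeing the raw data.

```python
import hashlib
import math

P = 12                              # index bits (assumed for illustration)
M = 1 << P                          # number of registers
ALPHA = 0.7213 / (1 + 1.079 / M)    # standard bias constant for M >= 128


def _hash64(value):
    """Deterministic 64-bit hash (SHA-1 prefix; an arbitrary choice here)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def new_sketch():
    return [0] * M


def add(sketch, value):
    h = _hash64(value)
    index = h >> (64 - P)                   # top P bits pick a register
    rest = h & ((1 << (64 - P)) - 1)        # remaining 64 - P bits
    # 1-based position of the leftmost 1-bit in `rest`
    rank = (64 - P) - rest.bit_length() + 1
    if rank > sketch[index]:
        sketch[index] = rank


def merge(a, b):
    """Union of two sketches: element-wise maximum of the registers."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(sketch):
    raw = ALPHA * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        # Small-range correction (linear counting)
        return M * math.log(M / zeros)
    return raw


# Two hypothetical workers with overlapping users (4000..5999 on both).
sketch_a, sketch_b, union_sketch = new_sketch(), new_sketch(), new_sketch()
for i in range(0, 6000):
    add(sketch_a, f"user-{i}")
for i in range(4000, 10000):
    add(sketch_b, f"user-{i}")
for i in range(0, 10000):
    add(union_sketch, f"user-{i}")

merged = merge(sketch_a, sketch_b)
print(round(estimate(merged)))      # approximate count of the 10,000 distinct users
```

Because merging only takes maxima, the merged sketch is identical to the one built from all the data on a single machine, so overlapping records are not double counted.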
