People seem to love lists of the most popular things. I think this is true of many of us. Including developers. Did you get all excited like I did, and listen right away to every song on the list when Spotify released Your Top Songs 2017? (Here are mine) When the Academy Awards were announced, did you check in on the candidates and winners? Did you pay attention to the medalists and top scoring hockey teams in the Winter Olympics?
Sometimes this problem of finding the top on a list is referred to as the Top-K problem. Also the Top “N” problem. Whether it’s the top grossing sales reps or the most visited pages on your website, and whether you call it the Top K or the TopN, for most of us, there is usually something we want to know the top “N” of.
Finding the top “N” is not easy
To find the top occurring item you generally need to count through all the records. Counting the clicks in your web app, the number of times you’ve listened to song, or the number of downloads of your project. It is all about counting. Counting, sorting, and limiting the list in Postgres is straightforward, and this works great on smaller sets of data. What if there are thousands of events? Machines these days are pretty fast so this isn’t much of a problem. Millions is even acceptable. Billions? That may take a bit longer…
However, getting the counts of different items, sorting them and taking the top “N” of them out of your database—that can start to become much more challenging at larger scale.
Even further, what if you want to materialize your top N results for smaller sets in regular basis and run some combination queries to further analyze? The real problem starts then. Calculating the Top N can be a challenge. This is why my team at Citus Data (where we build the Citus extension to Postgres that scales out Postgres horizontally) is happy to announce the release of the open source TopN extension for PostgreSQL.
Inspiration for TopN came from a Citus Data customer who needed to use TopN-like functionality in concert with the Citus extension that scales out their Postgres database. When designing TopN, we decided to implement TopN as a Postgres extension. And we decided to write TopN in C. TopN outputs a JSONB object which you can flexibly use for different use cases. Aggregation functions which take JSONB input and union them together are also included.
TopN can be used to calculate the most frequently occurring values in a column, and is part of the class of probabilistic distinct algorithms called sketch algorithms. Let’s look further at how the TopN extension to Postgres actually works.