TopN for your Postgres database

People seem to love lists of the most popular things. I think this is true of many of us. Including developers. Did you get all excited like I did, and listen right away to every song on the list when Spotify released Your Top Songs 2017? (Here are mine) When the Academy Awards were announced, did you check in on the candidates and winners? Did you pay attention to the medalists and top scoring hockey teams in the Winter Olympics?

Sometimes this problem of finding the top on a list is referred to as the Top-K problem. Also the Top "N" problem. Whether it’s the top grossing sales reps or the most visited pages on your website, and whether you call it the Top K or the TopN, for most of us, there is usually something we want to know the top "N" of.

Finding the top "N" is not easy

To find the top occurring item you generally need to count through all the records. Counting the clicks in your web app, the number of times you’ve listened to song, or the number of downloads of your project. It is all about counting. Counting, sorting, and limiting the list in Postgres is straightforward, and this works great on smaller sets of data. What if there are thousands of events? Machines these days are pretty fast so this isn’t much of a problem. Millions is even acceptable. Billions? That may take a bit longer…

However, getting the counts of different items, sorting them and taking the top "N" of them out of your database—that can start to become much more challenging at larger scale.

Even further, what if you want to materialize your top N results for smaller sets in regular basis and run some combination queries to further analyze? The real problem starts then. Calculating the Top N can be a challenge. This is why my team at Citus Data (where we build the Citus extension to Postgres that scales out Postgres horizontally) is happy to announce the release of the open source TopN extension for PostgreSQL.

Inspiration for TopN came from a Citus Data customer who needed to use TopN-like functionality in concert with the Citus extension that scales out their Postgres database. When designing TopN, we decided to implement TopN as a Postgres extension. And we decided to write TopN in C. TopN outputs a JSONB object which you can flexibly use for different use cases. Aggregation functions which take JSONB input and union them together are also included.

TopN can be used to calculate the most frequently occurring values in a column, and is part of the class of probabilistic distinct algorithms called sketch algorithms. Let's look further at how the TopN extension to Postgres actually works.

Map the elements with the respective counts

TopN initializes a static sized hash-map structure and aggregates the data into the map. The size can be set by a GUC called topn.number_of_counters. The variable basically defines the number of distinct elements that we are interested in for one set. For the sake of accuracy, we allow the hash-map to grow as big as 3*number_of_counters during one aggregation. Whenever the distinct data count exceeds this number, the least frequent half of the data is flushed and the aggregation continues.

TopN operates across text values

TopN takes the input from the text datatype. If you want to make TopN work on a non-text column, you can cast your existing datatype to text. If you do need to cast your objects to text the resulting TopN list will be of the resulting text type as well.

Results in JSON

After the data ingestion is done, and the 'top N' number of elements are stored in hash-map, this hash-map is then returned to you within a JSONB object with the elements and their frequencies. For some of you, you may have been aggregating and storing counts within your database. You can start to use them with TopN generated JSONBs, since you can combine the results to make further analysis with aggregated union functions. You can also create GIN indexes and scan the counted objects in real time.

Materializing TopNs and continuous analysis

To be able to provide a similar use case that we dealt with for a customer of our Citus distributed database, we picked a github_events dataset for the first week of 2018. You can download and do the same tests by the following way.

wget http://data.githubarchive.org/2018-01-{01..07}-{0..23}.json.gz

After ingesting the data and eliminating some of the buckets where the date is null. The data size we have is:

# select pg_size_pretty(pg_total_relation_size('github_events'));
 pg_size_pretty
----------------
 7906 MB
(1 row)

The data set includes the events for 7 days. Let’s assume we provide a dashboard to our users which they can analyse the activity in the repositories in daily basis. We can aggregate and store the TopN elements for each day by the following way;

# create table aggregated_topns (day date, topn jsonb);
CREATE TABLE
Time: 9.593 ms

# insert into aggregated_topns select date_trunc('day', created_at), topn_add_agg((repo::json)->> 'name') as topn from github_events group by 1;
INSERT 0 7
Time: 34904.259 ms (00:34.904)

Here we are calculating the top 1000 repositories on each day and insert it into our aggregation table.

When a user is interested in the top 10 elements in the 2nd and 3rd days of the new year, we can simply union the two TopN JSONBs.

postgres=# select (topn(topn_union_agg(topn), 10)).* from aggregated_topns where day IN ('2018-01-02', '2018-01-03');
                      item                      | frequency
------------------------------------------------+-----------
 dipper-github-fra-sin-syd-nrt/test-ruby-sample |     12489
 wangshub/wechat_jump_game                      |      6402
 shenzhouzd/update                              |      6170
 SCons/scons                                    |      4593
 TheDimPause/thedimpause.github.io              |      3964
 nicopeters/sigrhtest                           |      3740
 curtclifton/curtclifton.github.io              |      3345
 CreatorB/hackerdroid                           |      3206
 dipper-github-icn-bom-cdg/test-ruby-sample     |      3126
 dotclear/dotclear                              |      2992
(10 rows)

Time: 7.750 ms

If the user is interested in top 2 for each of the first three days, it is also pretty straightforward;

postgres=# select day, (topn(topn, 2)).* from aggregated_topns where day IN ('2018-01-01', '2018-01-02', '2018-01-03');
    day     |                      item                      | frequency
------------+------------------------------------------------+-----------
 2018-01-01 | dipper-github-fra-sin-syd-nrt/test-ruby-sample |      9179
 2018-01-01 | shenzhouzd/update                              |      4543
 2018-01-02 | dipper-github-fra-sin-syd-nrt/test-ruby-sample |      7151
 2018-01-02 | SCons/scons                                    |      4593
 2018-01-03 | dipper-github-fra-sin-syd-nrt/test-ruby-sample |      5338
 2018-01-03 | CreatorB/hackerdroid                           |      3206
(6 rows)

Time: 4.037 ms

Give the new TopN extension to PostgreSQL a try today

Sketch algorithms like TopN and HyperLogLog provide powerful new ways to compute valuable information more easily. Here at Citus Data are excited to have created the TopN extension to Postgres and to make it available to the PostgreSQL community of users and developers.

If you are thinking about computing or saving the top <anything> today, give TopN a try.

Written by Furkan Sahin

Former Postgres and Citus software engineer at Microsoft. Percentile approximations for Postgres. Blockchain-based data distribution systems, distributed databases, & complex data visualizations. Loves distributed tables.