Citus Blog

Articles tagged: topn

Claire Giordano

What’s new with Postgres at Microsoft (August 2023)

Written byBy Claire Giordano | August 31, 2023Aug 31, 2023

On one of the Postgres community chat forums, a friend asked me: "Is there a blog post that outlines all the work that is being done on Postgres at Microsoft? It's hard to keep track these days."

And my friend is right: it is hard to keep track. Probably because there are multiple Postgres workstreams at Microsoft, spread across a few different teams.

In this post, you'll get a bird's eye view of all the Postgres work the Microsoft team has done over the last year. Our work includes some pretty significant improvements to the Postgres managed services on Azure, as well as contributions across the entire open source ecosystem—including commits to the Postgres core; new releases to Postgres open source extensions like Citus and pg_cron; plus ecosystem work on Patroni, PgBouncer, pgcopydb. And more.

Keep reading
Craig Kerstiens

Approximation algorithms for your database

Written byBy Craig Kerstiens | February 28, 2019Feb 28, 2019

In an earlier blog post I wrote about how breaking problems down into a MapReduce style approach can give you much better performance. We've seen Citus is orders of magnitude faster than single node databases when we're able to parallelize the workload across all the cores in a cluster. And while count (*) and avg is easy to break into smaller parts I immediately got the question what about count distinct, or the top from a list, or median?

Exact distinct count is admittedly harder to tackle, in a large distributed setup, because it requires a lot of data shuffling between nodes. Count distinct is indeed supported within Citus, but at times can be slow when dealing with especially larger datasets. Median across any moderate to large size dataset can become completely prohibitive for end users. Fortunately for nearly all of these there are approximation algorithms which provide close enough answers and do so with impressive performance characteristics.

Keep reading
Ozgun Erdogan

Citus 7.5: The right way to scale SaaS apps

Written byBy Ozgun Erdogan | August 3, 2018Aug 3, 2018

One of the primary challenges with scaling SaaS applications is the database. While you can easily scale your application by adding more servers, scaling your database is a way harder problem. This is particularly true if your application benefits from relational database features, such as transactions, table joins, and database constraints.

At Citus, we make scaling your database easy. Over the past year, we added support for distributed transactions, made Rails and Django integration seamless, and expanded on our SQL support. We also documented approaches to scaling your SaaS database to thousands of customers.

Today, we’re excited to announce the latest release of our distributed database—Citus 7.5. With this release, we’re adding key features that make scaling your SaaS / multi-tenant database easier. If you’re into bulleted lists, these features include the following.

Keep reading

Today, we’re excited to announce Citus 7.3—the latest release of our distributed database that scales out Postgres. Citus 7.3 improves support for complex analytical queries, provides integration with Tableau and other BI tools, and integrates with the open source Postgres extension, TopN.

The features in this latest Citus database release are particularly important for real-time analytics workloads. In these workloads, users typically need to ingest data in real time and run analytical queries with sub-second response times. A good example is when you’re serving a dashboard to thousands of customers and your database needs to provide fast replies over billions of rows.

Here’s a quick overview of what’s new in Citus. For an overview of other recent Citus features, check out these blog entries about TopN for your Postgres database and Citus 7.2.

Keep reading
Furkan Sahin

TopN for your Postgres database

Written byBy Furkan Sahin | March 27, 2018Mar 27, 2018

People seem to love lists of the most popular things. I think this is true of many of us. Including developers. Did you get all excited like I did, and listen right away to every song on the list when Spotify released Your Top Songs 2017? (Here are mine) When the Academy Awards were announced, did you check in on the candidates and winners? Did you pay attention to the medalists and top scoring hockey teams in the Winter Olympics?

Sometimes this problem of finding the top on a list is referred to as the Top-K problem. Also the Top "N" problem. Whether it’s the top grossing sales reps or the most visited pages on your website, and whether you call it the Top K or the TopN, for most of us, there is usually something we want to know the top "N" of.

Finding the top "N" is not easy

To find the top occurring item you generally need to count through all the records. Counting the clicks in your web app, the number of times you’ve listened to song, or the number of downloads of your project. It is all about counting. Counting, sorting, and limiting the list in Postgres is straightforward, and this works great on smaller sets of data. What if there are thousands of events? Machines these days are pretty fast so this isn’t much of a problem. Millions is even acceptable. Billions? That may take a bit longer…

However, getting the counts of different items, sorting them and taking the top "N" of them out of your database—that can start to become much more challenging at larger scale.

Even further, what if you want to materialize your top N results for smaller sets in regular basis and run some combination queries to further analyze? The real problem starts then. Calculating the Top N can be a challenge. This is why my team at Citus Data (where we build the Citus extension to Postgres that scales out Postgres horizontally) is happy to announce the release of the open source TopN extension for PostgreSQL.

Inspiration for TopN came from a Citus Data customer who needed to use TopN-like functionality in concert with the Citus extension that scales out their Postgres database. When designing TopN, we decided to implement TopN as a Postgres extension. And we decided to write TopN in C. TopN outputs a JSONB object which you can flexibly use for different use cases. Aggregation functions which take JSONB input and union them together are also included.

TopN can be used to calculate the most frequently occurring values in a column, and is part of the class of probabilistic distinct algorithms called sketch algorithms. Let's look further at how the TopN extension to Postgres actually works.

Keep reading

Page 1 of 1