Citus Blog

Articles tagged: approximation algorithms

When working on the internals of Citus, an open source extension to Postgres that transforms Postgres into a distributed database, we often get to talk with customers who have interesting challenges you won't find everywhere. Just a few months back, I encountered an analytics workload that was a really good fit for Citus.

But we had one problem: the percentile calculations on their data (over 300 TB) could not meet their SLA of 30 seconds.

To make things worse, the query performance was not even close to the target: the percentile calculations were taking about 6 minutes instead of the required 30 seconds.
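To give a sense of the query shape involved, here is a minimal sketch of an exact percentile query in Postgres; the events table and response_time column are hypothetical stand-ins, not the customer's actual schema. An exact percentile has to order (or at least fully scan) the column, which is what makes it so expensive at hundreds of terabytes.

```sql
-- Hypothetical example: exact p99 over one day of events.
-- percentile_cont must order its whole input, so at very large scale this
-- is the kind of query that blows a 30-second SLA.
SELECT
    percentile_cont(0.99) WITHIN GROUP (ORDER BY response_time) AS p99
FROM events
WHERE event_time >= now() - interval '1 day';
```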

Keep reading
Marco Slot

What’s new in the Citus 9.4 extension to Postgres

By Marco Slot | September 5, 2020

Our latest release of the Citus extension to Postgres is Citus 9.4. If you’re not yet familiar, Citus transforms Postgres into a distributed database, distributing your data and your SQL queries across multiple nodes. This post is basically the Citus 9.4 release notes.

If you’re ready to get started with Citus, it’s easy to download Citus open source packages for 9.4.

I always recommend people check out docs.citusdata.com to learn more. The Citus documentation has rigorous tutorials, details on every Citus feature, explanations of key concepts—things like choosing the distribution column—and guides on how to set up Citus locally on a single server, how to install Citus on multiple servers, how to build a real-time analytics dashboard, how to build a multi-tenant database, and more...

Keep reading
Craig Kerstiens

Approximation algorithms for your database

By Craig Kerstiens | February 28, 2019

In an earlier blog post I wrote about how breaking problems down into a MapReduce-style approach can give you much better performance. We've seen that Citus is orders of magnitude faster than single-node databases when we're able to parallelize the workload across all the cores in a cluster. And while count(*) and avg are easy to break into smaller parts, I immediately got the question: what about count distinct, the top of a list, or median?
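As a reminder of why avg parallelizes so well, here is a minimal sketch of the idea; the measurements_1 and measurements_2 shard tables are hypothetical, and in practice Citus generates and combines the per-shard query fragments for you.

```sql
-- Each shard reports a partial sum and count; the coordinator combines them.
-- Averages can't be averaged across shards directly, but sum/count can.
SELECT sum(shard_sum)::numeric / sum(shard_count) AS avg_value
FROM (
    SELECT sum(value) AS shard_sum, count(value) AS shard_count FROM measurements_1
    UNION ALL
    SELECT sum(value) AS shard_sum, count(value) AS shard_count FROM measurements_2
) per_shard;
```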

Exact distinct count is admittedly harder to tackle in a large distributed setup, because it requires a lot of data shuffling between nodes. Count distinct is supported in Citus, but it can be slow on especially large datasets. Median across any moderate-to-large dataset can become completely prohibitive for end users. Fortunately, for nearly all of these there are approximation algorithms that provide close-enough answers with impressive performance characteristics.

Keep reading
Ozgun Erdogan

Citus 7.5: The right way to scale SaaS apps

By Ozgun Erdogan | August 3, 2018

One of the primary challenges with scaling SaaS applications is the database. While you can easily scale your application by adding more servers, scaling your database is a much harder problem. This is particularly true if your application benefits from relational database features, such as transactions, table joins, and database constraints.

At Citus, we make scaling your database easy. Over the past year, we added support for distributed transactions, made Rails and Django integration seamless, and expanded on our SQL support. We also documented approaches to scaling your SaaS database to thousands of customers.

Today, we’re excited to announce the latest release of our distributed database—Citus 7.5. With this release, we’re adding key features that make scaling your SaaS / multi-tenant database easier. If you’re into bulleted lists, these features include the following.

Keep reading

Today, we’re excited to announce Citus 7.3—the latest release of our distributed database that scales out Postgres. Citus 7.3 improves support for complex analytical queries, provides integration with Tableau and other BI tools, and integrates with the open source Postgres extension, TopN.

The features in this latest Citus database release are particularly important for real-time analytics workloads. In these workloads, users typically need to ingest data in real time and run analytical queries with sub-second response times. A good example is when you’re serving a dashboard to thousands of customers and your database needs to provide fast replies over billions of rows.

Here’s a quick overview of what’s new in Citus. For an overview of other recent Citus features, check out these blog entries about TopN for your Postgres database and Citus 7.2.

Keep reading
Furkan Sahin

TopN for your Postgres database

By Furkan Sahin | March 27, 2018

People seem to love lists of the most popular things. I think this is true of many of us, including developers. Did you get all excited like I did, and listen right away to every song on the list when Spotify released Your Top Songs 2017? (Here are mine) When the Academy Awards were announced, did you check in on the candidates and winners? Did you pay attention to the medalists and top-scoring hockey teams in the Winter Olympics?

Sometimes this problem of finding the top items on a list is referred to as the Top-K problem, or the Top "N" problem. Whether it’s the top-grossing sales reps or the most visited pages on your website, and whether you call it Top K or TopN, for most of us there is usually something we want to know the top "N" of.

Finding the top "N" is not easy

To find the top occurring item you generally need to count through all the records. Counting the clicks in your web app, the number of times you’ve listened to a song, or the number of downloads of your project—it is all about counting. Counting, sorting, and limiting the list in Postgres is straightforward, and this works great on smaller sets of data. What if there are thousands of events? Machines these days are pretty fast, so this isn’t much of a problem. Millions is even acceptable. Billions? That may take a bit longer…
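The straightforward count-sort-limit version looks something like the following; this is a generic sketch with a hypothetical page_views table, not code from the post.

```sql
-- Count every page, sort by the count, keep the top 10.
-- Fine on thousands or millions of rows; painful on billions, because the
-- whole table must be aggregated before the LIMIT can apply.
SELECT page_url, count(*) AS views
FROM page_views
GROUP BY page_url
ORDER BY views DESC
LIMIT 10;
```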

However, getting the counts of different items, sorting them and taking the top "N" of them out of your database—that can start to become much more challenging at larger scale.

Even further, what if you want to materialize your top-N results for smaller sets on a regular basis and run combination queries to analyze them further? That's when the real problem starts. Calculating the top N can be a challenge. This is why my team at Citus Data (where we build the Citus extension that scales out Postgres horizontally) is happy to announce the release of the open source TopN extension for PostgreSQL.

Inspiration for TopN came from a Citus Data customer who needed TopN-like functionality in concert with the Citus extension that scales out their Postgres database. We decided to implement TopN as a Postgres extension, and to write it in C. TopN outputs a JSONB object which you can flexibly use for different use cases, and aggregation functions which take JSONB inputs and union them together are also included.

TopN can be used to calculate the most frequently occurring values in a column, and is part of the class of probabilistic algorithms known as sketch algorithms. Let's look further at how the TopN extension to Postgres actually works.
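Here is a minimal sketch of how those pieces fit together, using the aggregate names from the TopN documentation (topn_add_agg, topn_union_agg, topn); the page_views table and the daily rollup are hypothetical.

```sql
CREATE EXTENSION topn;

-- Build one JSONB TopN sketch per day from a hypothetical page_views table.
CREATE TABLE daily_top_pages AS
SELECT date_trunc('day', view_time) AS day,
       topn_add_agg(page_url) AS top_pages
FROM page_views
GROUP BY 1;

-- Union the daily sketches and read out the overall top 10 with counts.
SELECT (topn(topn_union_agg(top_pages), 10)).*
FROM daily_top_pages;
```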

Keep reading
Murat Tuncer

Distributed count distinct vs. HyperLogLog in Postgres

By Murat Tuncer | December 22, 2017

Citus 7.1 shipped just a few weeks back and included a number of great new features. In case you missed the details, check out Ozgun’s blog or read up on what Citus is on our site. Today, though, we want to drill further into an important area in Postgres: counting.

Getting a distinct count of some value out of your database is a common question. We've talked about how to count more quickly on our blog before, and followed that up with how you can use probabilistic algorithms like HyperLogLog to do counts faster.

Keep reading
Craig Kerstiens

PGConf EU: HyperLogLog, Eclipse, and Distributed Postgres

By Craig Kerstiens | December 11, 2017

We're big fans of Postgres and enjoy getting around to the various community conferences to give talks on relevant topics as well as learn from others. A few months ago we had a good number of Citus team members over at the largest Postgres conference in Europe, and three of our Citus team members gave talks at the conference. For those of you who couldn't make the conference, we thought you might still enjoy getting a glimpse of some of the content. You can browse the full set of talks that were given and their slides on the PGConf EU website, or flip through the presentations from members of the Citus team below.

Keep reading
Burak Yucesoy

Efficient rollup tables with HyperLogLog in Postgres

By Burak Yucesoy | June 30, 2017

HyperLogLog is an awesome approximation algorithm that addresses the distinct count problem. I am a big fan of HyperLogLog (HLL), so much so that I already wrote about the internals and how HLL solves the distributed distinct count problem. But there’s more to talk about, including HLL with rollup tables.

Rollup Tables and Postgres

Rollup tables are commonly used in Postgres when you don’t need to perform detailed analysis, but you still need to answer basic aggregation queries on older data.

With rollup tables, you can pre-aggregate your older data for the queries you still need to answer. Then you no longer need to store all of the older data, rather, you can delete the older data or roll it off to slower storage—saving space and computing power.

Let’s walk through a rollup table example in Postgres without using HLL.
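Here is a minimal sketch of what such a rollup might look like in plain Postgres, with a hypothetical page_views source table and a daily rollup keyed by day and page.

```sql
-- Pre-aggregate raw page views into daily counts; once rolled up, the raw
-- rows can be deleted or moved to cheaper storage. (Hypothetical names.)
CREATE TABLE page_views_daily (
    day        date,
    page_url   text,
    view_count bigint,
    PRIMARY KEY (day, page_url)
);

INSERT INTO page_views_daily (day, page_url, view_count)
SELECT date_trunc('day', view_time)::date, page_url, count(*)
FROM page_views
WHERE view_time < date_trunc('day', now())
GROUP BY 1, 2;
```

The limitation, and the reason HLL enters the picture, is that a plain rollup like this can answer sums and counts, but not distinct counts across arbitrary date ranges.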

Keep reading

Running SELECT COUNT(DISTINCT) on your database is all too common. In applications it's typical to have some analytics dashboard highlighting the number of unique items, such as unique users, unique products, or unique visits. While traditional SELECT COUNT(DISTINCT) queries work well in single-machine setups, this is a difficult problem to solve in distributed systems. When you have this type of query, you can't just push the query down to the workers and add up the results, because most likely there will be overlapping records on different workers. Instead, you can:

  • Pull all distinct data to one machine and count it there. (Doesn't scale)
  • Do a map/reduce. (Scales, but it's very slow)

This is where approximation algorithms, or sketches, come in. Sketches are probabilistic algorithms which can generate approximate results efficiently, within mathematically provable error bounds. There are many of them out there, but today we're just going to focus on one: HyperLogLog, or HLL. HLL is very successful at estimating the number of unique elements in a list. First we'll look at some of the internals of HLL to understand why the HLL algorithm is useful for solving the distinct count problem in a scalable way, and how it can be applied in a distributed fashion. Then we will see some examples of HLL usage.
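As a preview of the usage side, here is a minimal sketch with the postgresql-hll extension; the page_views table is hypothetical, and the function names (hll_hash_text, hll_add_agg, hll_union_agg, hll_cardinality) are the ones documented for that extension.

```sql
CREATE EXTENSION hll;

-- Keep one HLL sketch of distinct users per day instead of raw user IDs.
-- Sketches union cheaply, so combining days (or shards on different
-- workers) is just hll_union_agg followed by a cardinality estimate.
SELECT hll_cardinality(hll_union_agg(users)) AS approx_distinct_users
FROM (
    SELECT date_trunc('day', view_time) AS day,
           hll_add_agg(hll_hash_text(user_id::text)) AS users
    FROM page_views
    GROUP BY 1
) daily;
```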

Keep reading
