Introducing Landlord: per tenant stats in Postgres with Citus

Written by Craig Kerstiens
July 31, 2018

Postgres keeps getting better and better. In recent years, the Postgres community has added JSONB support, improved performance, and added so many usability enhancements. The result: you can work even more powerfully with your database. Over the past 8 years, my favorite two enhancements have been JSONB and pg_stat_statements. Pg_stat_statements is a built-in extension that allows you to get high level insights into queries that are being run as well as their performance—without having to be an expert and without needing a PhD in databases.

Introducing the new landlord feature in Citus 7.5

With Citus 7.5, we've gone one step beyond the awesomeness of pg_stat_statements and Postgres, with the new landlord feature in Citus—to give you per-tenant stats.

If you're not (yet) familiar with our Citus distributed database, we serve a number of use cases from transactional SaaS applications (often for multi-tenant apps that need to grow, and grow, and grow) to powering real-time analytics for customer-facing dashboards.

Many of our multi-tenant customers often ask questions like:

  • Which tenant is most active?
  • How many of my queries are spanning shards?
  • Are any tenants creating load hotspots for me?

With the new landlord feature in Citus (citus_stat_statements), you can get clear insight into what your tenants are doing.

Getting started with landlord in Citus

Landlord parameterizes, like pg_stat_statements every query that is run against your database, how long it takes, and more information various internal activities of each query. Let's start at the high level by running a SELECT * to citus_stat_statements:

FROM citus_stat_statements;
  queryid   | userid | dbid  |                          query                          |   executor    | partition_key | calls 
 1175789717 |     10 | 12558 | select count(*) from emails;                            | real-time     |               |     1
 1175789717 |     10 | 12558 | select count(*) from emails_body;                       | task-tracker  |               |     1
 2120884273 |     10 | 12558 | select * from emails where org_id = $1                  | router        | 50000         |     2
 3589489629 |     10 | 12558 | insert into emails select * from emails_body;           | insert-select |               |     1
 3939672433 |     10 | 12558 | select * from emails_body where updated_at > $1         | real-time     |               |     1

Here you can see the queries that have been run, along with the executor type that was used, and the partition_key which is the shard or tenant key (if it was a router query).

From this data, we can create some really powerful reports that give us a lot of insight both about our overall application structure, as well as information on the load specific customers are presenting on the database.

Want to get the breakdown of how many cross shard queries you have vs. queries targeting a single shard?

SELECT to_char((cross_shard * 100) / total, '99D99') as cross_shard, 
       to_char((shard * 100) / total, '99D99') as shard 
    SELECT sum(calls) as total, 
           sum(calls) FILTER (WHERE partition_key is NULL) as cross_shard, 
           sum(calls) FILTER (WHERE partition_key is not null) as shard 
    FROM citus_stat_statements) 
cross_shard | shard  
17.23      |  82.77

For a multi-tenant application, queries that span across shards mean unnecessary network calls and slower performance. This query gives us a clear view that 17% of our queries are cross-shard and could potentially be optimized. By leveraging the Citus config variable, you can from here begin to error or alert on queries that are cross shard, and gradually start removing them from your app or modifying them accordingly.

But we can do more than just tackle things from an application wide perspective...

Which tenants are the noisy neighbors in your multi-tenant database?

With Citus we already could give you insights into which of your customers were storing the most data, creating the largest shards by proxy, and then allow you in a fully online fashion (that means zero downtime) to relocate those shards to less busy nodes in the Citus database cluster.

But sometimes, just because a tenant is occupying the most space in the distributed database, doesn't mean they're the most disruptive.

Have you ever lived in an apartment building where one tenant constantly had guests coming and going? It may not have been in the penthouse, maybe instead the tiny studio kid fresh out of college? Music was regularly playing loudly? And sometimes you weren't even sure who lived there because of everyone coming and going.

Think of this noisy neighbor in the apartment as the tenant in a multi-tenant database with lots of really frequent in and out queries creating disruption. Or it could be really long intensive queries running that are disrupting the performance of all the other tenants. The gist is that this one tenant is putting too much strain on the database.

With the landlord feature in Citus, because we record the tenant_id with each query, you can easily find which tenants are running the most queries, and which tenants are consuming the maximum amount of system time:

SELECT partition_key as tenant_id, 
       count(*) as tenant_unique_queries, 
       sum(calls) as tenant_total_queries,
       sum(total_time) as total_query_time 
FROM citus_stat_statements
WHERE partition_key is not null
GROUP BY tenant_id
ORDER BY tenant_total_queries DESC

 tenant_id | tenant_unique_queries | tenant_total_queries | total_query_time
 12        |                   148 |               159295 |        753142.54
 2         |                  2045 |                23846 |         12957.83
 1         |                    74 |                 9572 |          8492.05
 634       |                   175 |                12753 |          6876.11
 361       |                   375 |                 3653 |          6422.93
(5 rows)

Some results of this aren't too much of a surprise. Our tenant_id number 1 is our internal org where we dogfood our own product. Id 2 is our QA teams org. The unique queries here is especially high because they tend to test every area of the product including some longer running ones. Ids number 12, 634, and 361 are all top customers so it is not surprising they're in our top in terms of overall usage. But as you can see customer with id number 12 is consuming a far heavier amount of system resources than any other customer.

With this insight you have a few options, you could:

  1. Isolate that tenant to their own node.
  2. Performance tune by adding indexes just for that customer to give them a better experience.

You can adapt the above query to show you how many are leveraging your product based on the diversity of queries they're running, or look at the total queries alone to get an idea of who is most active.

Here’s to better management of your tenants (including your noisy tenants!)

Building a SaaS app you have plenty of challenges. But you shouldn't have to spend all your cycles debugging and troubleshooting performance of your app based on how your customers use your product. With landlord you can get the insights you need, then seamlessly isolate tenants or rebalance then to more equally distribute your load. And of course when you need more resources with Citus it can be as simple as dragging a slider.

You can try landlord today by upgrading your existing Citus cluster to our newly released version 7.5.

Craig Kerstiens

Written by Craig Kerstiens

Former Head of Cloud at Citus Data. Ran product at Heroku Postgres. Countless conference talks on Postgres & Citus. Loves bbq and football.