Scaling out Postgres with the Citus open source shard rebalancer

Written by Jelte Fennema-Nio
March 13, 2021

One of the main reasons people use the Citus extension for Postgres is to distribute the data in Postgres tables across multiple nodes. Citus does this by splitting the original Postgres table into multiple smaller tables and putting these smaller tables on different nodes. The process of splitting bigger tables into smaller ones is called sharding—and these smaller Postgres tables are called “shards”. Citus then allows you to query the shards as if they were still a single Postgres table.
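
For example, with a hypothetical events table keyed by customer_id (the table and column names here are just illustrative), distributing the table could look like this:

CREATE TABLE events (
    customer_id bigint,
    event_id bigint,
    payload jsonb
);

-- split the table into shards and spread those shards across the worker nodes
SELECT create_distributed_table('events', 'customer_id');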

One of the big changes in Citus 10—in addition to adding columnar storage, and the new ability to shard Postgres on a single Citus node—is that we open sourced the shard rebalancer.

Yes, that’s right, we have open sourced the shard rebalancer! The Citus 10 shard rebalancer gives you an easy way to rebalance shards across your cluster and helps you avoid data hotspots over time. Let’s dig into the what and the how.

Why use the shard rebalancer?

When Citus initially splits a Postgres table into multiple shards, it distributes those shards across the nodes in your cluster (unless, of course, you’re running Citus on a single node). This is done to divide the workload across the different nodes. Over time, both the traffic to the cluster and the amount of data stored in your database usually increase. At some point, you might want to add more nodes to the cluster to lighten the workload on each individual node. This is called “scaling out.”

There’s one problem though: when you first add nodes to an existing Citus cluster, there’s no data on them yet, because all the shards are still on the old nodes. So these new nodes will just sit idle.

This is where shard rebalancing comes in. Shard rebalancing ensures that the shards are distributed fairly across all nodes. The Citus shard rebalancer does this by moving shards from one server to another.
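
For example, assuming you’ve just brought up a new worker at the made-up address 10.0.0.3, you would first register it with the coordinator before rebalancing:

-- add the new (still empty) worker node to the cluster metadata
SELECT citus_add_node('10.0.0.3', 5432);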

To rebalance shards after adding a new node, you can use the rebalance_table_shards function:

SELECT rebalance_table_shards();
Diagram 1: Node C was just added to the Citus cluster, but no shards are stored there yet. All Postgres queries will still only go to Nodes A and B because A and B still contain all the data.
Diagram 2: By running rebalance_table_shards() the data is distributed evenly across the Citus nodes, including the new Node C.

Different rebalancing strategies

By default, Citus divides the shards across the nodes in such a way that every node has the same number of shards. This approach works fine for a lot of workloads. However, when shards have significantly different sizes, this can lead to one node having much more data than another.

One common scenario where you end up with differently-sized shards is with multi-tenant SaaS applications. Most SaaS apps use the customer_id as the sharding key (in our docs we call this the distribution column). If you work on a SaaS app, a few of your customers probably have much more activity and store a lot more data than the rest. Thus, the shards containing the data for these big customers are probably much larger than the other shards in your Citus cluster.
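
If you want to check whether this is happening in your cluster, one way is to look at shard sizes in the citus_shards view (available since Citus 10); the query below is just a sketch:

-- list shards from largest to smallest, with the node each one lives on
SELECT shardid, nodename, pg_size_pretty(shard_size)
FROM citus_shards
ORDER BY shard_size DESC;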

Luckily Citus can use different strategies when rebalancing shards. By default, Citus uses the simple by_shard_count strategy, but in a multi-tenant SaaS scenario like this, you might want to use the by_disk_size strategy instead.
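
If you want to see which rebalance strategies exist in your cluster, and which one is the default, one way is to query the pg_dist_rebalance_strategy metadata table:

SELECT name, default_strategy FROM pg_dist_rebalance_strategy;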

To rebalance the shards such that the size of the shards is taken into account, you can use the following SQL query:

SELECT rebalance_table_shards(rebalance_strategy := 'by_disk_size');
Diagram 3: Nodes A and B both have two shards, but the shards that are stored on Node B contain much more data. This can result in uneven distribution of SQL queries and Node B running out of disk way before Node A does.
Diagram 4: By running rebalance_table_shards(rebalance_strategy := 'by_disk_size') shards are moved around in such a way that all of the Citus nodes store roughly the same amount of data.

By default, Citus comes with the two rebalance strategies we covered (by_disk_size and by_shard_count). You can also add your own rebalance strategies in case these two don’t match what’s needed for your workload. Our Citus docs have various examples of different rebalance strategies. With help from these examples, you can create your own rebalance strategies to:

  1. Isolate a shard to a specific node. For instance, a shard containing an important customer that you’ve acquired dedicated hardware for.
  2. Divide shards based on the number of queries going to each shard. This can be useful if you are bottlenecked on CPU and want to distribute the queries more evenly across servers. Often, balancing by the number of queries results in a distribution similar to by_disk_size, since more queries usually means more data. That isn’t true for all workloads though, and in those cases a custom strategy based on the number of queries can work better.
  3. Make the rebalancer aware of differences in capacity between nodes (see the sketch after this list). This can be useful if half of your nodes have 1 TB disks and the other half have 2 TB disks. In this case you probably want twice the number of shards on the nodes with 2 TB disks.
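
As a rough sketch of that third use case, you could define a node capacity function and register it with citus_add_rebalance_strategy. Everything below is illustrative: it assumes the bigger 2 TB nodes can be recognised by the port they listen on, which you would replace with whatever actually distinguishes your nodes, and the strategy name and threshold values are made up as well:

-- capacity function: nodes listening on port 5433 are assumed to have
-- the bigger 2 TB disks, so they get double the capacity of other nodes
CREATE FUNCTION node_capacity_by_disk(nodeidarg int)
    RETURNS real AS $$
    SELECT
        (CASE WHEN nodeport = 5433 THEN 2 ELSE 1 END)::real
    FROM pg_dist_node
    WHERE nodeid = nodeidarg;
$$ LANGUAGE sql;

-- register a strategy that balances by disk size, while respecting
-- the per-node capacity reported by the function above
SELECT citus_add_rebalance_strategy(
    'by_disk_size_and_capacity',
    'citus_shard_cost_by_disk_size',
    'node_capacity_by_disk',
    'citus_shard_allowed_on_node_true',
    0.1,
    0.01
);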

Shrinking your cluster

There’s one more use case for rebalancing shards that Citus 10 supports by default: you realise your workload could be handled with fewer nodes and you’d like to save some money on server costs. Of course, you won’t want to lose access to any of the data. So before physically turning off any nodes, you will want to move the shards they hold to the servers you plan to keep.

Citus also supports this “scaling in” using citus_drain_node:

SELECT citus_drain_node('10.0.0.1', 5432);
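
Once the drain has finished and the node no longer stores any shards, you can remove it from the cluster metadata (using the same example address) before turning the server off:

-- remove the drained node from the Citus metadata
SELECT citus_remove_node('10.0.0.1', 5432);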

Understanding what the shard rebalancer does

Citus does shard rebalancing in two phases. In the first phase, Citus generates a plan. This plan contains the moves that are needed to divide the shards fairly across the nodes.

Just like you can use EXPLAIN to understand what a PostgreSQL query will do without executing it, you can see what the Citus shard rebalancer will do without moving any data, using get_rebalance_table_shards_plan:

SELECT get_rebalance_table_shards_plan();

Or, if you want to see what would happen with a different rebalance strategy, you can get the plan for that strategy too:

SELECT get_rebalance_table_shards_plan(
    rebalance_strategy := 'by_disk_size'
);

Then, during the second phase of rebalancing, Citus moves the shards one by one, according to the generated plan. You can still read from a shard while it is being moved, but writing to it is blocked (writes to other shards can continue normally). If you are using Citus on Azure, you can also continue writing to the shard that’s being moved, because we are able to use some extra tricks in Azure. (Update, October 2022: good news, as of Citus 11.0 the Citus open source shard rebalancer is also non-blocking for writes.)
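
While the shards are being moved, you can follow along from another session. One way to do that is the get_rebalance_progress function, which shows which shards are currently being moved and how far along each move is:

SELECT * FROM get_rebalance_progress();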

If you want to move a shard to a specific node yourself, instead of according to the rebalance plan, you can do that using the citus_move_shard_placement function. There’s a downside to manually moving shards though: running rebalance_table_shards could undo this manual change without you realising it. To avoid this, it’s recommended to create your own rebalance strategy, so that the shard rebalancer knows you want a particular shard on a specific node.
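
As a sketch, a manual move looks something like the following; the shard id and host names here are made up, and in practice you’d look up the shard id in a view such as citus_shards:

-- move one specific shard from one worker to another
SELECT citus_move_shard_placement(
    102008,
    '10.0.0.1', 5432,
    '10.0.0.2', 5432
);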

An important part of your Postgres toolbox for scaling out

You won’t need the shard rebalancer when you’re first getting started with Citus to scale out Postgres. But it’s good to know that the Citus shard rebalancer is there for you, because at some point as your application grows, you will likely want to rebalance.

Sometimes people talk about things like database hygiene or Postgres toolboxes. Well, the Citus shard rebalancer should be part of your toolbox because it ensures your Citus cluster continues to perform well over time. And now that—big news!—we’ve open sourced the shard rebalancer, Citus 10 gives you a very easy way to scale and grow your Citus cluster.
