Shard rebalancing in the Citus 10.1 extension to Postgres

Written by Jelte Fennema-Nio
September 3, 2021

With the 10.1 release of the Citus extension to Postgres, you can now monitor the progress of an ongoing shard rebalance. Plus, you get performance optimizations and some user experience improvements to the rebalancer, too.

Whether you use Citus open source to scale out Postgres, or you use Citus in the cloud, this post is your guide to what’s new with the shard rebalancer in Citus 10.1.

And if you’re wondering when you might need to use the shard rebalancer: the rebalancer is used when you add a new Postgres node to your existing Citus database cluster and you want to move some of the old data to this new node, to “balance” the cluster. There are also times you might want to balance shards across nodes in a Citus cluster in order to optimize performance. A common example of this is when you have a SaaS application and one of your customers/tenants has significantly more activity than the rest.

And if you haven’t yet heard the exciting news: in Citus 10, earlier in 2021, we open sourced the shard rebalancer. My previous blog post explains more about what that change means for you.

Monitoring progress of a shard move

When you distribute a Postgres table with Citus, the table is usually distributed across multiple nodes. Often people refer to this as “sharding” the Postgres table across multiple nodes in a cluster. And in Citus-speak, these smaller components of the distributed table are called “shards”.
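
If you haven’t distributed a table before, a minimal sketch looks like this (the customer_id distribution column is an assumption here; pick whichever column fits your own schema):

-- Turn a regular Postgres table into a distributed Citus table.
-- Citus splits the table into shards and spreads those shards across the worker nodes.
SELECT create_distributed_table('customers', 'customer_id');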

Once you start rebalancing shards, the rebalancer will tell you which shard it’s going to move. One of the most frequent questions some of you ask is: “how long will a shard move take?” The answer depends both on the amount of data on the shard that’s being moved and the speed at which this data is being moved: a shard rebalance might take minutes, hours, or even days to complete.
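
As a quick refresher, kicking off a rebalance after adding a new worker usually looks something like this (the hostname and port here are placeholders for your own node):

-- Register the new worker node with the coordinator.
SELECT citus_add_node('new-worker-hostname', 5432);
-- Start moving shards. This runs until all planned shard moves are done,
-- so run it in its own session and monitor it from another one.
SELECT rebalance_table_shards();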

With Citus 10.1, it’s now easy for you to monitor the progress of the rebalance. While monitoring the progress won’t directly tell you exactly how long the rebalance will take, the good news is that you’ll be able to see that the rebalance is still progressing and roughly how far along it is.

To see the progress, you can use the get_rebalance_progress function. This function already existed in previous Citus versions, but in the Citus 10.1 release it has been made much more useful: get_rebalance_progress will now tell you the size of the shard on both the source and target node! To view this information you can run the following query (in a different session than the one in which you are running the rebalance itself):

SELECT * FROM get_rebalance_progress();
┌───────────┬────────────┬─────────┬────────────┬────────────┬────────────┬────────────┬────────────┬──────────┬───────────────────┬───────────────────┐
│ sessionid │ table_name │ shardid │ shard_size │ sourcename │ sourceport │ targetname │ targetport │ progress │ source_shard_size │ target_shard_size │
├───────────┼────────────┼─────────┼────────────┼────────────┼────────────┼────────────┼────────────┼──────────┼───────────────────┼───────────────────┤
│     13524 │ customers  │  102008 │   46718976 │ localhost  │       9701 │ localhost  │       9702 │        2 │          46686208 │          46718976 │
│     13524 │ orders     │  102024 │   52355072 │ localhost  │       9701 │ localhost  │       9702 │        2 │          52322304 │          52355072 │
│     13524 │ customers  │  102012 │   46628864 │ localhost  │       9701 │ localhost  │       9703 │        2 │          46604288 │          46628864 │
│     13524 │ orders     │  102028 │   52264960 │ localhost  │       9701 │ localhost  │       9703 │        2 │          52232192 │          52264960 │
│     13524 │ customers  │  102016 │   46669824 │ localhost  │       9701 │ localhost  │       9704 │        1 │          46669824 │          46702592 │
│     13524 │ orders     │  102032 │   52297728 │ localhost  │       9701 │ localhost  │       9704 │        1 │          52297728 │                 0 │
│     13524 │ customers  │  102020 │   46702592 │ localhost  │       9701 │ localhost  │       9702 │        0 │          46702592 │                 0 │
│     13524 │ orders     │  102036 │   52338688 │ localhost  │       9701 │ localhost  │       9702 │        0 │          52338688 │                 0 │
└───────────┴────────────┴─────────┴────────────┴────────────┴────────────┴────────────┴────────────┴──────────┴───────────────────┴───────────────────┘

The values in the progress column have the following meaning:
  • 0: Not yet started
  • 1: In progress
  • 2: Finished

Using this knowledge, you can zoom in on the progress of the shard moves that the Citus shard rebalancer is currently doing, using the following queries:

-- To show the progress for each shard that's currently being moved
SELECT
    table_name,
    shardid,
    pg_size_pretty(source_shard_size) AS source_shard_size,
    pg_size_pretty(target_shard_size) AS target_shard_size,
    CASE WHEN shard_size = 0
        THEN 100
        ELSE LEAST(round(target_shard_size::numeric / shard_size * 100, 2), 100)
    END AS percent_completed_estimate
FROM get_rebalance_progress()
WHERE progress = 1;
┌────────────┬─────────┬───────────────────┬───────────────────┬────────────────────────────┐
│ table_name │ shardid │ source_shard_size │ target_shard_size │ percent_completed_estimate │
├────────────┼─────────┼───────────────────┼───────────────────┼────────────────────────────┤
│ customers  │  102013 │ 44 MB             │ 44 MB             │                        100 │
│ orders     │  102029 │ 50 MB             │ 23 MB             │                      45.85 │
└────────────┴─────────┴───────────────────┴───────────────────┴────────────────────────────┘

-- To show the progress for the colocation group that's being moved as a whole
SELECT
   pg_size_pretty(sum(source_shard_size)) AS source_shard_size,
   pg_size_pretty(sum(target_shard_size)) AS target_shard_size,
   CASE WHEN sum(shard_size) = 0
     THEN 100
     ELSE LEAST(round(sum(target_shard_size)::numeric / sum(shard_size) * 100, 2), 100)
   END AS percent_completed_estimate
FROM get_rebalance_progress()
WHERE progress = 1;
┌───────────────────┬───────────────────┬────────────────────────────┐
│ source_shard_size │ target_shard_size │ percent_completed_estimate │
├───────────────────┼───────────────────┼────────────────────────────┤
│ 95 MB             │ 45 MB             │                      47.15 │
└───────────────────┴───────────────────┴────────────────────────────┘

Using the percent_completed_estimate, you can get a rough indication of how long the move of a shard will take by doing some basic math yourself. If the percent_completed_estimate went from 30% to 40% in 10 minutes, then the shard move is progressing at roughly 10% per 10 minutes. Since there’s still 60% left, you will have to wait roughly 60 minutes for the shard move to complete. However, there are some important caveats that can make this time estimate imprecise in some cases:

  • The percent_completed_estimate column in this query is more of a rough indication than a precise progress bar. Why? It’s possible that after the move, the shard will be smaller than before—because by moving the data, a Postgres VACUUM was essentially run on the table. So, the percent_completed_estimate might not be able to reach 100% because the shard is now smaller in size than it was originally.
  • If one shard seems to be stuck, but another shard is still making progress, chances are that the shard that seems stuck is already finished with the data transfer. If you’re wondering why such a shard is still marked as “in progress”: it’s because a shard won’t be marked as “finished” until all of the shards it is colocated with have also been moved successfully.
  • There’s another important thing to keep in mind when you’re looking at the shard rebalancer progress: At the end of the rebalance, indexes are created. Index creation can also take a long time, although the data transfer usually takes longer. Sadly, there’s currently no indication of the progress during index creation in the progress monitor output.

What this means for you: Don’t stop the rebalancer right away if you think there’s no progress, even if all shards are almost finished with the transfer. If you have Postgres indexes on your Citus table, it’s very likely you’ll just have to wait a bit longer until the indexes are finished creating. If you want to be sure that this is what’s going on, the easiest way to confirm is to connect directly to the Citus worker node and look at pg_stat_activity. If there’s a CREATE INDEX query running, then there’s a very good chance that the rebalancer is currently very busy creating your Postgres indexes.
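
A quick way to check for that, run directly on the worker node rather than on the coordinator, is a query along these lines:

-- Show any index builds currently running on this worker.
SELECT pid, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state = 'active'
  AND query ILIKE 'CREATE INDEX%';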

Optimizing how shard rebalancing works with Postgres partitioning

Similar to Postgres query planning, the first thing that the shard rebalancer does when rebalancing is to figure out a plan for where to move each shard. To build this plan, the rebalancer needs the sizes of your shards on disk. In previous versions of Citus, getting these sizes could become very slow when a cluster contained thousands of shards. This situation mostly happened when a cluster contained partitioned tables, because Postgres partitions often cause a lot of shards to be created (one for each partition). As of 10.1, the way the rebalancer gets the shard sizes is optimized, making this operation fast even when there are tons of shards in the cluster.
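
To give a feel for why partitioning multiplies the shard count, here is a rough sketch (the table, partition names, and columns are made up for illustration; with the default citus.shard_count of 32, twelve monthly partitions already add up to 12 × 32 = 384 shards):

-- A time-partitioned table: every partition becomes its own distributed table.
CREATE TABLE events (
    tenant_id bigint,
    created_at timestamptz,
    payload jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2021_01 PARTITION OF events
    FOR VALUES FROM ('2021-01-01') TO ('2021-02-01');
-- ... one partition per month ...

-- Distributing the parent distributes all of its partitions as well,
-- each with its own set of shards.
SELECT create_distributed_table('events', 'tenant_id');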

Checking that there’s space on the target node

Running out of disk is one of the worst things that can happen to a Postgres server. When that happens, Postgres will crash. The only way to start Postgres again is to either get a larger disk or remove unnecessary files, such as log files.

The shard rebalancer often moves big chunks of data across nodes in the distributed cluster. That’s why we added a new safety feature to the shard rebalancer: before actually moving a shard, the Citus rebalancer will first check if there’s enough space on the target node to actually store it (plus some margin for safety).
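
As a sketch of what that looks like in practice (the setting names and defaults below are based on my reading of the Citus rebalancer documentation for this feature, so treat them as an assumption and verify them on your own cluster):

-- Assumed settings: whether the free-space check runs at all, and how much
-- disk should still be free on the target node after the move (in percent).
SHOW citus.check_available_space_before_move;
SHOW citus.desired_percent_disk_available_after_move;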

Deferred dropping of shards by default

Citus 10.1 also solves an annoying issue that some of you may have run into: prior to Citus 10.1, a shard move would sometimes be cancelled and rolled back right when the move was about to finish. Why? After moving all the data, the rebalancer used to drop the shard on the source node. However, when other queries were still reading or writing to this shard, the deletion could result in a distributed deadlock. This deadlock would only happen with highly concurrent transactions that acquired locks in the reverse order within a very short time window.

We have now enabled deferred dropping of shards by default, which makes sure such deadlocks do not occur. Deferred dropping means that instead of dropping the shard on the old node right away, shards are only marked for deletion. Then after a while, only after all the read queries have finished, will the shard actually be dropped. We call such a shard that is waiting to be dropped an “orphaned shard”.
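
If you’re curious whether a cluster currently has orphaned shards, one way to peek at the metadata is the query below. The shardstate value of 4 for “marked for deletion” is my understanding of Citus internals rather than a documented interface, so treat this as a sketch:

-- Shard placements that have been marked for deferred dropping.
SELECT placementid, shardid, groupid
FROM pg_dist_placement
WHERE shardstate = 4;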

This feature is not strictly new in Citus 10.1. It was already possible to enable it in previous versions by setting citus.defer_drop_after_shard_move to true. However, when not used with care, the previous implementation could result in confusing errors. In this release we spent time hardening this feature and improving the error messages so they make sense. That’s why deferred dropping is now turned on by default: we’re confident that it works as well as we want it to.
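
If you’re on an older Citus version, or simply want to be explicit about the behavior, you can toggle the setting mentioned above like any other Postgres configuration parameter, for example:

-- Deferred dropping is on by default as of Citus 10.1.
ALTER SYSTEM SET citus.defer_drop_after_shard_move TO on;
SELECT pg_reload_conf();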

The Citus shard rebalancer in 10.1: happier, faster, and with a way to monitor

With Citus 10.1, you will be much happier when using the shard rebalancer to balance the data sizes across the nodes in your cluster. Your shards will be moved faster. You can see the progress being made. And finally, your shard moves will not be rolled back anymore because Citus avoids the distributed deadlocks I mentioned earlier.

If you want to learn more about the Citus rebalancer (recently made open source!) you might want to check out my previous post or check out the docs on the shard rebalancer and the rebalancer strategies.

If you’re new to Citus, you can find a good collection of getting started resources online here. And as always, if you have any questions, you can find me and the rest of the Citus engineers (and other users, too) on our Citus Slack.

Written by Jelte Fennema-Nio

Postgres and Citus developer at Microsoft. Low latency APIs at Stream. BSc in Computer Science and MSc in System & Network Engineering from U of Amsterdam. Rust. Hot sun. Cold beer.
