Diary of an Engineer: Improved Citus metadata sync for Postgres in 11.3

One of the most important improvements in Citus 11.3 is that Citus offers more reliable metadata sync. Before 11.3, when a Citus cluster had thousands of distributed objects (such as distributed tables), Citus occasionally experienced memory problems while running metadata sync. Due to these memory errors, some users with very large numbers of tables were sometimes unable to add new nodes or upgrade beyond Citus 11.0.

To address the memory issues, we added an alternative "non-transactional" mode to the current metadata sync in Citus 11.3.

The default mode for metadata sync is still the original single transaction mode that we introduced in Citus 11.0. But now in 11.3 or later, if you have a very large number of tables and you run into the memory error, you can choose to optionally switch to the non-transactional mode, which syncs the metadata via many transactions. While most of you who use Citus will not need to enable this alternative metadata sync mode, this is how to do it:

SET citus.metadata_sync_mode TO 'nontransactional';

It is also recommended to switch back to the default mode after a successful metadata sync like below:

SET citus.metadata_sync_mode TO 'transactional';

A common use case in which you might encounter memory problems during metadata sync is when you are using Citus for its excellent time-series capabilities, including time-partitioning and seamless partition scaling. With time series data and distributed PostgreSQL, you often end up with tens of thousands of tables containing hundreds of thousands of shards, resulting in an excessive number of distributed objects that need to be synchronized across worker nodes. This large-scale sync process can lead to Postgres memory problems during the Citus metadata sync.

This blog post will cover the fundamentals of Citus metadata synchronization, explore some of the memory problems related to it, and finally explain how we solved the problems in Citus 11.3:

Importance of metadata sync
When does Citus sync the metadata?
How did we prevent high memory consumption?
How did we prevent Postgres's hard memory error?
An alternative metadata sync added to 11.3 as an option (turned off by default)
Performance of the new alternative metadata sync
Future plans & possible improvements
Corner cases and gotchas

NOTICE: This post contains some technical details on how we made metadata sync more reliable. Hence "diary of an engineer" in the the title of this blog post. If you're a Citus user and don't need the implementation details, feel free to skip sections 3 and 4.

Importance of metadata sync

First a bit of history. The 11.0 open source release of the Citus extension to Postgres is groundbreaking in that it eliminates the bottleneck of queries being served only from the coordinator. After 11.0, all nodes in the cluster can be utilized to their full potential, enabling any insert, update, or select query to be served by any node. This "query from any node" capability is made possible by Citus syncing the meta-information about the cluster from the coordinator to all other nodes in the Citus database cluster.

Prior to 11.0, the metadata only existed on the coordinator. Then, starting in 11.0, Citus would sync the metadata from the coordinator to all nodes. You can consider the metadata—which is shared among all nodes—as the source of truth about the cluster. For example, a distributed table is divided into multiple shards, with 32 being the default. The number of shards is a part of the metadata. How does Citus find the location of a shard? The Citus metadata tells us!

When does Citus need to sync the metadata?

There are 2 situations where Citus runs a metadata sync:

citus_add_node: Whenever you choose to add new nodes to the cluster to scale reads and writes, Citus will sync the metadata.
citus_sync_metadata_to_all_nodes: Whenever you upgrade your Citus version—starting from Citus 11.0 or later—Citus will automatically sync the metadata at the coordinator to all active workers.

How did we prevent high memory consumption?

Prior to 11.3, when a Citus cluster contained very large numbers of distributed objects, particularly numerous partitions, running out of memory used to be highly probable during metadata sync. Below is how it looks like when Postgres throws a memory error since the metadata sync consumes a lot of memory:

SELECT citus_add_node(<ip>, <port>);
ERROR:  out of memory

SELECT start_metadata_sync_to_all_nodes();
ERROR:  out of memory

Citus uses Postgres's memory allocation and deallocation API, which is referred to as MemoryContext.

All Citus allocations and deallocations are performed on one of the special memory areas managed by the memory API.
One of these areas is called TopTransactionContext, which is utilized for memory allocations following the start of a transaction.
All allocations made during a transaction are released by Postgres at the conclusion of the transaction.

For metadata sync to succeed, Citus must generate and propagate the necessary commands (DDLs) for each distributed object from the coordinator to the workers to complete the object's synchronization:

As Citus synchronizes metadata within a single transaction, TopTransactionContext may quickly become too large, leading to the OOM killer killing the session or an out of memory error being thrown by Postgres.
Although Postgres normally frees all allocated memory inside TopTransactionContext at the end of the transaction, this may be too late for metadata sync to succeed without encountering a memory crash or error.

To solve this memory issue in 11.3, instead of waiting for the end of the transaction to free the memory that is allocated during metadata sync, we now release the memory after syncing each object.

You can grasp the overall idea better via these pseudo-code blocks

Before Citus 11.3 (below): waiting until the end of the transaction to free all allocated memory during metadata sync

startTransaction() # all memory allocation automatically done at TopTransactionContext

for obj in allObjects:
        ddlCommands = generateDDLsForObject(obj)
        syncObject(ddlCommands)

endTransaction() # Postgres automatically frees TopTransactionContext

After Citus 11.3, with the new alternative metadata sync (below): releasing memory after syncing each object

startTransaction()
txContext = createTransactionContext() # create our own memory context
switchIntoContext(txContext) # switch to our own context

for obj in allObjects:
        ddlCommands = generateDDLsForObject(obj)
        syncObject(ddlCommands)
        txContext.freeMemory() # we free memory without waiting for the end of transaction

switchIntoContext(TopTransactionContext) # back to Postgres TopTransactionContext
endTransaction()

How did we prevent Postgres' hard memory error?

In the scenario before Citus 11.3 where you had a very large number of distributed tables in your Citus cluster, metadata sync used to send many DDL commands to workers inside a single transaction. Having many DDL commands in a single transaction like this can lead to a common and frustrating issue related to a hard memory limit caused by performing many DDL commands inside a single transaction.

An example of the error thrown by Postgres in this situation is below (this is different from the "out of memory" error shown earlier in the post)

SELECT citus_add_node(<ip>, <port>);
ERROR:  invalid memory alloc size 1073741824

SELECT start_metadata_sync_to_all_nodes();
ERROR:  invalid memory alloc size 1073741824

Explanation of the ERROR: Postgres creates a cache area in heap memory for each session to avoid lock contention overhead when accessing shared buffers. Each session stores cache invalidation messages for any modification it makes to any table in a local queue and then broadcasts the messages to other concurrent sessions after committing the changes. This allows others to refresh their caches and prevent them from accessing stale data.

The problem arose during metadata sync when executing many DDL commands on worker nodes, because Postgres imposes a hard limit on the memory usage by invalidation messages. Since it is expected that some clusters may have thousands of objects in metadata, they may not be able to add a new worker or upgrade beyond Citus 11.0 to scale reads and writes due to the memory error.

An alternative metadata sync added to 11.3 as an option

A new metadata sync mode, known as the non-transactional sync mode, has been introduced to address Postgres's hard memory error. With this new/optional mode, Citus completes metadata sync with multiple transactions rather than a large single transaction.

The default metadata sync mode is still the original metadata sync, what we call the "single transaction" mode. And if you do not have thousands of distributed objects in your cluster, you should never have to choose the new, non-transactional mode. However, if Citus 11.3 or later encounters the hard memory error during metadata sync, you can now enable the new non-transactional mode to solve the problem.

Here is an example action to switch into the non-transactional mode after hard memory error during metadata sync while adding a new node to the cluster:

SELECT citus_add_node(<ip>, <port>);
ERROR:  invalid memory alloc size 1073741824
SET citus.metadata_sync_mode TO 'nontransactional';
SELECT citus_add_node(<ip>, <port>);

Note: If an error occurs in the middle of the new, non-transactional metadata sync, it could leave the cluster in an inconsistent state where some nodes have incomplete metadata. However, the non-transactional mode is designed to sync each object idempotently, which means you can safely rerun the sync to completion after any failure.

In the example below, the alternative non-transactional metadata sync did not complete due to a disconnected session—and you can see how you would then open a new session to run the metadata sync again to get it to complete:

SET citus.metadata_sync_mode TO 'nontransactional';
SELECT citus_add_node(<ip>, <port>);
<Session is disconnected>

<Open a new session>
SET citus.metadata_sync_mode TO 'nontransactional';
SELECT citus_add_node(<ip>, <port>);

TIP: It is always recommended to switch back into the default mode (single transaction mode) after a successful metadata sync, as shown below:

SET citus.metadata_sync_mode TO 'transactional';

Performance of the new alternative metadata sync

When implementing 11.3, we anticipated that the alternative metadata sync (non-transactional mode) would be slower than the default/original mode. Why? Because the alternative mode generates a transaction for each distributed object during synchronization, with each transaction causing additional write-ahead-log (WAL) IO. Even though our primary objective with this release is to eliminate any memory issues during metadata sync, we did not want to degrade the performance of the default metadata sync mode.

So our goal was: The non-transactional mode should perform in an acceptable period so that it can be functional. With this in mind, we conducted a simple benchmark to confirm our expectations.

Here is our benchmark result on an Azure Cosmos DB for PostgreSQL cluster (aka Citus on Azure, formerly called Hyperscale (Citus)) with 2 memory optimized workers, each with a shared buffer size of 32GB, containing 15.000 distributed tables and 480.000 shards:

Sync Mode and Citus Version	Duration(min)
Metadata sync before Citus-11.3	20:20.655
Metadata sync after Citus-11.3	11:47.172
Alternative Non-transactional sync in Citus-11.3	23:25.387

Future plans & possible improvements

As of 11.3, Citus does not have any rules or heuristics in place to detect a metadata sync failure and automatically switch to the alternative, non-transactional mode. Implementing such rules or smarter heuristics could prevent unnecessary WAL writes and improve the total sync time
Execution time is not yet optimized for non-transactional sync. We aim at lowering total execution time for it

Corner cases and potential issues

By default in Citus 11.3, metadata sync mode is transactional
You will only need to switch into the non-transactional sync mode if the default mode fails due to memory failures
It is recommended to switch back into the default mode after a successful metadata sync
If any error occurs during the non-transactional sync, it could lead to an inconsistent state in the cluster until metadata sync runs successfully. To monitor a node that has not been synced, check the metadatasynced flag in pg_dist_node
The non-transactional mode is designed to be idempotent, which means you can run it multiple times, even if the operation fails in the middle for some reason
The non-transactional mode cannot be executed inside a transaction block
The non-transactional mode is expected to be slower than the default mode since each transaction causes WAL (write-ahead-log) writing IO

To learn more about 11.3 and the non-transactional metadata sync:

And to get started distributing PostgreSQL with the Citus extension:

Written by Aykut Bozkurt

Former computer engineer on the Citus team at Microsoft, MsC in Parallel Programming from Bogazici University. Fond of distributed systems. Likes any kind of sports.