Written by Aykut Bozkurt
June 16, 2023
One of the most important improvements in Citus 11.3 is that Citus offers more reliable metadata sync. Before 11.3, when a Citus cluster had thousands of distributed objects (such as distributed tables), Citus occasionally experienced memory problems while running metadata sync. Due to these memory errors, some users with very large numbers of tables were sometimes unable to add new nodes or upgrade beyond Citus 11.0.
To address the memory issues, we added an alternative "non-transactional" mode to the current metadata sync in Citus 11.3.
The default mode for metadata sync is still the original single transaction mode that we introduced in Citus 11.0. But now in 11.3 or later, if you have a very large number of tables and you run into the memory error, you can optionally switch to the non-transactional mode, which syncs the metadata via many transactions. While most of you who use Citus will not need to enable this alternative metadata sync mode, this is how to do it:
SET citus.metadata_sync_mode TO 'nontransactional';
It is also recommended to switch back to the default mode after a successful metadata sync, as shown below:
SET citus.metadata_sync_mode TO 'transactional';
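To double-check which mode is currently in effect in your session, you can inspect the setting like any other Postgres configuration parameter:

SHOW citus.metadata_sync_mode;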
A common use case in which you might encounter memory problems during metadata sync is when you are using Citus for its excellent time-series capabilities, including time-partitioning and seamless partition scaling. With time series data and distributed PostgreSQL, you often end up with tens of thousands of tables containing hundreds of thousands of shards, resulting in an excessive number of distributed objects that need to be synchronized across worker nodes. This large-scale sync process can lead to Postgres memory problems during the Citus metadata sync.
This blog post will cover the fundamentals of Citus metadata synchronization, explore some of the memory problems related to it, and finally explain how we solved the problems in Citus 11.3.
NOTICE: This post contains some technical details on how we made metadata sync more reliable. Hence "diary of an engineer" in the title of this blog post. If you're a Citus user and don't need the implementation details, feel free to skip sections 3 and 4.
First a bit of history. The 11.0 open source release of the Citus extension to Postgres is groundbreaking in that it eliminates the bottleneck of queries being served only from the coordinator. After 11.0, all nodes in the cluster can be utilized to their full potential, enabling any insert, update, or select query to be served by any node. This "query from any node" capability is made possible by Citus syncing the meta-information about the cluster from the coordinator to all other nodes in the Citus database cluster.
Prior to 11.0, the metadata only existed on the coordinator. Then, starting in 11.0, Citus would sync the metadata from the coordinator to all nodes. You can consider the metadata—which is shared among all nodes—as the source of truth about the cluster. For example, a distributed table is divided into multiple shards, with 32 being the default. The number of shards is a part of the metadata. How does Citus find the location of a shard? The Citus metadata tells us!
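If you want to peek at this metadata yourself, you can query it with SQL. Here is a minimal sketch, assuming a hypothetical distributed table called events (the table name and columns are made up for illustration); the citus_shards view shows which node hosts each shard:

CREATE TABLE events (device_id bigint, event_time timestamptz, payload jsonb);
SELECT create_distributed_table('events', 'device_id');

-- Each shard of the distributed table is mapped to the worker node that hosts it
SELECT shardid, nodename, nodeport
FROM citus_shards
WHERE table_name = 'events'::regclass;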
There are 2 situations where Citus runs a metadata sync:

1. citus_add_node: Whenever you choose to add new nodes to the cluster to scale reads and writes, Citus will sync the metadata.
2. start_metadata_sync_to_all_nodes: Whenever you upgrade your Citus version (starting from Citus 11.0 or later), Citus will automatically sync the metadata at the coordinator to all active workers.

Prior to 11.3, when a Citus cluster contained very large numbers of distributed objects, particularly numerous partitions, running out of memory was highly probable during metadata sync. Below is what it looks like when Postgres throws a memory error because the metadata sync consumes a lot of memory:
SELECT citus_add_node(<ip>, <port>);
ERROR: out of memory
SELECT start_metadata_sync_to_all_nodes();
ERROR: out of memory
Citus uses Postgres's memory allocation and deallocation API, which is referred to as MemoryContext. One of these memory contexts is TopTransactionContext, which is utilized for memory allocations following the start of a transaction. For metadata sync to succeed, Citus must generate and propagate the necessary commands (DDLs) for each distributed object from the coordinator to the workers to complete the object's synchronization:

- When a cluster contains a large number of distributed objects, TopTransactionContext may quickly become too large, leading to the OOM killer killing the session or an out of memory error being thrown by Postgres.
- Although Postgres automatically frees TopTransactionContext at the end of the transaction, this may be too late for metadata sync to succeed without encountering a memory crash or error.

To solve this memory issue in 11.3, instead of waiting for the end of the transaction to free the memory that is allocated during metadata sync, we now release the memory after syncing each object.
Before Citus 11.3 (below): waiting until the end of the transaction to free all allocated memory during metadata sync
startTransaction() # all memory allocation automatically done at TopTransactionContext
for obj in allObjects:
ddlCommands = generateDDLsForObject(obj)
syncObject(ddlCommands)
endTransaction() # Postgres automatically frees TopTransactionContext
After Citus 11.3, with the new alternative metadata sync (below): releasing memory after syncing each object
startTransaction()
txContext = createTransactionContext() # create our own memory context
switchIntoContext(txContext) # switch to our own context
for obj in allObjects:
ddlCommands = generateDDLsForObject(obj)
syncObject(ddlCommands)
txContext.freeMemory() # we free memory without waiting for the end of transaction
switchIntoContext(TopTransactionContext) # back to Postgres TopTransactionContext
endTransaction()
In the scenario before Citus 11.3 where you had a very large number of distributed tables in your Citus cluster, metadata sync used to send many DDL commands to workers inside a single transaction. Executing that many DDL commands inside a single transaction can hit a hard memory limit that Postgres imposes, which is a common and frustrating issue.

An example of the error thrown by Postgres in this situation is below (this is different from the "out of memory" error shown earlier in the post):
SELECT citus_add_node(<ip>, <port>);
ERROR: invalid memory alloc size 1073741824
SELECT start_metadata_sync_to_all_nodes();
ERROR: invalid memory alloc size 1073741824
Explanation of the ERROR: Postgres creates a cache area in heap memory for each session to avoid lock contention overhead when accessing shared buffers. Each session stores cache invalidation messages for any modification it makes to any table in a local queue and then broadcasts the messages to other concurrent sessions after committing the changes. This allows other sessions to refresh their caches and prevents them from accessing stale data.
The problem arose during metadata sync when executing many DDL commands on worker nodes, because Postgres imposes a hard limit on the memory used by these invalidation messages. Since some clusters are expected to have thousands of objects in metadata, they might be unable to add a new worker or upgrade beyond Citus 11.0 to scale reads and writes due to this memory error.
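To get a rough idea of how much metadata your own cluster would have to sync, you can count the Citus tables and shards tracked in the metadata, for example:

-- one row per Citus-managed table
SELECT count(*) FROM pg_dist_partition;

-- one row per shard across all of those tables
SELECT count(*) FROM pg_dist_shard;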
A new metadata sync mode, known as the non-transactional sync mode, has been introduced to address Postgres's hard memory error. With this new/optional mode, Citus completes metadata sync with multiple transactions rather than a large single transaction.
The default metadata sync mode is still the original one, which we call the "single transaction" mode. And if you do not have thousands of distributed objects in your cluster, you should never have to choose the new, non-transactional mode. However, if Citus 11.3 or later encounters the hard memory error during metadata sync, you can now enable the new non-transactional mode to solve the problem.
Here is an example of switching to the non-transactional mode after hitting the hard memory error during metadata sync while adding a new node to the cluster:
SELECT citus_add_node(<ip>, <port>);
ERROR: invalid memory alloc size 1073741824
SET citus.metadata_sync_mode TO 'nontransactional';
SELECT citus_add_node(<ip>, <port>);
Note: If an error occurs in the middle of the new, non-transactional metadata sync, it could leave the cluster in an inconsistent state where some nodes have incomplete metadata. However, the non-transactional mode is designed to sync each object idempotently, which means you can safely rerun the sync to completion after any failure.
In the example below, the alternative non-transactional metadata sync did not complete due to a disconnected session—and you can see how you would then open a new session to run the metadata sync again to get it to complete:
SET citus.metadata_sync_mode TO 'nontransactional';
SELECT citus_add_node(<ip>, <port>);
<Session is disconnected>
<Open a new session>
SET citus.metadata_sync_mode TO 'nontransactional';
SELECT citus_add_node(<ip>, <port>);
TIP: It is always recommended to switch back into the default mode (single transaction mode) after a successful metadata sync, as shown below:
SET citus.metadata_sync_mode TO 'transactional';
When implementing 11.3, we anticipated that the alternative metadata sync (non-transactional mode) would be slower than the default/original mode. Why? Because the alternative mode generates a transaction for each distributed object during synchronization, with each transaction causing additional write-ahead-log (WAL) IO. Even though our primary objective with this release was to eliminate memory issues during metadata sync, we also did not want to degrade the performance of the default metadata sync mode.
So our goal was: the non-transactional mode should complete within an acceptable amount of time so that it is still practical to use. With this in mind, we conducted a simple benchmark to confirm our expectations.
Here is our benchmark result on an Azure Cosmos DB for PostgreSQL cluster (aka Citus on Azure, formerly called Hyperscale (Citus)) with 2 memory-optimized workers, each with a shared buffer size of 32GB, containing 15,000 distributed tables and 480,000 shards:
| Sync Mode and Citus Version | Duration (min:sec) |
|---|---|
| Metadata sync before Citus 11.3 | 20:20.655 |
| Metadata sync after Citus 11.3 | 11:47.172 |
| Alternative non-transactional sync in Citus 11.3 | 23:25.387 |
Note: To verify that the metadata sync has completed, you can check the metadatasynced flag in pg_dist_node for each worker node.
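For example, a quick check on the coordinator:

-- metadatasynced should be true for every active worker after a successful sync
SELECT nodename, nodeport, hasmetadata, metadatasynced
FROM pg_dist_node;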
To learn more about 11.3 and the non-transactional metadata sync:
And to get started distributing PostgreSQL with the Citus extension: