v11.1 (Sep 2022)

What’s new in Citus 11.1?

Welcome to the 11.1 release of the Citus database. This “Updates” page dives deep into many of the changes in Citus 11.1 extension to PostgreSQL, including these headline features:

Read the release blog post: After reviewing this Updates page, if you want to read more about Citus 11.1, Marco’s blog post gives a good big-picture overview.

Citus 11.1 release party: A few of the Citus database engineers on our team at Microsoft livestreamed the 2nd-ever “Release Party”, to discuss (informally) what’s new in this 11.1 release and to show you 3 new, live demos of what you can now do. Watch the replay on YouTube.

create_distributed_table_concurrently: distribute tables without blocking writes

One of the superpowers of Citus is creating distributed tables using create_distributed_table. Prior to Citus 11.1, the operation was interrupting the application by blocking modifications (DML commands) on the table. PR #6087 introduces create_distributed_table_concurrently which avoids downtime for both reads and writes.

create_distributed_table_concurrently uses logical replication (and shard splits) as the underlying infrastructure. While moving data with logical replication, if the underlying table doesn’t have a primary key or unique index, UPDATE and DELETE operations would fail. The same limitation applies to create_distributed_table_concurrently. REPLICA IDENTITY is used to uniquely identify the rows which are updated or deleted. If the table undergoing distribution does not have a REPLICA IDENTITY, then concurrent UPDATE andDELETE would fail but INSERT would succeed.

There are few things to remember regarding create_distributed_table_concurrently:

  • If the table undergoing distribution does not have a REPLICA IDENTITY, then concurrent UPDATE and DELETE would fail but INSERT would succeed.
CREATE TABLE test(tenant_id int);
SELECT create_distributed_table_concurrently('test','tenant_id');
NOTICE:  relation test does not have a REPLICA IDENTITY or PRIMARY KEY
DETAIL:  UPDATE and DELETE commands on the relation will error out during create_distributed_table_concurrently unless there is a REPLICA IDENTITY or PRIMARY KEY. INSERT commands will still work.
 create_distributed_table_concurrently
---------------------------------------------------------------------

(1 row)
  • create_distributed_table_concurrently cannot be executed within transaction block.
CREATE TABLE test(tenant_id int);
BEGIN;
SELECT create_distributed_table_concurrently('test','tenant_id');
ERROR:  create_distributed_table_concurrently cannot run inside a transaction block
ROLLBACK;
  • create_distributed_table_concurrently does not support foreign keys to other tables
CREATE TABLE regular_table(id int);
CREATE TABLE dist_table(id int primary key);
ALTER TABLE regular_table add constraint fkey foreign key (id) references dist_table(id);
select create_distributed_table_concurrently('dist_table', 'id');
ERROR:  relation dist_table is referenced by a foreign key from regular_table
  • No concurrent executions of multiple create_distributed_table_concurrently.

Online tenant isolation

Citus already has the ability to isolate tenants to separate shards (see details) but prior to Citus 11.1, concurrent modifications were blocked during the operation. With the Pull Request #6167 in 11.1, we now support the isolation of a tenant without blocking any concurrent modifications. As of 11.1, we also switched using the shard splits as the underlying infrastructure. Thus, the current API for tenant isolation adds an extra parameter of shard transfer mode which affects the concurrent modification mode. It supports 3 different modes:

  • block_writes: As the name suggests, when shard transfer mode is specified as ‘block_writes’, isolation of the tenant blocks any concurrent modification.
CREATE TABLE test(tenant_id int);
SELECT create_distributed_table('test', 'tenant_id');
SELECT isolate_tenant_to_new_shard('test', 5, shard_transfer_mode => 'block_writes');
  • force_logical: When shard transfer mode is specified as ‘force_logical’, any concurrent modification can be executed during tenant isolation. It internally uses Postgres’ logical replication. In case there is no primary key or replica identity defined for the table, concurrent deletes and updates to the isolated tenant fail even if concurrent inserts to it still succeed.
CREATE TABLE test(tenant_id int);
SELECT create_distributed_table('test', 'tenant_id');
SELECT isolate_tenant_to_new_shard('test', 5, shard_transfer_mode => 'force_logical');
  • auto (default mode): This mode operates exactly the same with ‘force_logical’ except that it fails if the table does not have any primary key or replica identity.
CREATE TABLE test(tenant_id int);
SELECT create_distributed_table('test', 'tenant_id');
SELECT isolate_tenant_to_new_shard('test', 5, shard_transfer_mode => 'auto');

Split Shards

A distributed table is created with a fixed number of shards count to start with. Over time with changes in workload and storage growth, some of you may want the flexibility to split an existing shard range and redistribute the data to new shards. Furthermore, doing this is in a non-blocking fashion (allowing DML on the shard) and controlling the placement of the new split shards is critical. This is now possible with the new citus_split_shard_by_split_points UDF added in Citus 11.1.

Example

CREATE TABLE test(key int);
SET citus.shard_count TO 4;
SELECT create_distributed_table('test', 'key');

Select * from pg_dist_shard;
 logicalrelid | shardid | shardstorage | shardminvalue | shardmaxvalue
--------------+---------+--------------+---------------+---------------
 test         |  102059 | t            | -2147483648   | -1073741825
 test         |  102060 | t            | -1073741824   | -1
 test         |  102061 | t            | 0             | 1073741823
 test         |  102062 | t            | 1073741824    | 2147483647
(4 rows)

SELECT * from citus_split_shard_by_split_points(
    -- shard id to split, see pg_dist_shard for shard ids
    102061,

    -- split point(s) in the shard range, see pg_dist_shard for shard ranges
    ARRAY['10000000', '536870911'],

    -- node ids to place the new shards, see pg_dist_node to see the nodeids
    ARRAY[1, 2, 1],

    -- split mode, see below
    'force_logical'
);

select * from pg_dist_shard;
 logicalrelid | shardid | shardstorage | shardminvalue | shardmaxvalue
--------------+---------+--------------+---------------+---------------
 test         |  102059 | t            | -2147483648   | -1073741825
 test         |  102060 | t            | -1073741824   | -1
 test         |  102062 | t            | 1073741824    | 2147483647
 test         |  102063 | t            | 0             | 10000000
 test         |  102064 | t            | 10000001      | 536870911
 test         |  102065 | t            | 536870912     | 1073741823

In the example above, Parent shard with shard id ‘102061’ and hash range (0, 1073741823) was split to 3 shards:

  1. ‘102063’ with hash range(0, 10000000)

  2. ‘102064’ with hash range (10000001, 536870911)

  3. ‘102065’ with hash range (536870912, 1073741823)

The new shards are placed on the nodes with ids 1, 2 and 1 respectively, see pg_dist_node for node ids. split_mode has the same semantics with rebalancer’s shard_transfer_mode.

Notable pull requests for Citus shard split

  1. PR 6029 This PR introduces the citus_split_shard_by_split_points UDF with a blocking split implementation. A new internal Copy API worker_shard_copy is added to facilitate push based local/remote data copy.

  2. PR 6080 This PR adds support for non-blocking split in citus_split_shard_by_split_points UDF. This is done by leveraging Postgre’s logical replication infrastructure to stream ongoing DMLs from the splitted shard to the new shards.

PostgreSQL 15 Beta4 support—and ready for PostgreSQL 15

Because the Citus database is implemented as an extension to Postgres, as the Postgres community starts tagging new major versions, Citus team starts integrating the Citus base to work with the new major version. Therefore, Citus 11.1 is compatible with the Postgres 15 beta4 release and is ready for Postgres 15! (As soon as the Postgres 15 GA is released, we plan to create a patch release to Citus 11.1 to support PG15 GA.)

There are many new exciting features in this Postgres major version. You can read about the complete set of these features in the Postgres release notes here. A lot of improvements on query performance, replication, monitoring, etc. just work with Citus, because Citus only needs to learn how to handle SQL interface changes.

It’s out of scope to go over all PG15 features in this blog, hence we will focus on some more prominent ones. With some integration work, the following cool PG15 features are supported in Citus 11.1:

If you want to learn about Citus support for all PG15 features in detail, you can see this issue.

UNIQUE NULLS NOT DISTINCT

Previously NULL values were always indexed as distinct values, but this can now be changed by creating constraints and indexes using UNIQUE NULLS NOT DISTINCT on distributed tables:

-- create a table with two constraints:
-- the first constraint treating NULLS as distinct
-- the second constraint treating NULLS as not distinct
CREATE TABLE nulls_test (
  key int,
  m int,
  n int,
  UNIQUE NULLS DISTINCT (key, m), -- same as UNIQUE (key, m)
  UNIQUE NULLS NOT DISTINCT (key, n)
);

-- distribute the table
SELECT create_distributed_table('nulls_test', 'key');

-- populate the table
INSERT INTO nulls_test
  VALUES
  (1, 1, 1), -- entry with no NULLs
  (1, NULL, 2), -- entry with a NULL for testing NULLs DISTINCT constraint
  (1, 2, NULL); -- entry with a NULL for testing NULLS NOT DISTINCT constraint
INSERT 0 3

-- previously NULL values were always distinct
-- entering the same entry as above for testing NULLS DISTINCT constraint
-- new entry is allowed
INSERT INTO nulls_test VALUES (1, NULL, 3);
INSERT 0 1

-- now we can specify NULLS NOT DISTINCT as above
-- entering the same entry for testing it
-- this entry will not be allowed in the corresponding shard
INSERT INTO nulls_test VALUES (1, 3, NULL);
ERROR:  duplicate key value violates unique constraint "nulls_test_key_n_key_{shard_id}"
DETAIL:  Key (key, n)=(1, null) already exists.

Unlogged sequences

If specified, now distributed sequences can be created as unlogged sequences. Changes to unlogged sequences are not written to the write-ahead log. This means they are not crash safe. Unlogged sequences do not offer significant performance improvement, but it makes much more sense to have an unlogged identity/serial column sequence if its associated distributed table is unlogged as well:

-- create an unlogged table with two serial columns
CREATE UNLOGGED TABLE unlogged_sequences_test(key bigserial, a serial);

-- distribute the table
-- serial column sequences will be automatically distributed as well
SELECT create_distributed_table('unlogged_sequences_test', 'key');

-- the serial column sequences are unlogged in this node
SELECT relname,
       CASE relpersistence
            WHEN 'u' THEN 'unlogged'
            WHEN 'p' then 'logged'
        END AS logged_info
FROM pg_class
WHERE relname IN ('unlogged_sequences_test_key_seq', 'unlogged_sequences_test_a_seq');

          relname                | logged_info
---------------------------------+-------------
 unlogged_sequences_test_key_seq | unlogged
 unlogged_sequences_test_a_seq   | unlogged
(2 rows)

\c - - - :another_node_port
-- the serial column sequences are also unlogged in another node
SELECT relname,
       CASE relpersistence
            WHEN 'u' THEN 'unlogged'
            WHEN 'p' then 'logged'
        END AS logged_info
FROM pg_class
WHERE relname IN ('unlogged_sequences_test_key_seq', 'unlogged_sequences_test_a_seq');

          relname                | logged_info
---------------------------------+-------------
 unlogged_sequences_test_key_seq | unlogged
 unlogged_sequences_test_a_seq   | unlogged
(2 rows)

The following new PG15 features are not YET supported in Citus 11.1

We are working on integrating these Postgres 15 features into Citus for future releases:

  • MERGE command
  • CLUSTER command on partitioned distributed tables
  • ALTER TABLE .. SET ACCESS METHOD command

Shard rebalancer performance

This release improves the speed of the shard rebalancer, both for the blocking and the non-blocking one. Depending on the situation it can be up to twice as fast (and sometimes even more). If you have big shards that need to moved to a new node, this could sometimes take a long time. And while it’s possible that the rebalancer operation is slowed due to limitations of your hardware, such as the speed of your disk and network. But it turned out there were some performance improvements we could make in the way we moved shards as well. For most of these speed improvements you don’t need to do anything, except upgrade to Citus 11.1. But one improvement requires changes to use:

Shard Rebalancer: Increasing CPU Priority of logical replication

When moving a shard from a source node to a target node using the non-blocking rebalancer there are two main phases:

  1. The copy phase: Use a COPY command to copy a snapshot of the data to the target node.
  2. The catchup phase: Uses logical replication to replay all the changes on the target node that happened on the source node while the copy phase was happening. While doing catchup new changes will continue to happen as well, so we continue catching up until the target has applied all the new changes.

Under some workloads the catchup phase would take extremely long. This happened when moving a shard from a node that is also handling lots of queries at the same time. These queries on the source made changes to the data with roughly the same speed as the speed with which the target was replaying them, and thus no real progress was being made during the catchup.

As it turned out, the reason this happened was because the process on the source wasn’t able to get enough CPU time to send the changes to the target. To fix that we added a new setting that allows you to give more CPU priority to the logical replication sender processes. The default priority of a process is 0 and the lower the value of the priority, the more CPU time it gets. So if you have this issue, you should try changing the following setting on all the nodes in your cluster to -10 (the lowest value you can use is -20, but that should not be necessary and it might actually interfere with other important processes on your machine):

citus.cpu_priority_for_logical_replication_senders = -10

Changing just this setting is not enough though, you also need to change some of your OS settings. To actually make this setting do what it’s supposed to, it’s important to run Postgres with a more permissive value for the ‘nice’ resource limit than Linux will do by default. By default Linux will not allow a process to set its priority to a lower value than it currently is. This needs to be done on all the nodes in your cluster.

If you use systemd to start Postgres all you have to do is add a line like this to the systemd service file:

LimitNice=-10

If you’re not using systemd then you’ll have to configure /etc/security/limits.conf like so, assuming that you are running Postgres as the postgres OS user:

postgres            soft    nice            -10
postgres            hard    nice            -10

Finally (if you’re not using systemd) you’d have add the following line to /etc/pam.d/common-session:

session required pam_limits.so

Shard rebalancer observability

In [Citus 10.1 we greatly improved the get_rebalance_progress function, but it only worked for shard moves that were initiated indirectly by rebalance_table_shards(). In this release we improve it in two more ways:

  1. It now also reports progress for shard moves that are triggered outside of the rebalancer. This is important for the background rebalancer, so you can also track progress of moves initiated by the newly added citus_rebalance_start() function. It can also be useful if you’re moving shards manually when debugging an issue.
  2. It now reports progress of shard copies in addition to shard moves. This is very useful when reference tables need to be copied to newly added nodes, when using citus.replicate_reference_tables_on_activate = off. If you have large reference tables, this might take a long time and before Citus 11.1 there was no way of getting insight into the progress being made. Now simply calling get_rebalance_progress() will show you how the data copy is going.

Background Shard rebalancer

Running the rebalancing functions in Citus can take quite some time, depending on the data sizes. And even though we kept optimizing the speed of shard moves, there are limitations to how fast a rebalance can be. And due to the transactional nature of DDL commands in Postgres, you need to stay connected to the session executing the rebalance functions during a rebalance. Over time this has shown to be a usability challenge.

Prior to Citus 11.1, any disruption in the session running a rebalance would cause the rebalance to stop, and even though we didn’t completely roll back all moves, the move that was in progress will lose all the work performed already during such a disruption. For this reason many of you would actually run these operations from a separate machine that is closer to your data, with a more reliable connection. This minimizes the session to be interrupted due to network failures. However, it is a place that would need separate monitoring to make sure that the rebalance keeps running.

With Citus 11.1 we introduce 3 new functions to invoke the rebalancer of Citus. - citus_rebalance_start() - citus_rebalance_stop() - citus_rebalance_wait()

citus_rebalance_start() being the first and probably most important one of them all. Contrary to rebalance_table_shards() this new function will return quickly if not instantaniously. Instead of planning and moving all shards in the current session citus_rebalance_start() will do the planning and hand off the execution to a background worker. Internally the planned moves will be written into a catalog table as separate tasks to be executed. The Citus Maintenance Daemon will keep an eye on these tasks, and starts other backends when there are moves to be executed.

The biggest benefit will be that now you don’t need to keep your session connected, meaning you can initiate a rebalance operation from anywhere, disconnect and be sure that it will run to completion. No need to provision new machines or otherwise run this previously long running tasks.

Besides executing the moves in the background, Citus will also retry a move that intermittently failed. This could happen when the cluster experiences presure on memory, disk or network. In the past any such occurance would cause the rebalance to stop until the user would notice the failure and restart. Now that Citus executes these moves in the background, such a temporary failure could be retried.

All of this changes how you would stop the rebalancer when it is in progress. If for any reason you need to stop the rebalancer, you can’t simply cancel your running query. Simply cancelling the session that executes the move will be seen as a failure, and after some backoff period the move will be restarted. Instead of cancelling a running backend we provide you with a new function, citus_rebalance_stop(), to stop a running rebalance. Besides cancelling the backend that executes the move it will also mark the current task, and all future tasks of a rebalance as cancelled.

While the rebalance is running you can look into its progress with the improved shard rebalancer observability!

Although it is a powerful feature to not have your session wait till the rebalancer finishes there are situations where it helps automation if you could. For this purpose there is one last feature added to this new rebalancing experience; citus_rebalance_wait(). This is a simple function that will block the session until all the tasks of a rebalance have been finished. You should still be aware that, when using this for automation, that a full rebalance could take a significant amount of time. For this to work you will need to make sure to stay connected to the system. However, it is easy to recover if you got disconnected. Simply run citus_rebalance_wait() again. If the rebalance has finished you will get a notice telling you there is no rebalance running, and can continue your automation.

These changes will take away the rough edges that we have seen with the biggest users and should help you be more productive, quicker with operating your database.

Citus Columnar

Citus 10.0 brought columnar compression to Postgres. Citus 11.1 moves the columnar features into their own extension. This means you can choose to load only the citus_columnar extension or drop the citus_columnar extension if you do not need columnar features. First, make sure that citus_columnar is in the shared_preload_libraries if you want to use only columnar features of Citus:

show shared_preload_libraries ;
┌──────────────────────────┐
 shared_preload_libraries 
├──────────────────────────┤
 citus_columnar           
└──────────────────────────┘
(1 row)

Then, you can create citus_columnar and use columnar-only features of Citus:

CREATE EXTENSION citus_columnar;

-- create and use columnar tables without distributing them
CREATE TABLE simple_columnar(i INT8) USING columnar;
INSERT INTO simple_columnar SELECT generate_series(1,100000);
SELECT AVG(i) FROM simple_columnar;
┌────────────────────┐
        avg         
├────────────────────┤
 50000.500000000000 
└────────────────────┘
(1 row)

As an outcome of this separation, if you create the citus extension, now you are also going to see citus_columnar in the list of extensions:

 CREATE EXTENSION citus;

 -- list the extensions installed, you are going to see both citus and citus_columnar
 \dx
                      List of installed extensions
┌────────────────┬─────────┬────────────┬──────────────────────────────┐
      Name       Version    Schema            Description          
├────────────────┼─────────┼────────────┼──────────────────────────────┤
 citus           11.1-1   pg_catalog  Citus distributed database   
 citus_columnar  11.1-1   pg_catalog  Citus Columnar extension     
 plpgsql         1.0      pg_catalog  PL/pgSQL procedural language 
└────────────────┴─────────┴────────────┴──────────────────────────────┘
(3 rows)

Deprecated Features

Notable Fixes

  • Pull request #6059: Fixes a bug that could cause failures in insert into select
  • Pull request #6288: Fixes a bug that prevented SELECT .. FOR UPDATE to open a transaction block when it was the first statement in function
  • Pull request #6231: Fixes an issue that could cause logical reference table replication to fail
  • Pull request #6173: Fix reference table lock contention for concurrent create and drop reference tables
  • Pull request #6279: Fixes a bug that caused non-blocking shard moves to fail when a partitioned table had a foreign key to reference table