What’s new in Citus 11.2 for Postgres, plus Patroni HA support for Citus

Written by Marco Slot
February 8, 2023

Our goal for the Citus extension is for you to be able to use all PostgreSQL features at any scale, with a seamless scaling experience. Distributed tables (or more generally “Citus tables”) are a powerful tool to get high performance at any scale. There are only a few remaining limitations when distributing a PostgreSQL table, but we are determined to solve them all. The Citus 11.2 release checks off another five SQL & DDL features that now work seamlessly on Citus tables. We also improved progress tracking for the shard rebalancer, so you know exactly what’s going on when rebalancing your cluster.

We also want PostgreSQL tools to work out-of-the-box even if you have a distributed PostgreSQL cluster. One of the most frequent questions we get on the Citus Slack from our open source users is how to set up high availability. Alexander Kukushkin, who is the primary maintainer of Patroni and recently joined the Citus database engine team, has therefore developed a new version of Patroni that includes support for Citus!

Before we dive in, you can find detailed release notes for Citus 11.2 by the engineering team on our Updates page.

Since Citus is fully open source, you can always see everything we’re doing on the Citus GitHub repo. If you like Citus, we always appreciate a star ⭐!

Announcing Patroni 3.0 with built-in Citus support

The big news for users who want to run Citus outside of Azure is that the latest release of the most popular high availability (HA) solution for PostgreSQL, Patroni 3.0, comes with built-in Citus support!

With Citus and Patroni combined, you can now run the latest version of PostgreSQL in a way that’s horizontally scalable with high availability in any environment, and it’s all open source! We will share a detailed blog post on Patroni for Citus soon, but here is a quick rundown.

  • Patroni runs alongside your PostgreSQL servers on multiple nodes, sets up physical replication with a leader and one or more standby nodes, and ensures automatic failover between them. The leader node takes a lock via a consensus store like etcd, and if the leader node fails to renew the lock because it went down, then one of the standby nodes will automatically take over.
  • Citus support in Patroni works the same way, with each node having one or more hot standby nodes. Patroni will automatically update the Citus metadata when a switchover or failover occurs and gives you insights and control over your whole cluster.
postgres@coord1:~$ patronictl list demo
+ Citus cluster: demo ----------+--------------+---------+----+-----------+
| Group | Member  | Host        | Role         | State   | TL | Lag in MB |
+-------+---------+-------------+--------------+---------+----+-----------+
|     0 | coord1  | 172.27.0.10 | Replica      | running |  1 |         0 |
|     0 | coord2  | 172.27.0.6  | Sync Standby | running |  1 |         0 |
|     0 | coord3  | 172.27.0.4  | Leader       | running |  1 |           |
|     1 | work1-1 | 172.27.0.8  | Sync Standby | running |  1 |         0 |
|     1 | work1-2 | 172.27.0.2  | Leader       | running |  1 |           |
|     2 | work2-1 | 172.27.0.5  | Sync Standby | running |  1 |         0 |
|     2 | work2-2 | 172.27.0.7  | Leader       | running |  1 |           |
+-------+---------+-------------+--------------+---------+----+-----------+
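
Enabling this mostly comes down to a new citus section in the Patroni configuration. Below is a minimal sketch of the relevant part of a node’s configuration, assuming an etcd consensus store; the scope, addresses, and paths are illustrative, and a real setup also needs authentication and replication settings:

scope: demo
citus:
  database: citus  # Patroni manages the citus extension in this database
  group: 0         # 0 is the coordinator; worker groups use 1, 2, ...
etcd3:
  hosts: 172.27.0.100:2379
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 172.27.0.10:5432
  data_dir: /var/lib/postgresql/data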

Citus 11.2 comes with improvements to make worker node switchover in Patroni extra smooth, but Patroni 3.0 is compatible with Citus 10.0 and above.
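
For example, you can trigger a controlled switchover of a single worker group via patronictl. A sketch, assuming the demo cluster above (in Patroni 3.0, the --group option selects which Citus group to operate on):

postgres@coord1:~$ patronictl switchover demo --group 1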

Update in Mar 2023: Since first publishing this Citus 11.2 post, Alexander Kukushkin has written a deep-dive blog post that explains more, titled: Patroni 3.0 & Citus: Scalable, Highly Available Postgres.

For Azure users, Azure Cosmos DB for PostgreSQL will continue to be the easiest way to get a Citus cluster with high availability, backups, read replicas in any region, configuration & role management, upgrades, monitoring, and experts at Microsoft who carry the pager if something were to go wrong. Patroni 3.0 will make it easier to run Citus in production outside of Azure by yourself, or even on Azure if you have very specialized requirements (e.g. using only fast ephemeral storage, or proprietary extensions).

Citus on Kubernetes

With Patroni 3.0, it also becomes a lot easier to run Citus on Kubernetes, which has been one of our most popular feature requests.

Here is an example of setting up a Citus cluster with a coordinator and 2 worker nodes, each with a hot standby and automatic failover, on your existing Kubernetes cluster:

git clone https://github.com/zalando/patroni.git
cd patroni/kubernetes/
kubectl apply -f citus_k8s.yml
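
Once the manifest is applied, you can watch the pods come up. A sketch, assuming the cluster-name=citusdemo label that the example manifest applies to its pods:

kubectl get pods -l cluster-name=citusdemo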

Before moving into production, you’ll want to carefully tweak things like networking, authentication, node size, node affinity, and storage, but Patroni now gives you an easy way to try Citus on Kubernetes and a good starting point for more advanced setups.

Improved SQL & DDL support for Citus tables

The other big news in the Citus 11.2 release is that we tackled some of the most frequently requested SQL and DDL support improvements for Citus tables.

Citus enables you to scale your PostgreSQL workload by using distributed and reference tables, together referred to as “Citus tables”. Our goal with Citus is to make every PostgreSQL feature work seamlessly on Citus tables. That means every PostgreSQL feature (and there are a lot) should have a distributed implementation. I would say the feature compatibility of Citus tables is already >98%, but that still leaves many individual features that we need to flesh out.

In Citus 11.2, we focused on tackling some of the top Postgres feature requests:

Improved support for outer joins between reference & distributed tables

A particularly challenging type of SQL query in a distributed database is an outer join that has a reference table, or what our developers refer to as “recurring tuples”, on the outer side of an outer join, and a distributed table on the inner side.

Outer joins always include all records from the “outer” side. For a LEFT JOIN, the outer table is the left side of the join, for a RIGHT JOIN it is the right side, and for a FULL JOIN it is both sides. All the rows from the outer side of the join have at least one corresponding row in the result of the join unless they are filtered out by the WHERE clause.

Figure 1: A LEFT JOIN B - A is the “outer” side and B is the “inner” side, all of A is included

Normally, when you join a distributed table and a reference table, Citus joins each individual shard of the distributed table with the reference table and combines the results. This works well for inner joins, because we only get back reference table tuples that match tuples in a shard, and there is no overlap between shards. It also works well for outer joins that have a distributed table on the outer side.

If we were to follow the same strategy for outer joins when a reference table is the outer table and a distributed table is the inner table, then reference table rows that do not match a given shard would be returned many times, once for each shard, and the result would be incorrect. Prior to Citus 11.2, Citus would error out for such queries, but we made improvements to the query planner in 11.2 that make those queries “just work”.

Below is a full example of a LEFT JOIN using a reference table on the outer side, which now works seamlessly on 11.2.

-- set up an item_types table
create table item_types (type_id bigint generated by default as identity, type_name text);
select create_reference_table('item_types');
alter table item_types add primary key (type_id);
alter table item_types add unique (type_name);
insert into item_types (type_name) values ('regular');
insert into item_types (type_name) values ('special');

-- set up an items table
create table items (item_id bigint generated by default as identity, type_id bigint, item_name text);
select create_distributed_table('items', 'item_id');
alter table items add foreign key (type_id) references item_types (type_id);
insert into items (type_id, item_name) values (1, 'thing');

-- include all item_types, even if there is no match in items
select * from item_types left join items using (type_id);
┌─────────┬───────────┬─────────┬───────────┐
│ type_id │ type_name │ item_id │ item_name │
├─────────┼───────────┼─────────┼───────────┤
│       1 │ regular   │       1 │ thing     │
│       2 │ special   │         │           │
└─────────┴───────────┴─────────┴───────────┘
(2 rows)

This SQL improvement applies not only to reference tables, but also to any distributed subplan, such as the ones Citus generates for subqueries or function calls.

-- number of events by hour for the past week, include hours without events
select hour_start, count(event.event_id)
from generate_series(now()-'1 week'::interval, now(), '1 hour') hour_start
left outer join (select * from events where created_at >= now()-'1 week'::interval) event
on created_at <@ tstzrange(hour_start, hour_start+'1 hour'::interval)
group by hour_start
order by hour_start;

An important note: Queries like the one above might transfer more data from the distributed table over the network than regular joins. Where possible, Citus pushes down filters and projections, but it can still be useful to add more filters into the distributed part of the join by writing a subquery. At very large scale, you may need to tweak the query structure and perhaps avoid the outer join.
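
For example, reusing the item_types and items tables from above, a filter on the distributed table can be placed in a subquery so that it is applied on the worker nodes before the outer join:

-- filter the distributed side before joining with the reference table
select *
from item_types
left join (select * from items where item_name like 'thing%') i using (type_id);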

Adding a constraint without a name

We also made improvements to DDL support for Citus tables: You can now add constraints without a name (or rather, with an auto-generated name) to Citus tables. Previously, we could not easily push down DDL commands that did not specify a constraint name, because Postgres would generate a different name for every shard under the covers, in a way that was inconsistent with the way Citus names constraints on shards.

In Citus 11.2, we solved this by extending our “deparsing” logic for the DDL commands. Deparsing converts the parse tree of a DDL command back into a string. In this case, we put the auto-generated name in the parse tree and convert it back to a DDL command that we can run on each shard. That way, we can support adding constraints with auto-generated names on distributed tables:

-- a customers reference table for the foreign key further down
CREATE TABLE customers (id bigint PRIMARY KEY);
SELECT create_reference_table('customers');

CREATE TABLE customer_data (customer_id bigint not null, key text not null, value jsonb not null);
SELECT create_distributed_table('customer_data', 'customer_id');

-- Make sure the table has a primary key
ALTER TABLE customer_data ADD PRIMARY KEY (customer_id, key);
-- All our values should have a version set
ALTER TABLE customer_data ADD CHECK ((value ? 'version'));
-- Make sure the customer ID exists in the customers table
ALTER TABLE customer_data ADD FOREIGN KEY (customer_id) REFERENCES customers (id);
\d customer_data
             Table "public.customer_data"
┌─────────────┬────────┬───────────┬──────────┬─────────┐
│   Column    │  Type  │ Collation │ Nullable │ Default │
├─────────────┼────────┼───────────┼──────────┼─────────┤
│ customer_id │ bigint │           │ not null │         │
│ key         │ text   │           │ not null │         │
│ value       │ jsonb  │           │ not null │         │
└─────────────┴────────┴───────────┴──────────┴─────────┘
Indexes:
    "customer_data_pkey" PRIMARY KEY, btree (customer_id, key)
Check constraints:
    "customer_data_check" CHECK (value ? 'version'::text)
Foreign-key constraints:
    "customer_data_customer_id_fkey" FOREIGN KEY (customer_id) REFERENCES customers(id)

Of course, it is still good practice to explicitly name your constraints such that you can drop them later without having to first inspect the database to know the names.

For example:

ALTER TABLE customer_data ADD CONSTRAINT check_value CHECK ((value ? 'version'));
ALTER TABLE customer_data DROP CONSTRAINT check_value;

Identity columns

Citus has had support for sequences for a long time, and the common way to use sequences is to put bigserial in your CREATE TABLE statement. Recent versions of PostgreSQL also added support for “identity columns”, which are a more standards-compliant way of using sequences. While bigserial has remained far more popular among PostgreSQL users, we are starting to see some popular tools and frameworks like Django adopt the GENERATED .. AS IDENTITY syntax, so we made it work for Citus tables.

CREATE TABLE items (
    item_id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    plain_old_sequence bigserial,
    value text
);
SELECT create_distributed_table('items', 'item_id');
INSERT INTO items (value) VALUES ('hello');
INSERT INTO items (value) VALUES ('world');
SELECT * FROM items;
┌─────────┬────────────────────┬───────┐
│ item_id │ plain_old_sequence │ value │
├─────────┼────────────────────┼───────┤
│       1 │                  1 │ hello │
│       2 │                  2 │ world │
└─────────┴────────────────────┴───────┘
(2 rows)

Identity columns work pretty much the same way as sequences on Citus tables, which also means there is a similar caveat: If you use the query from any node feature in Citus, the worker nodes will put their node group ID in the first 16 bits of the generated values to ensure uniqueness, which gives you quite high values. In addition, writes to identity columns of type smallint and int are not supported from worker nodes.
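
To get a feel for the magnitude, here is a back-of-the-envelope sketch of the layout described above in plain SQL (the exact encoding in Citus may differ in detail):

-- node group ID 2 placed in the top 16 bits of a 64-bit integer,
-- leaving 48 bits for the actual counter
SELECT (2::bigint << 48) + 1 AS first_value_generated_via_group_2;
-- returns 562949953421313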

DROP OWNED BY

Citus is an open source project developed by Microsoft, and of course, contributions are more than welcome. Two great patches by GitHub user songjinzhou in this release—for DROP OWNED BY and ALTER TABLE ... SET ACCESS METHOD—further contribute to the overall PostgreSQL compatibility in Citus.

When you want to drop a role in PostgreSQL that still owns certain database objects, you might see an error like:

DROP USER foo;
ERROR:  role "foo" cannot be dropped because some objects depend on it

If you do not want to keep any of the objects owned by the user, you can easily drop them all at once and then drop the user without issue:

DROP OWNED BY foo;
DROP USER foo;

The DROP OWNED BY command is now propagated to all the nodes by Citus.

ALTER TABLE … SET ACCESS METHOD …

PostgreSQL 15 added the ALTER TABLE ... SET ACCESS METHOD command, which you can use to switch between the heap and columnar access methods in Citus.

Citus already had the alter_table_set_access_method function, which still has an important benefit: It allows reads to continue until it is done rewriting the table.

-- blocks reads and writes while compressing the table
ALTER TABLE data SET ACCESS METHOD columnar;
-- only blocks writes while compressing the table
SELECT alter_table_set_access_method('data', 'columnar');

For the moment, we recommend you continue using alter_table_set_access_method, but the new syntax is compatible with PostgreSQL and likely to be improved in the future.
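
Either way, you can check which access method a table currently uses by querying the PostgreSQL catalogs:

-- look up the access method of the data table
SELECT am.amname
FROM pg_class c
JOIN pg_am am ON c.relam = am.oid
WHERE c.relname = 'data';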

Shard rebalancer progress tracking

Apart from full support for all PostgreSQL commands on Citus tables, another common request from Citus users is to get better insights into rebalance operations.

Citus 11.1 introduced rebalancing in the background, meaning you can simply start a rebalance operation by running SELECT citus_rebalance_start() and then disconnect from the database, while rebalancing continues in the background.
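
For example:

-- start a rebalance in the background; it keeps running if you disconnect
SELECT citus_rebalance_start();
-- or, if a script should block until the rebalance completes
SELECT citus_rebalance_wait();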

New to Citus 11.2, by using the new citus_rebalance_status() function, you can now also get detailed insights into the byte-level progress of ongoing shard group moves. For instance, you can see what phase the move is in, how many bytes of the source data have been copied, and how many bytes of WAL are still to be replicated using logical replication after the initial data copy (the “lag”):

select jsonb_pretty(details) from citus_rebalance_status();
┌────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                            jsonb_pretty                                            │
├────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {                                                                                                  │
│     "tasks": [                                                                                     │
│         {                                                                                          │
│             "LSN": {                                                                               │
│                 "lag": "1144 bytes",                                                               │
│                 "source": "1/E77A8800",                                                            │
│                 "target": "1/E77A8388"                                                             │
│             },                                                                                     │
│             "size": {                                                                              │
│                 "source": "86 MB",                                                                 │
│                 "target": "86 MB"                                                                  │
│             },                                                                                     │
│             "hosts": {                                                                             │
│                 "source": "citusnode-2:5432",                                                      │
│                 "target": "citusnode-5:5432"                                                       │
│             },                                                                                     │
│             "phase": "Catching Up",                                                                │
│             "state": "running",                                                                    │
│             "command": "SELECT pg_catalog.citus_move_shard_placement(102010,2,5,'force_logical')", │
│             "message": "",                                                                         │
│             "retried": 0,                                                                          │
│             "task_id": 132                                                                         │
│         }                                                                                          │
│     ],                                                                                             │
│     "task_state_counts": {                                                                         │
│         "done": 2,                                                                                 │
│         "blocked": 13,                                                                             │
│         "running": 1                                                                               │
│     }                                                                                              │
│ }                                                                                                  │
└────────────────────────────────────────────────────────────────────────────────────────────────────┘
(1 row)

If you need to know the raw number of bytes to build additional automation using the JSON output, you can also call citus_rebalance_status(raw := true).
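
For example, here is a sketch that extracts per-task byte counts from the raw output (assuming the same JSON structure as the pretty-printed status above):

-- raw sizes are plain byte counts rather than pretty-printed strings
select task->'size'->>'source' as source_bytes,
       task->'size'->>'target' as target_bytes
from citus_rebalance_status(raw := true),
     jsonb_array_elements(details->'tasks') as task
where task->>'state' = 'running';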

Just use highly scalable, highly compatible, highly available PostgreSQL

Postgres gives you an enormous data storage and processing toolkit in the form of comprehensive support for SQL, functions, types, sequences, constraints, extensions, etc. By using PostgreSQL, your database is always prepared for any functional requirements, and by using Citus, you are always prepared for any scale requirements. With the new 11.2 improvements we are getting another step closer to supporting all PostgreSQL features at any scale.

In addition, Citus is always up-to-date with the latest PostgreSQL, so you do not have to worry about missing out on new Postgres features or diverging from the ecosystem. Whether you use Azure Cosmos DB for PostgreSQL or Citus open source with Patroni, you are also prepared for any high availability requirements.

All in all, Citus becomes a future-proof choice for building your next application.

To get started with Citus 11.2 or Patroni 3.0, here are some great resources:

Watch replay of Citus 11.2 release event livestream

Watch the Citus 11.2 Release Event replay with engineers from the Citus team talking through the new release and demonstrating new features, including hands-on setup of Citus with HA using Patroni 3.0.


Written by Marco Slot

Former lead engineer for the Citus database engine at Microsoft. Speaker at Postgres Conf EU, PostgresOpen, pgDay Paris, Hello World, SIGMOD, & lots of meetups. Talk selection team member for Citus Con: An Event for Postgres. PhD in distributed systems. Loves mountain hiking.

@marcoslot marcocitus