Test drive the Citus 11.0 beta for Postgres

Written by Marco Slot
March 26, 2022

Today we released Citus 11.0 beta, which is our first ever beta release of the Citus open source extension to Postgres. The reason we are releasing a beta version of 11.0 is that we are introducing a few fundamentally new capabilities, and we would like to get feedback from those of you who use Citus before we release Citus 11.0 to the world.

The biggest change in Citus 11.0 beta is that the schema and Citus metadata are now automatically synchronized throughout the database cluster. That means you can always query distributed tables from any node in a Citus cluster!

The easiest way to use Citus is to connect to the coordinator node and use it for both schema changes and distributed queries, but for very demanding applications, you now have the option to load balance distributed queries across the worker nodes in (parts of) your application by using a different connection string and factoring a few limitations.

We also deprecated a few features in 11.0 beta to accelerate our development, but we expect that will not affect most of you.

In this 11.0 beta blog post you will learn about:

You can try out the new Citus 11.0 beta to see how your application will work with it—or to experiment with the new features. You can find the packages in our installation instructions.

And we’ve just rolled out a new type of release notes for 11.0 beta, which should be useful if you want to dive deeper into our open source GitHub repo and see the issues we’ve addressed in this release. If you find it useful please tell us so on Slack! We plan to publish this type of release notes for upcoming Citus releases too. You can find these release notes via the “UPDATES” link in the top nav of the Citus website.

Automatic metadata syncing allows you to query from any node

Citus is probably the best way of scaling your PostgreSQL database. When you distribute your tables, Citus can route and parallelize complex queries across a large cluster of PostgreSQL servers. Apart from the initial set-up, the distribution is transparent to the application: your application still connects to a single PostgreSQL node (the “coordinator”, in Citus parlance), and the coordinator distributes the Postgres queries sent by your application under the covers.

diagram 1: users and items are distributed tables, and their metadata is only on the coordinator
Figure 1: A Citus cluster in Citus 10.2 or earlier, where users and items are distributed tables—and their metadata is only on the coordinator.

The single coordinator architecture has many benefits and is very performant, but for certain high-performance workloads the coordinator can become a bottleneck. In practice, very few applications get bottlenecked by the coordinator, since the Citus coordinator does relatively little work. However, we do find that applications developers often want to be prepared for the future in terms of scalability, and there are some very demanding enterprise applications.

For quite a while, Citus has had the ability to perform distributed queries via the worker nodes, by synchronizing distributed table schema and metadata. In the past, we sometimes referred to this capability as “MX”. However, the MX feature had various limitations around the use of sequences, functions, schemas, and other database objects—meaning not all tables supported metadata syncing.

Citus 11.0 beta changes to a new mode of operation: Now all Citus database clusters always use metadata syncing. That means that with Citus 11.0 beta and all future versions, you can always run distributed Postgres queries from any node.

diagram 2: users and items are distributed tables, and with the new automated metadata syncing feature
Figure 2: A Citus 11.0 beta cluster where users and items are distributed tables—and with the new automated metadata syncing feature, their metadata is synchronized to all nodes.

When you start using Citus 11.0 beta, there is nothing you need to do to enable the new metadata syncing feature. Every distributed table, database object, and schema change will automatically be propagated to all the Citus worker nodes. Schema changes and node management still need to be sent to the Citus coordinator, and you can choose to send your distributed Postgres queries to the coordinator or to any other node by changing the connection string in your application.

How to configure your Citus 11.0 beta cluster

If you need to perform many queries per second on a PostgreSQL database, it may be necessary to use a relatively large number of connections. Ultimately, your aggregate throughput is mathematically bottlenecked on [number of connections] / [avg. response time] because you can only do one query per connection at a time.

When your application opens a connection to one of the Citus nodes, that connection spawns a Postgres process. This Postgres process will need to make internal connections to other nodes to query the shards of distributed tables. These internal connections are cached to minimize response time. That does mean that each connection from the client can eventually result in an additional internal connection to each other node, so each node will eventually get as many connections as the client makes to the whole database cluster. Fortunately, we made significant improvements to connection scalability in PostgreSQL 14, allowing Postgres (and Citus) to keep performing well at high connection counts.

If you decide to connect to the worker nodes from your application to run distributed queries, then your client connections are technically competing with internal connections. To make sure that every Postgres process that handles a client connection can also make internal connections to all other nodes, we added the citus.max_client_connections setting. This setting restricts the number of external client connections, while continuing to allow internal connections between Citus nodes. In addition to the usual installation instructions, we recommend adding the following settings to postgresql.conf on each Citus node (coordinator and all workers) to accommodate a large number of client connections:

# The maximum number of client + internal connections a node can handle
# The total number of client connections across all nodes should never exceed this number
max_connections = 6000

# The number of client connections an individual node can handle
# Should be no greater than: max_connections / node count including the coordinator
citus.max_client_connections = 500

With these settings, each node will accept up to 500 connections from your application, so if you have 10 worker nodes and 1 coordinator, then your application can make 5500 connections in total. You can increase this number further by using a connection pooler like pgbouncer on each node.

We also highly recommend adding the Citus coordinator to the metadata, such that worker nodes can connect to the coordinator as well. Some Citus features will only be available if the coordinator is in the metadata. We will likely make adding the coordinator required in the future.

-- on all nodes:
CREATE EXTENSION citus;
-- only on coordinator: add coordinator to metadata
SELECT citus_set_coordinator_host('<coordinator’s own hostname>', 5432);
-- only on coordinator: add worker nodes to metadata
SELECT citus_add_node('<worker 1 hostname>', 5432);
SELECT citus_add_node('<worker 2 hostname>', 5432);
-- only on coordinator:
CREATE TABLE items (key text, value text);
SELECT create_distributed_table('items', 'key');
-- from any node:
INSERT INTO items VALUES ('hello', 'world');

Load balancing queries across the worker nodes in Citus 11.0 beta

Once your Citus 11.0 beta cluster is up and running, you have 2 choices:

  • you can either connect your application to the coordinator as usual, or
  • load balance your Postgres queries across the worker nodes by using a client that supports load balancing with a custom connection string like JDBC or Npgsql. You should also be able to load balance in existing applications that already uses one of these clients.

Example JDBC connection string that load balances across 2 workers:

jdbc:postgresql://[email protected]:5432,worker2.host:5432/postgres?loadBalanceHosts=true

Example Npgsql connection string that load balances across 2 workers:

Host=worker1.host,worker2.host;Database=postgres;Username=user;Load Balance Hosts=true

An alternative could be to set up a DNS record that includes all worker node IPs. A downside of using DNS is that connections from the same machine that are opened at the same time will typically use the same IP, due to local DNS caching. Another alternative is to set up a dedicated load balancer like HAProxy.

When running your Postgres queries via the Citus worker nodes with 11.0 beta, there are a few limitations to be aware of:

  • You need to configure your application to perform schema changes via the Citus coordinator while queries can go via any node.
  • If you create a table on one worker node, it will not show up if you subsequently connect to a different worker node.
  • Local tables on the coordinator only appear on worker nodes if you enable the citus.use_citus_managed_tables setting or create foreign keys to reference tables.
  • Sequences that generate bigint will include the ID of the node that you are connected to in the first 16 bits of the sequence numbers, which means sequence numbers are still unique, but not monotonic.
  • Sequences that generate int/smallint will throw an error when trying to insert from a worker node

We expect to address the limitations above in future Citus releases.

Upgrading an existing Citus database cluster to Citus 11.0 beta

If you are upgrading an existing (non-production) cluster to Citus 11.0 beta, then after installing the new packages, you will need to call a function to finalize the upgrade:

-- on all nodes
ALTER EXTENSION citus UPDATE;

-- only on the coordinator
select citus_finalize_upgrade_to_citus11();

The upgrade function will make sure that all worker nodes have the right schema and metadata. It also addresses several naming issues that affected shards of partitioned tables.

If there is anything preventing metadata from being synchronized (e.g. missing permissions on worker nodes, or conflicting objects) the upgrade function will throw an error. You can still use your existing Citus database cluster via the coordinator until you address the issue and finalize the upgrade, but some new 11.0 beta functionality will not be available.

New views for cluster insights

A frequently requested Citus feature is to have better insights into what’s going on in the database cluster. This becomes even more important when some queries come in via the worker nodes.

We revamped the citus_dist_stat_activity view to show the information from pg_stat_activity from all client sessions across all nodes, along with a global_pid (or gpid), which uniquely identifies a client session and all internal connections associated with the session. The gpid starts with the node ID of the node that initiated the query, meaning the node that the client connected to.

SELECT nodeid, global_pid, query FROM citus_dist_stat_activity where application_name = 'psql';
┌────────┬─────────────┬────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────┐
 nodeid  global_pid   state                                                   query                                                  
├────────┼─────────────┼────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┤
      1  10000001303  active  SELECT nodeid, global_pid, state, query FROM citus_dist_stat_activity where application_name = 'psql'; 
      2  20000001346  active  select count(*), pg_sleep(300) from test;                                                              
└────────┴─────────────┴────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────┘

If you want to cancel a specific query, you can simply pass the global_pid to pg_cancel_backend. This works from any node.

SELECT pg_cancel_backend(20000001346);

You can also see everything going on in the cluster (both distributed queries and internal queries) using the new citus_stat_activity view:

SELECT nodeid, global_pid, state, query, is_worker_query FROM citus_stat_activity WHERE global_pid = 20000001500;
┌────────┬─────────────┬────────┬──────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────────┐
 nodeid  global_pid   state                                                query                                                is_worker_query 
├────────┼─────────────┼────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────────┤
      2  20000001500  active  select count(pg_sleep(300)) from test;                                                            f               
      2  20000001500  active  SELECT count(pg_sleep('300'::double precision)) AS count FROM public.test_102153 test WHERE true  t               
      2  20000001500  active  SELECT count(pg_sleep('300'::double precision)) AS count FROM public.test_102155 test WHERE true  t               
      3  20000001500  active  SELECT count(pg_sleep('300'::double precision)) AS count FROM public.test_102156 test WHERE true  t               
      3  20000001500  active  SELECT count(pg_sleep('300'::double precision)) AS count FROM public.test_102154 test WHERE true  t               
└────────┴─────────────┴────────┴──────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────┘

If you are looking at a specific node using pg_stat_activity, you can now also find the gpid that a worker query belongs to in the application_name:

select pid, application_name, state, query from pg_stat_activity where query like '%count%' and application_name <> 'psql';
┌──────┬─────────────────────────────────┬────────┬──────────────────────────────────────────────────────────────────────────────────────────────────┐
 pid          application_name          state                                                query                                               
├──────┼─────────────────────────────────┼────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
 1548  citus_internal gpid=10000001547  active  SELECT count(pg_sleep('300'::double precision)) AS count FROM public.test_102153 test WHERE true 
 1550  citus_internal gpid=10000001547  active  SELECT count(pg_sleep('300'::double precision)) AS count FROM public.test_102155 test WHERE true 
└──────┴─────────────────────────────────┴────────┴──────────────────────────────────────────────────────────────────────────────────────────────────┘

Since every node needs to be able to connect to every other node in a Citus cluster, we also introduced a new health check function that checks the connectivity of all possible paths. The result column indiciates whether the connection attempt was successful.

select * from citus_check_cluster_node_health();
┌───────────────┬───────────────┬─────────────┬─────────────┬────────┐
 from_nodename  from_nodeport  to_nodename  to_nodeport  result 
├───────────────┼───────────────┼─────────────┼─────────────┼────────┤
 localhost               1400  localhost           1400  t      
 localhost               1400  localhost           1401  t      
 localhost               1400  localhost           1402  t      
 localhost               1401  localhost           1400  t      
 localhost               1401  localhost           1401  t      
 localhost               1401  localhost           1402  t      
 localhost               1402  localhost           1400  t      
 localhost               1402  localhost           1401  t      
 localhost               1402  localhost           1402  t      
└───────────────┴───────────────┴─────────────┴─────────────┴────────┘
(9 rows)

With these functions, you should have far better insight into what’s going on in your cluster even if you do all queries via the coordinator.

Strict, immediate metadata syncing in transaction blocks

In a distributed database we often need to make trade-offs between consistency, fault tolerance, parallelism, and other distributed systems aspects. Citus needs to support PostgreSQL’s interactive, multi-statement transaction blocks, which is particularly challenging in a distributed setting.

For example, Citus typically parallelizes expensive operations across shards—such as analytical queries and create_distributed_table() over multiple connections per worker. When creating a database object, Citus propagates it to worker nodes over a single connection per worker. Combining those two operations in a single multi-statement transaction may cause issues because the parallel connections will not be able to see the object that was created over a single connection but not yet committed.

Consider a transaction block that creates a type, a table, loads data, and distributes the table:

BEGIN;
-- type creation over a single connection:
CREATE TYPE coordinates AS (x int, y int);
CREATE TABLE positions (object_id text primary key, position coordinates);
-- data loading thus goes over a single connection:
SELECT create_distributed_table(positions, object_id);
\COPY positions FROM positions.csv

Prior to Citus 11.0 beta, Citus would defer creating the type on the worker nodes and commit it separately when doing the create_distributed_table. This enabled the data copying in create_distributed_table to happen in parallel. However, it also meant that the type was not always present on the Citus worker nodes—or if the transaction rolled back, it would be present only on the worker nodes. We were able to hide these inconsistencies, but eventually they could cause issues.

With Citus 11.0 beta the default behaviour changes to prioritize schema consistency between coordinator and worker nodes. That does have a downside: If object propagation happens after a parallel command in the same transaction, then the transaction can no longer be completed, as highlighted by the ERROR in the code block below:

BEGIN;
CREATE TABLE items (key text, value text);
-- parallel data loading:
SELECT create_distributed_table(items, key);
\COPY items FROM items.csv
CREATE TYPE coordinates AS (x int, y int);
ERROR:  cannot run type command because there was a parallel operation on a distributed table in the transaction

If you run into this issue, there are 2 simple workarounds:

  1. Use set citus.create_object_propagation to deferred; to return to the old object propagation behavior, in which case there may be some inconsistency between which database objects exist on different nodes.
  2. Use set citus.multi_shard_modify_mode to sequential to disable per-node parallelism. Data load in the same transaction might be slower.

Statement-based shard replication changes

Back in 2016, we announced the deprecation of the statement-based shard replication for high availability (HA) in favour of streaming replication. When you enable high availability in Hyperscale (Citus) on Azure Database for PostgreSQL, every node will have a hot standby—meaning all the shards on that node are replicated via streaming replication. Even if you do not enable high availability, data is internally replicated by managed disks to protect against any data loss.

Although deprecated, we never removed statement-based replication… and it can still be useful for scaling reads in specific scenarios, However, the deprecated HA-related logic often caused issues and had prevented us from implementing metadata syncing for replicated tables. As part of the Citus 11.0 beta release, we are therefore changing the behavior as follows:

Prior to Citus 11.0 beta, when a write on a replicated shard failed on one of the shard placements, Citus marked that placement as invalid—after which the shard would have to be re-replicated. This feature never worked well, since sporadic write failures could invalidate a placement and incur expensive (write-blocking) re-replication.

Starting with Citus 11.0 beta, writes to replicated shards always use 2PC—which means they only succeed if all placements are up. In addition, the metadata for replicated tables is synced so they can be queried from any node.

Open-source users who are using statement-based shard replication today can upgrade to Citus 11.0 beta—however, to continue accepting writes to a shard when the node holding one of the replicas fails, the failed node should be disabled via the citus_disable_node function. The shards can still be re-replicated using replicate_table_shards after replacing or reactivating the node.

If you want to use statement-based replication to scale read throughput, you need to:

  • set citus.shard_replication_factor to 2 prior to creating a distributed table, and
  • set citus.task_assignment_policy to ‘round-robin’ to load balance queries across replicas.

The downside of using statement-based replication to scale read throughput is that writes have higher response time, and updates and deletes are serialized to keep replicas in sync.

Deprecations: Saying goodbye to seldom-used features

Like PostgreSQL, Citus maintains long-term backwards compatibility. We go through great lengths to make sure your applications continue working when upgrading Citus. However, sometimes a feature is no longer in line with how Citus is used and gets in the way of development. There are a few Citus features which we decided to remove in 11.0 beta:

  • Invalid shard placements: As described in the previous section, shards are no longer marked invalid when a write fails, since this behavior had some flaws and reduced reliability when using statement-based replication.
  • Append-distributed table functions: The original method of distribution in Citus was “append” distribution, which was optimized for append-only data. Hash-distributed tables are a lot easier to use and have many more capabilities and can also handle append-only data very well through partitioning. Citus 11.0 beta removes the functions for creating shards and loading new data into append-distributed tables. We are not aware of any append-distributed table users, but just in case: you can still upgrade to 11.0 beta, but the tables will become read only. We recommend creating new distributed tables that use the default hash-distribution, and moving the data over using an INSERT .. SELECT command.
  • Distributed cstore_fdw tables (should switch to columnar access method): Since version 10.0, Citus comes with built-in columnar storage. Prior to Citus 10.0, it was possible to use Citus with columnar storage using the now deprecated cstore_fdw extension. However, cstore_fdw did not support important PostgreSQL features like streaming replication and backups, hence we rarely saw Citus customers use columnar storage before Citus 10. Many companies are now successfully using Citus’ built-in columnar storage for time series data, and we are therefore dropping support for creating or using distributed cstore_fdw tables. If you have distributed cstore_fdw tables, we recommend converting them to the columnar access method before upgrading to 11.0 beta.

Towards new levels of scalability with Citus 11.0 beta

Citus is the only distributed database for both transactional and analytical workloads that is fully implemented as a PostgreSQL extension—meaning that Citus supports the powerful features of PostgreSQL at scale and inherits PostgreSQL’s stability, performance, versatility, extensibility, plus a large ecosystem of tools and extensions.

With automatic metadata syncing in the 11.0 beta release of Citus open source, you now have the option to query your Citus cluster from any node, improving Citus’ scalability even further.

If you’re interested in trying out the new Citus 11.0 beta, you can find installation instructions for the beta in the Citus docs. Once you’ve installed Citus, the getting started page has a lot of good content on how to get started, including tutorials and videos. Finally, if you want to know more about how Citus works internally, check out our SIGMOD paper.

Marco Slot

Written by Marco Slot

Lead engineer on the Citus engine team at Microsoft. Speaker at Postgres Conf EU, PostgresOpen, pgDay Paris, Hello World, SIGMOD, & lots of meetups. PhD in distributed systems. Loves mountain hiking.

@marcoslot marcocitus