Written by Marco Slot
March 26, 2022
Today we released Citus 11.0 beta, which is our first ever beta release of the Citus open source extension to Postgres. The reason we are releasing a beta version of 11.0 is that we are introducing a few fundamentally new capabilities, and we would like to get feedback from those of you who use Citus before we release Citus 11.0 to the world.
The biggest change in Citus 11.0 beta is that the schema and Citus metadata are now automatically synchronized throughout the database cluster. That means you can always query distributed tables from any node in a Citus cluster!
The easiest way to use Citus is to connect to the coordinator node and use it for both schema changes and distributed queries, but for very demanding applications, you now have the option to load balance distributed queries across the worker nodes in (parts of) your application by using a different connection string, as long as you factor in a few limitations.
We also deprecated a few features in 11.0 beta to accelerate our development, but we expect that will not affect most of you.
In this 11.0 beta blog post you will learn about the new metadata syncing capabilities, how to query from any node, the current limitations, and the features we are deprecating.
You can try out the new Citus 11.0 beta to see how your application will work with it—or to experiment with the new features. You can find the packages in our installation instructions.
[Update: In mid-June 2022, the Citus 11.0 beta ended and Citus v11 released. You can find a good overview in the Citus 11 blog post titled: Citus 11 for Postgres goes fully open source, with query from any node; you can find the detailed nitty-gritty release notes on our v11 Updates page; and you can see a few of the new Citus 11 features demo'd in this recording of the Citus 11 release party on YouTube.]
And we’ve just rolled out a new type of release notes for 11.0 beta, which should be useful if you want to dive deeper into our open source GitHub repo and see the issues we've addressed in this release. If you find it useful please tell us so on Slack! We plan to publish this type of release notes for upcoming Citus releases too. You can find these release notes via the “UPDATES” link in the top nav of the Citus website.
Citus is probably the best way of scaling your PostgreSQL database. When you distribute your tables, Citus can route and parallelize complex queries across a large cluster of PostgreSQL servers. Apart from the initial set-up, the distribution is transparent to the application: your application still connects to a single PostgreSQL node (the “coordinator”, in Citus parlance), and the coordinator distributes the Postgres queries sent by your application under the covers.
The single coordinator architecture has many benefits and is very performant, but for certain high-performance workloads the coordinator can become a bottleneck. In practice, very few applications get bottlenecked by the coordinator, since the Citus coordinator does relatively little work. However, we do find that application developers often want to be prepared for future scalability needs, and there are some very demanding enterprise applications.
For quite a while, Citus has had the ability to perform distributed queries via the worker nodes, by synchronizing distributed table schema and metadata. In the past, we sometimes referred to this capability as “MX”. However, the MX feature had various limitations around the use of sequences, functions, schemas, and other database objects—meaning not all tables supported metadata syncing.
Citus 11.0 beta changes to a new mode of operation: Now all Citus database clusters always use metadata syncing. That means that with Citus 11.0 beta and all future versions, you can always run distributed Postgres queries from any node.
When you start using Citus 11.0 beta, there is nothing you need to do to enable the new metadata syncing feature. Every distributed table, database object, and schema change will automatically be propagated to all the Citus worker nodes. Schema changes and node management still need to be sent to the Citus coordinator, and you can choose to send your distributed Postgres queries to the coordinator or to any other node by changing the connection string in your application.
If you need to perform many queries per second on a PostgreSQL database, it may be necessary to use a relatively large number of connections. Ultimately, your aggregate throughput is mathematically bottlenecked on [number of connections] / [avg. response time] because you can only do one query per connection at a time.
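For a rough, purely illustrative calculation: if your application keeps 500 connections open and the average response time is 20 ms, throughput tops out at 500 / 0.02 s = 25,000 queries per second; to go beyond that, you need either more connections or faster queries.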
When your application opens a connection to one of the Citus nodes, that connection spawns a Postgres process. This Postgres process will need to make internal connections to other nodes to query the shards of distributed tables. These internal connections are cached to minimize response time. That does mean that each connection from the client can eventually result in an additional internal connection to each other node, so each node will eventually get as many connections as the client makes to the whole database cluster. Fortunately, we made significant improvements to connection scalability in PostgreSQL 14, allowing Postgres (and Citus) to keep performing well at high connection counts.
If you decide to connect to the worker nodes from your application to run distributed queries, then your client connections are technically competing with internal connections. To make sure that every Postgres process that handles a client connection can also make internal connections to all other nodes, we added the citus.max_client_connections setting. This setting restricts the number of external client connections, while continuing to allow internal connections between Citus nodes. In addition to the usual installation instructions, we recommend adding the following settings to postgresql.conf on each Citus node (coordinator and all workers) to accommodate a large number of client connections:
# The maximum number of client + internal connections a node can handle
# The total number of client connections across all nodes should never exceed this number
max_connections = 6000
# The number of client connections an individual node can handle
# Should be no greater than: max_connections / node count including the coordinator
citus.max_client_connections = 500
With these settings, each node will accept up to 500 connections from your application, so if you have 10 worker nodes and 1 coordinator, then your application can make 5500 connections in total. You can increase this number further by using a connection pooler like pgbouncer on each node.
We also highly recommend adding the Citus coordinator to the metadata, such that worker nodes can connect to the coordinator as well. Some Citus features will only be available if the coordinator is in the metadata. We will likely make adding the coordinator required in the future.
-- on all nodes:
CREATE EXTENSION citus;
-- only on coordinator: add coordinator to metadata
SELECT citus_set_coordinator_host('<coordinator hostname>', 5432);
-- only on coordinator: add worker nodes to metadata
SELECT citus_add_node('<worker 1 hostname>', 5432);
SELECT citus_add_node('<worker 2 hostname>', 5432);
-- only on coordinator:
CREATE TABLE items (key text, value text);
SELECT create_distributed_table('items', 'key');
-- from any node:
INSERT INTO items VALUES ('hello', 'world');
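To sanity-check that metadata syncing is working, you can also connect to one of the worker nodes and run distributed queries there. The snippet below is just an illustrative psql session that reuses the placeholders and the items table from the example above:
-- connect to a worker instead of the coordinator, for example:
--   psql "host=<worker 1 hostname> port=5432 dbname=postgres"
-- distributed queries behave the same as on the coordinator:
SELECT count(*) FROM items;
SELECT value FROM items WHERE key = 'hello';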
Once your Citus 11.0 beta cluster is up and running, you have 2 choices: keep sending all your Postgres queries to the coordinator, or load balance your distributed queries across the worker nodes by adjusting the connection string in your application.
Example JDBC connection string that load balances across 2 workers:
jdbc:postgresql://user@worker1.host:5432,worker2.host:5432/postgres?loadBalanceHosts=true
Example Npgsql connection string that load balances across 2 workers:
Host=worker1.host,worker2.host;Database=postgres;Username=user;Load Balance Hosts=true
An alternative could be to set up a DNS record that includes all worker node IPs. A downside of using DNS is that connections from the same machine that are opened at the same time will typically use the same IP, due to local DNS caching. Another alternative is to set up a dedicated load balancer like HAProxy.
When running your Postgres queries via the Citus worker nodes with 11.0 beta, there are a few limitations to be aware of:
- Local tables on the coordinator are not automatically usable from other nodes; to include them in queries from a worker, you need to add them to the Citus metadata via the citus.use_citus_managed_tables setting or create foreign keys to reference tables.
- Sequences used to generate bigint values will include the ID of the node that you are connected to in the first 16 bits of the sequence numbers, which means sequence numbers are still unique, but not monotonic (see the illustrative query below).

We expect to address the limitations above in future Citus releases.
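To make the sequence limitation above a bit more concrete, assuming "first 16 bits" refers to the most significant bits of the 64-bit value: on the node with ID 2, generated sequence values would fall in the range that starts at 2 shifted left by 48 bits. The query below only illustrates that arithmetic; it is not Citus internals:
-- illustrative only: node ID 2 occupying the top 16 bits of a bigint
SELECT (2::bigint << 48) AS lowest_sequence_value_for_node_2;  -- 562949953421312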
If you are upgrading an existing (non-production) cluster to Citus 11.0 beta, then after installing the new packages, you will need to call a function to finalize the upgrade:
-- on all nodes
ALTER EXTENSION citus UPDATE;
-- only on the coordinator
select citus_finalize_upgrade_to_citus11();
The upgrade function will make sure that all worker nodes have the right schema and metadata. It also addresses several naming issues that affected shards of partitioned tables.
If there is anything preventing metadata from being synchronized (e.g. missing permissions on worker nodes, or conflicting objects) the upgrade function will throw an error. You can still use your existing Citus database cluster via the coordinator until you address the issue and finalize the upgrade, but some new 11.0 beta functionality will not be available.
A frequently requested Citus feature is to have better insights into what’s going on in the database cluster. This becomes even more important when some queries come in via the worker nodes.
We revamped the citus_dist_stat_activity view to show the information from pg_stat_activity from all client sessions across all nodes, along with a global_pid (or gpid), which uniquely identifies a client session and all internal connections associated with the session. The gpid starts with the node ID of the node that initiated the query, meaning the node that the client connected to.
SELECT nodeid, global_pid, state, query FROM citus_dist_stat_activity where application_name = 'psql';
┌────────┬─────────────┬────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ nodeid │ global_pid │ state │ query │
├────────┼─────────────┼────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 1 │ 10000001303 │ active │ SELECT nodeid, global_pid, state, query FROM citus_dist_stat_activity where application_name = 'psql'; │
│ 2 │ 20000001346 │ active │ select count(*), pg_sleep(300) from test; │
└────────┴─────────────┴────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────┘
If you want to cancel a specific query, you can simply pass the global_pid to pg_cancel_backend. This works from any node.
SELECT pg_cancel_backend(20000001346);
You can also see everything going on in the cluster (both distributed queries and internal queries) using the new citus_stat_activity view:
SELECT nodeid, global_pid, state, query, is_worker_query FROM citus_stat_activity WHERE global_pid = 20000001500;
┌────────┬─────────────┬────────┬──────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────────┐
│ nodeid │ global_pid │ state │ query │ is_worker_query │
├────────┼─────────────┼────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────────┤
│ 2 │ 20000001500 │ active │ select count(pg_sleep(300)) from test; │ f │
│ 2 │ 20000001500 │ active │ SELECT count(pg_sleep('300'::double precision)) AS count FROM public.test_102153 test WHERE true │ t │
│ 2 │ 20000001500 │ active │ SELECT count(pg_sleep('300'::double precision)) AS count FROM public.test_102155 test WHERE true │ t │
│ 3 │ 20000001500 │ active │ SELECT count(pg_sleep('300'::double precision)) AS count FROM public.test_102156 test WHERE true │ t │
│ 3 │ 20000001500 │ active │ SELECT count(pg_sleep('300'::double precision)) AS count FROM public.test_102154 test WHERE true │ t │
└────────┴─────────────┴────────┴──────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────┘
If you are looking at a specific node using pg_stat_activity, you can now also find the gpid that a worker query belongs to in the application_name:
select pid, application_name, state, query from pg_stat_activity where query like '%count%' and application_name <> 'psql';
┌──────┬─────────────────────────────────┬────────┬──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ pid │ application_name │ state │ query │
├──────┼─────────────────────────────────┼────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 1548 │ citus_internal gpid=10000001547 │ active │ SELECT count(pg_sleep('300'::double precision)) AS count FROM public.test_102153 test WHERE true │
│ 1550 │ citus_internal gpid=10000001547 │ active │ SELECT count(pg_sleep('300'::double precision)) AS count FROM public.test_102155 test WHERE true │
└──────┴─────────────────────────────────┴────────┴──────────────────────────────────────────────────────────────────────────────────────────────────┘
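If a worker query you find this way turns out to be a runaway, you can cancel the whole distributed query from any node by passing the gpid embedded in application_name to pg_cancel_backend, just like before:
SELECT pg_cancel_backend(10000001547);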
Since every node needs to be able to connect to every other node in a Citus cluster, we also introduced a new health check function that checks the connectivity of all possible paths. The result column indicates whether the connection attempt was successful.
select * from citus_check_cluster_node_health();
┌───────────────┬───────────────┬─────────────┬─────────────┬────────┐
│ from_nodename │ from_nodeport │ to_nodename │ to_nodeport │ result │
├───────────────┼───────────────┼─────────────┼─────────────┼────────┤
│ localhost │ 1400 │ localhost │ 1400 │ t │
│ localhost │ 1400 │ localhost │ 1401 │ t │
│ localhost │ 1400 │ localhost │ 1402 │ t │
│ localhost │ 1401 │ localhost │ 1400 │ t │
│ localhost │ 1401 │ localhost │ 1401 │ t │
│ localhost │ 1401 │ localhost │ 1402 │ t │
│ localhost │ 1402 │ localhost │ 1400 │ t │
│ localhost │ 1402 │ localhost │ 1401 │ t │
│ localhost │ 1402 │ localhost │ 1402 │ t │
└───────────────┴───────────────┴─────────────┴─────────────┴────────┘
(9 rows)
With these functions, you should have far better insight into what’s going on in your cluster even if you do all queries via the coordinator. [Update in July 2022: More details about the new capabilities for monitoring distributed Postgres activity in Citus 11, via the new citus_stat_activity and citus_lock_waits views.]
In a distributed database we often need to make trade-offs between consistency, fault tolerance, parallelism, and other distributed systems aspects. Citus needs to support PostgreSQL’s interactive, multi-statement transaction blocks, which is particularly challenging in a distributed setting.
For example, Citus typically parallelizes expensive operations across shards—such as analytical queries and create_distributed_table()—over multiple connections per worker. When creating a database object, Citus propagates it to worker nodes over a single connection per worker. Combining those two operations in a single multi-statement transaction may cause issues, because the parallel connections will not be able to see the object that was created over a single connection but not yet committed.
Consider a transaction block that creates a type, a table, loads data, and distributes the table:
BEGIN;
-- type creation over a single connection:
CREATE TYPE coordinates AS (x int, y int);
CREATE TABLE positions (object_id text primary key, position coordinates);
-- data loading thus goes over a single connection:
SELECT create_distributed_table('positions', 'object_id');
\COPY positions FROM 'positions.csv'
…
Prior to Citus 11.0 beta, Citus would defer creating the type on the worker nodes and commit it separately when doing the create_distributed_table. This enabled the data copying in create_distributed_table to happen in parallel. However, it also meant that the type was not always present on the Citus worker nodes—or, if the transaction rolled back, it would be present only on the worker nodes. We were able to hide these inconsistencies, but eventually they could cause issues.
With Citus 11.0 beta the default behavior changes to prioritize schema consistency between coordinator and worker nodes. That does have a downside: if object propagation happens after a parallel command in the same transaction, then the transaction can no longer be completed, as highlighted by the ERROR in the code block below:
BEGIN;
CREATE TABLE items (key text, value text);
-- parallel data loading:
SELECT create_distributed_table('items', 'key');
\COPY items FROM 'items.csv'
CREATE TYPE coordinates AS (x int, y int);
ERROR: cannot run type command because there was a parallel operation on a distributed table in the transaction
If you run into this issue, there are 2 simple workarounds:
- Set citus.create_object_propagation to deferred to return to the old object propagation behavior, in which case there may be some inconsistency between which database objects exist on different nodes.
- Set citus.multi_shard_modify_mode to sequential to disable per-node parallelism. Data load in the same transaction might be slower.

Back in 2016, we announced the deprecation of statement-based shard replication for high availability (HA) in favor of streaming replication. When you enable high availability in Citus on Azure, every node will have a hot standby—meaning all the shards on that node are replicated via streaming replication. Even if you do not enable high availability, data is internally replicated by managed disks to protect against any data loss.
Although deprecated, we never removed statement-based replication... and it can still be useful for scaling reads in specific scenarios. However, the deprecated HA-related logic often caused issues and prevented us from implementing metadata syncing for replicated tables. As part of the Citus 11.0 beta release, we are therefore changing the behavior as follows:
Prior to Citus 11.0 beta, when a write on a replicated shard failed on one of the shard placements, Citus marked that placement as invalid—after which the shard would have to be re-replicated. This feature never worked well, since sporadic write failures could invalidate a placement and incur expensive (write-blocking) re-replication.
Starting with Citus 11.0 beta, writes to replicated shards always use 2PC—which means they only succeed if all placements are up. In addition, the metadata for replicated tables is synced so they can be queried from any node.
Open-source users who are using statement-based shard replication today can upgrade to Citus 11.0 beta—however, to continue accepting writes to a shard when the node holding one of the replicas fails, the failed node should be disabled via the citus_disable_node function. The shards can still be re-replicated using replicate_table_shards after replacing or reactivating the node.
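As a rough sketch of that recovery flow, using only the functions mentioned above (the hostname and table name are placeholders):
-- if a worker holding one of the replicas fails, disable it so writes can continue:
SELECT citus_disable_node('<failed worker hostname>', 5432);
-- after replacing or reactivating the failed node, re-replicate the shards of a
-- replicated table:
SELECT replicate_table_shards('<replicated table name>');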
If you want to use statement-based replication to scale read throughput, you need to:
- set citus.shard_replication_factor to 2 prior to creating a distributed table, and
- set citus.task_assignment_policy to 'round-robin' to load balance queries across replicas (see the sketch below).

The downside of using statement-based replication to scale read throughput is that writes have higher response time, and updates and deletes are serialized to keep replicas in sync.
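Here is the sketch referenced above: a minimal example of using these two settings with a hypothetical events table, meant as an illustration rather than a complete recipe:
-- before creating the distributed table: keep 2 copies of every shard
SET citus.shard_replication_factor TO 2;
CREATE TABLE events (event_id bigint, payload text);
SELECT create_distributed_table('events', 'event_id');

-- in read sessions: spread queries across the shard replicas
SET citus.task_assignment_policy TO 'round-robin';
SELECT count(*) FROM events;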
Like PostgreSQL, Citus maintains long-term backwards compatibility. We go to great lengths to make sure your applications continue working when upgrading Citus. However, sometimes a feature is no longer in line with how Citus is used and gets in the way of development. There are a few Citus features which we decided to remove in 11.0 beta:
- Distributed cstore_fdw tables: cstore_fdw did not support important PostgreSQL features like streaming replication and backups, hence we rarely saw Citus customers use columnar storage before Citus 10. Many companies are now successfully using Citus’ built-in columnar storage for time series data, and we are therefore dropping support for creating or using distributed cstore_fdw tables. If you have distributed cstore_fdw tables, we recommend converting them to the columnar access method before upgrading to 11.0 beta.

Citus is the only distributed database for both transactional and analytical workloads that is fully implemented as a PostgreSQL extension—meaning that Citus supports the powerful features of PostgreSQL at scale and inherits PostgreSQL’s stability, performance, versatility, extensibility, plus a large ecosystem of tools and extensions.
With automatic metadata syncing in the 11.0 beta release of Citus open source, you now have the option to query your Citus cluster from any node, improving Citus’ scalability even further.
If you’re interested in trying out the new Citus 11.0 beta, you can find installation instructions for the beta in the Citus docs. Once you’ve installed Citus, the getting started page has a lot of good content on how to get started, including tutorials and videos. Finally, if you want to know more about how Citus works internally, check out our SIGMOD paper.