Sharding Postgres on a single Citus node: how, why & when

Written by Onder Kalaci
March 20, 2021

Update in October 2022: The Citus managed database service is now available in the cloud as Azure Cosmos DB for PostgreSQL. And as always, the Citus database is also available as open source: you can find the Citus repo on GitHub or download Citus here.

One of the big new things in Citus 10 is that you can now shard Postgres on a single Citus node. So in addition to using the Citus extension to Postgres to scale out Postgres across a distributed cluster, you can now also:

  • Try out Citus on a single node with just a few simple commands
  • Shard Postgres on a single Citus node to be “scale-out-ready”
  • Simplify CI/CD pipelines by testing with single-node Citus

The Citus 10 release is chock full of new capabilities like columnar storage for Postgres, the open sourcing of the shard rebalancer, as well as the feature we are going to explore here: using Citus on a single node. No matter what type of application you run on top of Citus—multi-tenant SaaS apps, customer-facing analytics dashboards, time-series workloads, high-throughput transactional apps—there is something for everyone in Citus 10.

In this post, let’s walk through how to try out Citus on a single node. And let’s dive deep into some of the ways sharding Postgres on a single node can help you.

Try out Citus on a single node, with just a few simple commands (a Quick Start guide)

My favorite way to get started with Citus is to run Citus on a single node in a Docker container.

If you prefer, you can instead download the Citus open source packages, which work for both single-node and multi-node installations.

The docker container includes:

  • The latest version of Postgres (PG 13, as of this writing)
  • The latest version of Citus (Citus 10.0.3, as of this writing)
  • The Citus extension already created via CREATE EXTENSION citus;

# run PostgreSQL with single-node Citus on port 5500
docker run -d --name citus_coordinator -p 5500:5432 -e POSTGRES_PASSWORD=mypassword citusdata/citus

# connect using psql within the Docker container
docker exec -it citus_coordinator psql -U postgres

Now that you have a Postgres database with Citus installed (Citus 10.0 or later), you can use your favorite tool—such as psql—to connect to the database, and start creating distributed tables on your single Citus node.

-- Create a table with the usual PostgreSQL syntax
CREATE TABLE users_table (user_id bigserial primary key, age int);

-- Convert the table to a distributed table
SELECT create_distributed_table('users_table', 'user_id');

By default, the Citus create_distributed_table function will divide the Postgres table across 32 shards, but you can still query it as if all the data were in one table. Because the shards are regular Postgres tables, you can still rely on the wealth of Postgres relational database features like transactions, indexes, database constraints, JOINs, and more.
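
If you are curious where those 32 shards actually live, you can peek at the metadata views that ship with Citus 10, citus_tables and citus_shards. A quick sketch (the exact columns may vary by version):

-- one row per distributed table, including its shard count
SELECT table_name, citus_table_type, shard_count FROM citus_tables;

-- one row per shard, showing which node each shard lives on
SELECT shard_name, nodename, nodeport
FROM citus_shards
WHERE table_name = 'users_table'::regclass;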

With just a few commands, you can run a Citus database on a single node, with a distributed table. Now, let me talk more about the value that sharding Postgres on a single node gives you.

Shard Postgres on a single Citus node to be “scale-out-ready”

Sharding Postgres has long been associated with large-scale data sizes. In fact, when most of you think about how Citus shards Postgres, you probably imagine a distributed database cluster with 2 or 4 worker nodes, or perhaps 20 or 50 or even 100 worker nodes. But with the introduction of single-node Citus in Citus 10, we can all think about sharding differently.

Even if the data sizes are not large, sharding Postgres on a single node may provide immediate benefits. By using a distributed data model on a single Citus node, you can get:

  • query parallelization for multi-shard queries
  • smaller indexes to create/maintain
  • smaller tables to auto-vacuum (in parallel!), and
  • faster bulk data loads

For example, after sharding the users_table above with the create_distributed_table function, the following SQL commands will now run in parallel across the shards, which can significantly decrease the execution times:

-- load data, ingest happens in parallel across shards
INSERT INTO users_table (age)
       SELECT 20 + (random() * 70)::int
       FROM generate_series(0, 100000);

-- this query runs in parallel across all shards
SELECT avg(age) FROM users_table;

-- index created in parallel across all shards
CREATE INDEX user_age ON users_table (age);

Queries on distributed Citus tables are not always faster than queries on regular Postgres tables. For example, Citus won’t be able to provide any parallelism benefits for a SQL query that only involves a single shard, such as this SELECT query below:

-- this query hits a single shard as sharding key is on WHERE clause
SELECT age FROM users_table WHERE user_id = 15;

Citus needs to do some additional CPU work on top of what Postgres does. The work involves two steps:

  • decide which shard(s) the query should be routed to, and
  • rewrite the table name(s) in the query to the corresponding shard name(s).

In practice, the additional overhead is small enough that many applications will not even notice the difference.
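
One way to see this routing at work is with EXPLAIN: for a query that filters on the distribution column, the plan shows a single Citus task sent to one shard (the exact plan text varies by Citus version):

-- the plan should show a Custom Scan (Citus Adaptive) with Task Count: 1,
-- meaning the query was routed to a single shard
EXPLAIN SELECT age FROM users_table WHERE user_id = 15;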

When your data no longer fits in memory—or if the database server becomes CPU bound—then with Citus on a single node, your data will already be sharded and you’ll be able to easily add more nodes to your cluster to keep your database performant. In other words, you are already prepared to scale out, or what I like to call “scale-out-ready.”

Easy to rebalance shards after adding new Citus nodes in the future

With the following steps, you can add nodes to your Citus database cluster and rebalance shards across the cluster. (Yes, the Citus shard rebalancer is now open source, as of Citus 10.)

The result: you can scale out your Citus database to as many nodes as you need. While a shard is being moved, writes to that shard are blocked, but all reads and writes to other shards can continue. If you are using Citus on Azure, we have some extra tricks to make writes to the shards being moved non-blocking as well. Update in October 2022: good news, as of Citus 11.0, the Citus open source shard rebalancer is also non-blocking for writes.
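
If you want to keep an eye on a long-running rebalance, you can poll its progress from a separate session; a small sketch using the rebalancer's progress function:

-- run from another psql session while rebalance_table_shards() is active
SELECT * FROM get_rebalance_progress();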

Figure 1: Sharding Postgres on a single Citus node and adopting a distributed data model from the beginning can make it easy for you to scale out your Postgres database at any time, to any scale. I like to call this being “scale-out-ready” with Citus.

Let’s add 2 more Citus worker nodes and scale out the database:

# start the first worker node on port 5501
docker run -d --name citus_worker_1 -p 5501:5432 -e POSTGRES_PASSWORD=mypassword citusdata/citus

# start the second worker on port 5502
docker run -d --name citus_worker_2 -p 5502:5432 -e POSTGRES_PASSWORD=mypassword citusdata/citus

In the example below, I use '192.168.2.1' for the <hostname or your ip address> parameter. When you try this on your network, make sure to use your machine’s IP address (or hostname) so that the Citus nodes can communicate with each other.

Now, let’s connect back to the coordinator via psql:

# connect using psql to the Citus coordinator's Docker container
docker exec -it citus_coordinator psql -U postgres

And scale out your Postgres database with a few SQL commands:

-- tell future worker nodes how to reach coordinator
SELECT citus_set_coordinator_host('<hostname or your ip address>', 5500);

-- add worker nodes to Citus metadata
SELECT citus_add_node('<hostname or your ip address>', 5501);
SELECT citus_add_node('<hostname or your ip address>', 5502);

-- move shards to new worker node(s)
SELECT rebalance_table_shards();
NOTICE:  Moving shard 102008 from 192.168.2.1:5500 to 192.168.2.1:5501 ...
NOTICE:  Moving shard 102009 from 192.168.2.1:5500 to 192.168.2.1:5502 ...
....
NOTICE:  Moving shard 102028 from 192.168.2.1:5500 to 192.168.2.1:5501 ...

-- OPTIONAL: if you do not want any shards on the coordinator,
-- move shards from coordinator to worker nodes
SELECT citus_drain_node('<hostname or your ip address>', 5500);
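
To confirm the workers were registered, you can query the coordinator’s metadata (the function below uses the citus_* naming introduced in Citus 10):

-- list the active worker nodes the coordinator knows about
SELECT * FROM citus_get_active_worker_nodes();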

Now that the shards have been distributed, the database can use the resources on the worker node(s) as well. From your application’s perspective, nothing has changed. After adding 2 new nodes to the Citus database cluster, and after rebalancing shards across the cluster, your application is still talking to the same Postgres database. Congratulations, you have seamlessly scaled out your Postgres database with Citus!

Why use Citus if you don't need to scale out Postgres right now?

You may be wondering, “Why use Citus if I don’t need to scale out Postgres right now?” Well, if you think that your database is not going to grow in the future (and that your database will remain less than ~100GB), then vanilla Postgres is likely to handle your workload very well. However, if you expect your database to grow over time, then read on.

If you are already sharding Postgres with Citus on a single node, then as your application grows (in terms of users or activity or features or database size) and you need to scale, you’ll be able to easily add nodes and use the Citus shard rebalancer. Adding new nodes will enable you to fit more data into memory, have higher I/O bandwidth, and provide more compute power (CPU) for your database. In other words, you will have all the means to keep your database performant for your application even as your application grows.

How to make your Postgres database scale-out-ready

There are a few things you need to take into account to make your database scale-out-ready. In scale-out architectures, the data is distributed based on a sharding key (what we sometimes call the distribution column in Citus). The sharding key you choose will influence not only how the data gets distributed, but also what kind of query performance improvements you will get. So a bit of up-front thinking about your query patterns and your data model can go a long way. For example, it is often not performant (or even possible) to enforce UNIQUE constraints on columns that do not include the distribution key, as the sketch below illustrates.
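
As a minimal sketch of that last point, here is a hypothetical multi-tenant orders table (the table and column names are illustrative, not from this post): including the distribution column tenant_id in the primary key lets Citus enforce uniqueness locally within each shard.

-- hypothetical multi-tenant table, sharded by tenant_id
CREATE TABLE orders (
    tenant_id bigint NOT NULL,
    order_id bigserial,
    total numeric,
    -- unique and primary key constraints must include the distribution column
    PRIMARY KEY (tenant_id, order_id)
);

-- distribute the table on the tenant_id sharding key
SELECT create_distributed_table('orders', 'tenant_id');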

If you follow the data modeling best practices of a distributed database like Citus, you can be what I like to call “scale-out-ready”. The Citus best practices for scaling out your Postgres database seamlessly include picking the right distribution column, co-locating related tables that are queried together, and including the distribution column in your primary keys and unique constraints.

If you build your database conforming to these best practices, the promise of Citus is that you’ll be able to scale out your Postgres database to some pretty large database cluster sizes. The key point here is that once you get used to the mindset of scale-out systems, you’ll realize how easy it is to follow the data modeling best practices.

Figure 2: Start with Citus on a single Postgres node, and easily add more nodes as your application grows.

Simplify CI/CD pipelines for your Citus applications

The third interesting scenario for using Citus on a single node is in your CI/CD pipelines. If Citus is part of your CI/CD workflow, setting up a test environment is now much simpler. Instead of standing up two or more database servers, you can set up Citus on a single node, which provides the same experience and coverage. In other words, whatever you can (or cannot) do on multi-node Citus, you can do the same on single-node Citus.
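
As an illustration, a CI job could spin up a throwaway single-node Citus container, wait for Postgres to accept connections, run the tests, and tear it down. A sketch with hypothetical container names and ports:

# start a disposable single-node Citus container for the test run
docker run -d --name citus_ci -p 5600:5432 -e POSTGRES_PASSWORD=test citusdata/citus

# wait until Postgres inside the container accepts connections
until docker exec citus_ci pg_isready -U postgres > /dev/null 2>&1; do sleep 1; done

# run your integration test suite against localhost:5600 here, then clean up
docker rm -f citus_ci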

Back before we made single-node Citus a first-class citizen in the Citus 10 release, we shared a sneak preview of sharding on a single node in our Citus 9.5 blog post. And we received this lovely feedback from Ivan Vyazmitinov, one of our Citus users, just a few weeks later:

“A piece of feedback about the new ‘single node cluster’ feature from the blog post: after a little bit of testing we’ve immediately switched all of our integration tests to it. It is literally a lifesaver in terms of complexity of configuration of tests and it also completely eliminated our tests flakiness issue due to errors from unhealthy citus containers on tests’ startup.”

Using single-node Citus in the cloud

One of the easiest ways to run Citus is in the cloud: you can deploy the Hyperscale (Citus) option in the Azure Database for PostgreSQL managed service.

For those of you who use Citus as part of the Azure Database for PostgreSQL managed service, the good news is that as of August 2021, Citus 10 is available on Azure, including the new single-node configuration. More details in Nik’s initial preview post on what’s new and his follow-on post announcing that Citus 10 is GA in Hyperscale (Citus).

Citus on a single node opens the door to new possibilities

We are very excited to bring you the new Citus on a single node feature. And while “opens the door to new possibilities” may sound lofty, it’s true. Single-node Citus gives you a way to be “scale-out-ready” on day zero. If your application is already running on single-node Postgres, you now have a way to adopt a data model that allows you to scale your Postgres database as much as you need in the future. In essence, as your application grows and you need to scale, you won’t be faced with any database migration challenges.

Perhaps more importantly, with the new Citus single node feature in Citus 10, it’s easier than ever to try out Citus. And you can do so with just a single docker run command.

Here are some of the best links if you want to try out Citus today:

  • Getting Started page curated by one of my teammates, with useful resources no matter which way you prefer to learn (reading, watching, doing)
  • Citus open source repo on GitHub, to report issues, check out the source code, & learn more
  • Download page, if you’re ready to install the Citus packages on the OS of your choice, and start kicking the tires
  • Citus Public Slack, a Q&A forum for the open source community

Written by Onder Kalaci

Former principal software engineer on the Postgres team at Microsoft. Joined Citus Data to scale out Postgres. Thread level parallelism = perfect topic for dinner conversation.
