POSETTE 2024 is a wrap! 💯 Thanks for joining the fun! Missed it? Watch all 42 talks online 🍿
POSETTE 2024 is a wrap! 💯 Thanks for joining the fun! Missed it? Watch all 42 talks online 🍿
Written by Halil Ozan Akgul
May 3, 2021
This post by Halil Ozan Akgul was originally published on the Azure Database for PostgreSQL Blog on Microsoft TechCommunity.
Citus is an extension to Postgres that lets you distribute your application’s workload across multiple nodes. Whether you are using Citus open source or using Citus as part of a managed Postgres service in the cloud, one of the first things you do when you start using Citus is to distribute your tables. While distributing your Postgres tables you need to decide on some properties such as distribution column, shard count, colocation. And even before you decide on your distribution column (sometimes called a distribution key, or a sharding key), when you create a Postgres table, your table is created with an access method.
Previously you had to decide on these table properties up front, and then you went with your decision. Or if you really wanted to change your decision, you needed to start over. The good news is that in Citus 10, we introduced 2 new user-defined functions (UDFs) to make it easier for you to make changes to your distributed Postgres tables.
Before Citus 9.5, if you wanted to change any of the properties of the distributed table, you would have to create a new table with the desired properties and move everything to this new table. But in Citus 9.5 we introduced a new function, undistribute_table. With the undistribute_table
function you can convert your distributed Citus tables back to local Postgres tables and then distribute them again with the properties you wish. But undistributing and then distributing again is… 2 steps. In addition to the inconvenience of having to write 2 commands, undistributing and then distributing again has some more problems:
So, in Citus 10, we introduced 2 new functions to reduce the steps you need to make changes to your tables:
alter_distributed_table
alter_table_set_access_method
In this post you’ll find some tips about how to use the alter_distributed_table
function to change the shard count, distribution column, and the colocation of a distributed Citus table. And we’ll show how to use the alter_table_set_access_method
function to change, well, the access method. An important note: you may not ever need to change your Citus table properties. We just want you to know, if you ever do, you have the flexibility. And with these Citus tips, you will know how to make the changes.
When you distribute a Postgres table with the create_distributed_table function, you must pick a distribution column and set the distribution_column
parameter. During the distribution, Citus uses a configuration variable called shard_count for deciding the shard count of the table. You can also provide colocate_with
parameter to pick a table to colocate with (or colocation will be done automatically, if possible).
However, after the distribution if you decide you need to have a different configuration, starting from Citus 10, you can use the alter_distributed_table function.
alter_distributed_table
has three parameters you can change:
Citus divides your table into shards based on the distribution column you select while distributing. Picking the right distribution column is crucial for having a good distributed database experience. A good distribution column will help you parallelize your data and workload better by dividing your data evenly and keeping related data points close to each other.However, choosing the distribution column might be a bit tricky when you’re first getting started. Or perhaps later as you make changes in your application, you might need to pick another distribution column.
With the distribution_column
parameter of the new alter_distributed_table
function, Citus 10 gives you an easy way to change the distribution column.
Let’s say you have customers and orders that your customers make. You will create some Postgres tables like these:
CREATE TABLE customers (customer_id BIGINT, name TEXT, address TEXT);
CREATE TABLE orders (order_id BIGINT, customer_id BIGINT, products BIGINT[]);
When first distributing your Postgres tables with Citus, let’s say that you decided to distribute the customers
table on customer_id
and the orders
table on order_id
.
SELECT create_distributed_table ('customers', 'customer_id');
SELECT create_distributed_table ('orders', 'order_id');
Later you might realize distributing the orders
table by the order_id
column might not be the best idea. Even though order_id
could be a good column to evenly distribute your data, it is not a good choice if you frequently need to join the orders
table with the customers
table on the customer_id
. When both tables are distributed by customer_id
you can use colocated joins, which are very efficient compared to joins on other columns.
So, if you decide to change the distribution column of orders
table into customer_id
here is how you do it:
SELECT alter_distributed_table ('orders', distribution_column := 'customer_id');
Now the orders
table is distributed by customer_id
. So, the customers and the orders of the customers are in the same node and close to each other, and you can have fast joins and foreign keys that include the customer_id
.
You can see the new distribution column on the citus_tables view:
SELECT distribution_column FROM citus_tables WHERE table_name::text = 'orders';
Shard count of a distributed Citus table is the number of pieces the distributed table is divided into. Choosing the shard count is a balance between the flexibility of having more shards, and the overhead for query planning and execution across the shards. Like distribution column, the shard count is also set while distributing the table. If you want to pick a different shard count than the default for a table, during the distribution process you can use the citus.shard_count configuration variable, like this:
CREATE TABLE products (id BIGINT, name TEXT);
SET citus.shard_count TO 20;
SELECT create_distributed_table ('products', 'id');
After distributing your table, you might decide the shard count you set was not the best option. Or your first decision on the shard count might be good for a while but your application might grow in time, you might add new nodes to your Citus cluster, and you might need more shards. The alter_distributed_table
function has you covered in the cases that you want to change the shard count too.
To change the shard count you just use the shard_count
parameter:
SELECT alter_distributed_table ('products', shard_count := 30);
After the query above, your table will have 30 shards. You can see your table’s shard count on the citus_tables
view:
SELECT shard_count FROM citus_tables WHERE table_name::text = 'products';
When two Postgres tables are colocated in Citus, the rows of the tables that have the same value in the distribution column will be on the same Citus node. Colocating the right tables will help you with better relational operations. Like the shard count and the distribution column, the colocation is also set while distributing your tables. You can use the colocate_with
parameter to change the colocation.
SELECT alter_distributed_table ('products', colocate_with := 'customers');
Again, like the distribution column and shard count, you can find information about your tables’ colocation groups on the citus_tables
view:
SELECT colocation_id FROM citus_tables WHERE table_name IN ('products', 'customers');
You can also use default
and none
keywords with colocate_with
parameter to change the colocation group of the table to default, or to break any colocation your table has.
To colocate distributed Citus tables, the distributed tables need to have the same shard counts. But if the tables you want to colocate don’t have the same shard count, worry not, because alter_distributed_table
will automatically understand this. Then your table’s shard count will also be updated to match the new colocation group’s shard count.
Here is a tip! If you want to change multiple properties of your distributed Citus tables at the same time, you can simply use multiple parameters of the alter_distributed_table
function.
For example, if you want to change both the shard count and the distribution column of a table here’s how you do it:
SELECT alter_distributed_table ('products', distribution_column := 'name', shard_count := 35);
If your Postgres table is colocated with some other tables and you want to change the shard count of all of the tables to keep the colocation, you might be wondering if you have to alter them one by one… which is multiple steps.
Yes (you can see a pattern here) the Citus tip is that you can use the alter_distributed_table
function to change the properties of all of the colocation group.
If you decide the change you make with the alter_distributed_table
function needs to be done to all the tables that are colocated with the table you are changing, you can use the cascade_to_colocated
parameter:
SET citus.shard_count TO 10;
SELECT create_distributed_table ('customers', 'customer_id');
SELECT create_distributed_table ('orders', 'customer_id', colocate_with := 'customers');
-- when you decide to change the shard count
-- of all of the colocation group
SELECT alter_distributed_table ('customers', shard_count := 20, cascade_to_colocated := true);
You can see the updated shard count of both tables on the citus_tables
view:
SELECT shard_count FROM citus_tables WHERE table_name IN ('customers', 'orders');
Another amazing feature introduced in Citus 10 is columnar storage. This Citus 10 columnar blog post walks you through how it works and how to use columnar tables (or partitions) with Citus—complete with a Quickstart. Oh, and Jeff made a short video demo about the new Citus 10 columnar functionality too—it’s worth the 13 minutes to watch IMHO.
With Citus columnar, you can optionally choose to store your tables grouped by columns—which gives you the benefits of compression, too. Of course, you don’t have to use the new columnar access method—the default access method is "heap" and if you don’t specify an access method, then your tables will be row-based tables (with the heap access method.)
It would not be fair to introduce this cool new Citus columnar access method without also giving you a way to convert your tables to columnar. So Citus 10 also introduced a way to change the access method of tables.
SELECT alter_table_set_access_method('orders', 'columnar');
You can use alter_table_set_access_method to convert your table to any other access method too, such as heap
, Postgres’s default access method. Also, your table doesn’t even need to be a distributed Citus table. You can also use alter_table_set_access_method
with Citus reference tables as well as regular Postgres tables. You can even change the access method of a Postgres partition with alter_table_set_access_method
.
If you’ve read the blog post about undistribute_table, the function Citus 9.5 introduced for turning distributed Citus tables back to local Postgres tables, you mostly know how the alter_distributed_table
and alter_table_set_access_method
functions work. Because we use the same underlying methodology as the undistribute_table
function. Well, we improved upon it.
The alter_distributed_table
and alter_table_set_access_method
functions:
Dropping a table for the purpose of re-creating the same table with different properties is not a simple task. Dropping the table will also drop many things that depend on the table.
Just like the undistribute_table
function, the alter_distributed_table
and alter_table_set_access_method
functions do a lot to preserve the properties of the table you didn’t want to change. The functions will handle indexes, sequences, views, constraints, table owner, partitions and more—just like undistribute_table.
alter_distributed_table
and alter_table_set_access_method
will also recreate the foreign keys on your tables whenever possible. For example, if you change the shard count of a table with the alter_distributed_table
function and use cascade_to_colocated := true
to change the shard count of all the colocated tables, then foreign keys within the colocation group and foreign keys from the distributed tables of the colocation group to Citus reference tables will be recreated.
If you want to learn more about our previous work which we build on for alter_distributed_table
and alter_table_set_access_method
functions go check out our blog post on undistribute_table.
In Citus 10 we worked to give you more tools and more capabilities for making changes to your distributed database. When you’re just starting to use Citus, the new alter_distributed_table
and alter_table_set_access_method
functions—along with the undistribute_table
function—are all here to help you experiment and find the database configuration that works the best for your application. And in the future, if and when your application evolves, these three Citus functions will be ready to help you evolve your Citus database, too.