Citus Data Blog - Articles by Halil Ozan Akgul

Tenant monitoring in Citus & Postgres with citus_stat_tenants

Halil Ozan Akgul — 2023-05-12T16:01:00+00:00

If you have ever used a database like Postgres, you know how important optimization is. A few minor changes in how the database is set up can make all the difference between long waiting times and satisfied customers. And one crucial thing you need before optimizing is to monitor and understand how your database is being used.

Citus is an extension to Postgres that improves scalability and parallelization by distributing your Postgres database across nodes in a cluster. The Citus database extension is available as open source and as a managed service on the cloud, as Azure Cosmos DB for PostgreSQL. You can track your Citus nodes and the Postgres tables, but Citus 11.3 takes it one step further and introduces a new way to gather insight on your Citus database with tenant monitoring.

The new Citus 11.3 release, among many other features, introduces a new citus_stat_tenants view to track your most active tenants, for those with multi-tenant SaaS applications.

In a multi-tenant SaaS application, the same database stores data from the multiple customers of the application. Each customer is often referred to as a tenant. Usually the data for each tenant is handled separately. When you distribute your Postgres table using the Citus create_distributed_table function, every partition key value represents a tenant.

With the new citus_stat_tenants view you can track:

read query count (SELECT queries),
total query count (SELECT, INSERT, DELETE, and UPDATE queries),
total CPU usage in seconds

In this post, you’ll learn how to monitor your top tenants to make more-informed decisions on your database—and you’ll learn how to configure citus_stat_tenants to best fit your application. This post includes a quickstart and code examples. Let’s explore:

Monitor your top tenants with citus_stat_tenants
Monitoring tenants in real time
Insights on the most active tenants (versus on all tenants)
Citus stores extra data to ensure consistency of tenant-level statistics
Local stats from Citus worker nodes
Optional choice to clear tenant statistics history
Getting started with Citus & Citus tenant monitoring for multi-tenant SaaS

Monitor your top tenants with citus_stat_tenants

Let us say you have a multi-tenant app with a Citus database, similar to the one in our Citus 11.3 blog post, and your customers are companies that have some ad campaigns.

CREATE TABLE companies (id BIGSERIAL, name TEXT);
SELECT create_distributed_table ('companies', 'id');

CREATE TABLE campaigns (id BIGSERIAL, company_id BIGINT, name TEXT);
SELECT create_distributed_table ('campaigns', 'company_id');

Each of the companies will be a tenant and the company ids are the tenant ids, or the tenant attributes.

Now we can run some queries to track with the citus_stat_tenants view.

Tip: You need to set citus.stat_tenants_track to all on all your nodes to be able to track the statistics. You can put the setting in the postgresql.conf file.
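As a minimal sketch of the tip above, the setting can also be written from SQL instead of editing postgresql.conf by hand. ALTER SYSTEM and pg_reload_conf() are standard Postgres. That a configuration reload (rather than a restart) is enough for this setting is an assumption to verify on your Citus version.

```sql
-- Run on each node in the cluster (coordinator and workers).
ALTER SYSTEM SET citus.stat_tenants_track = 'all';

-- Assumption: a configuration reload is enough for this setting to take effect.
SELECT pg_reload_conf();

-- Confirm the value the current node is using.
SHOW citus.stat_tenants_track;
```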

Let us run some queries.

INSERT INTO companies (name) VALUES ('GigaMarket');
INSERT INTO campaigns (company_id, name) VALUES (1, 'Crazy Wednesday'), (1, 'Frozen Food Frenzy');
INSERT INTO campaigns (company_id, name) VALUES (1, 'Spring Cleaning'), (1, 'Bread&Butter');
INSERT INTO campaigns (company_id, name) VALUES (1, 'Personal Care Refresh'), (1, 'Lazy Lunch');

INSERT INTO companies (name) VALUES ('White Bouquet Flowers');
INSERT INTO campaigns (company_id, name) VALUES (2, 'Bonjour Begonia'), (2, 'April Selection'), (2, 'May Selection');

INSERT INTO companies (name) VALUES ('Smart Pants Co.');
INSERT INTO campaigns (company_id, name) VALUES (3, 'Short Shorts'), (3, 'Tailors Cut');
INSERT INTO campaigns (company_id, name) VALUES (3, 'Smarter Casual');

SELECT COUNT(*) FROM campaigns WHERE company_id = 1;
count
-------
     6
(1 row)
SELECT name FROM campaigns WHERE company_id = 2 AND name LIKE '%Selection';
      name
-----------------
 April Selection
 May Selection
(2 rows)
UPDATE campaigns SET name = 'Tailor''s Cut' WHERE company_id = 3 AND name = 'Tailors Cut';

Now you can check citus_stat_tenants to see the statistics.

SELECT tenant_attribute,
       read_count_in_this_period,
       query_count_in_this_period,
       cpu_usage_in_this_period
FROM citus_stat_tenants;
tenant_attribute | read_count_in_this_period | query_count_in_this_period | cpu_usage_in_this_period
------------------+---------------------------+----------------------------+--------------------------
 1                |                         1 |                          5 |                 0.000299
 3                |                         0 |                          4 |                 0.000314
 2                |                         1 |                          3 |                 0.000295
(3 rows)

Now you have insight on your tenants’ activities.

Monitoring tenants in real time

Activity trends of your tenants might change over time. A new tenant could be much more active than a year-old one. Instead of tracking the activity of your tenants since they entered the database, citus_stat_tenants monitors them within time buckets. Each time period’s query and CPU statistics are counted separately. Once a period ends, that period’s numbers are finalized and only stored for one more period. That means citus_stat_tenants only shows the current and the last period’s statistics.

A period is 60 seconds by default, and you can change it with the citus.stat_tenants_period parameter.

Let us say you waited 60 seconds (1 period) and ran new queries.

SELECT companies.name company, campaigns.name campaign
FROM companies JOIN campaigns ON companies.id = campaigns.company_id
WHERE companies.id = 1 AND campaigns.name LIKE '%zy%';
  company   |      campaign
------------+--------------------
 GigaMarket | Crazy Wednesday
 GigaMarket | Frozen Food Frenzy
 GigaMarket | Lazy Lunch
(3 rows)
DELETE FROM campaigns WHERE company_id = 2 AND name = 'April Selection';

If you check the citus_stat_tenants view you can see something like:

SELECT * FROM citus_stat_tenants WHERE tenant_attribute = '1' OR tenant_attribute = '2';
-[ RECORD 1 ]--------------+---------
nodeid                     | 2
colocation_id              | 1
tenant_attribute           | 1
read_count_in_this_period  | 1
read_count_in_last_period  | 1
query_count_in_this_period | 1
query_count_in_last_period | 5
cpu_usage_in_this_period   | 0.000054
cpu_usage_in_last_period   | 0.000299
-[ RECORD 2 ]--------------+---------
nodeid                     | 1
colocation_id              | 1
tenant_attribute           | 2
read_count_in_this_period  | 0
read_count_in_last_period  | 1
query_count_in_this_period | 1
query_count_in_last_period | 3
cpu_usage_in_this_period   | 0.000132
cpu_usage_in_last_period   | 0.000295

Note that in addition to query and CPU usage, citus_stat_tenants also includes these columns:

nodeid: id of the Citus node the tenant lives in, can also be seen in pg_dist_node
colocation_id: Colocation group id for the distributed table the tenant is from
tenant_attribute: id of the tenant (the partition key value)

If you want to create graphs and dashboards with citus_stat_tenants, you can sample the “last period” columns once every period. The values in the “last period” columns are final and no longer change, but keep in mind that they are removed once the next period is over.
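One way to feed a dashboard is to copy the finalized “last period” values into a table of your own once per period. The sketch below is only an illustration: the tenant_stats_history table and its column names are hypothetical, and how you schedule the INSERT (an external scheduler, for example) is left open.

```sql
-- Hypothetical history table for finalized per-period statistics.
CREATE TABLE IF NOT EXISTS tenant_stats_history (
    captured_at      timestamptz NOT NULL DEFAULT now(),
    nodeid           int,
    colocation_id    int,
    tenant_attribute text,
    read_count       bigint,
    query_count      bigint,
    cpu_usage        double precision
);

-- Run once per period: the "last period" values no longer change,
-- so each snapshot records one finished period.
INSERT INTO tenant_stats_history
    (nodeid, colocation_id, tenant_attribute, read_count, query_count, cpu_usage)
SELECT nodeid,
       colocation_id,
       tenant_attribute,
       read_count_in_last_period,
       query_count_in_last_period,
       cpu_usage_in_last_period
FROM citus_stat_tenants;
```

If the snapshot runs less often than once per period, some periods will be missing from the history, because only the current and the last period are kept in the view.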

Insights on the most active tenants (versus on all tenants)

The new tenant monitoring feature of Citus 11.3 is not meant for investigating each and every tenant in your Citus database.

Rather, the new citus_stat_tenants is designed to help you gather information on your most active tenants. The tenant monitoring view will store and show the tenants that ran the most queries most recently.

If a tenant in your Citus database stops running queries on your distributed Postgres tables, that tenant’s statistics will eventually be removed from the monitor to open space for the more active tenants.

By default, citus_stat_tenants will show 100 tenants. You can change the number of tenants stored with the citus.stat_tenants_limit parameter.
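As a rough sketch, this is how you might raise the limit and then list the busiest tenants of the last finished period. The value 200 and the LIMIT of 10 are arbitrary examples, and whether citus.stat_tenants_limit needs a server restart (instead of a reload) to take effect is an assumption to check for your Citus version.

```sql
-- Keep more tenants in the monitor (example value).
-- Assumption: this setting may only take effect after a restart.
ALTER SYSTEM SET citus.stat_tenants_limit = 200;

-- The ten most active tenants of the last finished period.
SELECT tenant_attribute,
       query_count_in_last_period,
       cpu_usage_in_last_period
FROM citus_stat_tenants
ORDER BY query_count_in_last_period DESC
LIMIT 10;
```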

Citus stores extra data to ensure consistency of tenant-level statistics

The tenant monitor will show the top citus.stat_tenants_limit number of tenants and only for the current and the last period. But to make the monitor and the statistics consistent and more useful, in the background, citus_stat_tenants stores some more tenants and some extra data related to recency and the number of queries. Citus stores this extra data to make sure scenarios like “an active tenant dropping out of the monitor list, just because the tenant didn’t run queries during one period” do not happen.

Local stats from Citus worker nodes

You can query citus_stat_tenants from any node and the view will show the statistics from the whole cluster. However, the tracking of the data is done locally on the Citus node where the tenant resides.

You might not be interested in the whole Citus cluster’s data, if:

you plan to create a set of graphs for each of your Citus nodes,
you used the tenant isolation feature and are not particularly interested in the isolated tenants’ usages, or
you have a specific Citus node which you are trying to optimize

If you just want to get the monitoring data from the node you are connected to, you can use the citus_stat_tenants_local view.

From Citus worker node with node id 1:

SELECT * FROM citus_stat_tenants_local;
-[ RECORD 1 ]--------------+--
colocation_id              | 1
tenant_attribute           | 2
read_count_in_this_period  | 0
read_count_in_last_period  | 1
query_count_in_this_period | 1
query_count_in_last_period | 3
cpu_usage_in_this_period   | 0.000132
cpu_usage_in_last_period   | 0.000295

Note that you can only see the tenants whose shards are in the Citus node with node id 1 and you no longer have the nodeid column that you had in citus_stat_tenants view.

Optional choice to clear tenant statistics history

The citus_stat_tenants view only shows statistics for the current and the last period. But if you want to reset the monitor and clear all the data, you can use the citus_stat_tenants_reset function.

SELECT citus_stat_tenants_reset();
SELECT COUNT(*) FROM citus_stat_tenants;
count
-------
     0
(1 row)

There is also a function for cleaning local monitor data of a single Citus node, citus_stat_tenants_reset_local(). Keep in mind that resetting the data for only one Citus node might create inconsistent results across the cluster, so for most cases you should not need the local reset function.

Getting started with Citus & Citus tenant monitoring for multi-tenant SaaS

At Citus we try to make your database faster every day, which includes giving you the tools to optimize your database performance yourself.

Tenant monitoring is created with this thought in mind. Now that you have the tools to learn more about the activity and usage statistics for the top tenants in your SaaS application, you can make more-informed decisions on how to optimize your Citus database and cluster.

To learn more about Citus 11.3 and tenant monitoring:

What’s new in Citus 11.3 for multi-tenant SaaS apps
Livestream of the Citus 11.3 Release Party, with demos – Will include a demo of tenant monitoring, happening on Mon 15 May @ 9:00am PDT (mark your calendar)
Tenant monitoring section of the 11.3 Updates / Release notes page

This article was originally published on citusdata.com.

Monitor distributed Postgres activity with citus_stat_activity & citus_lock_waits

Halil Ozan Akgul — 2022-07-21T16:51:00+00:00

We released Citus 11 in the previous weeks and it is packed. Citus went full open source, so previously enterprise features like the non-blocking aspect of the shard rebalancer and multi-user support are now open source for everyone to enjoy. One other huge change in Citus 11 is that you can now query your distributed Postgres tables from any Citus node, by default.

When using Citus to distribute Postgres before Citus 11, the coordinator node was your application’s only point of contact. Your application needed to connect to the coordinator to query your distributed Postgres tables. The coordinator node can handle high query throughput, about 100K per second, but your application might need even more processing power. Thanks to our work in Citus 11 you can now query from any node in the Citus database cluster. In Citus 11 we sync the metadata to all nodes by default, so you can connect to any node and run queries on your tables.

Running queries from any node is awesome, but you also need to be able to monitor and manage your queries from any node. Before, when you only connected to the coordinator, using Postgres’ monitoring tools was enough, but this is not the case anymore. So in Citus 11 we added some ways to observe your queries, similar to what you would do in a single Postgres instance.

In this blogpost you’ll learn about some new monitoring tools introduced in Citus 11 that’ll help you track and take control of your distributed queries, including:

New global PID for Citus
New citus_stat_activity view
Updated citus_dist_stat_activity view
Updated citus_lock_waits view
Cancel a query from any node with pg_cancel_backend
Even more helper functions

New identifier for Citus processes: Global PID

New in Citus 11 is the global process identifier, which we call global PID or GPID for short. The global PID is just like Postgres’ process id, but it is unique across a Citus cluster.

To find the global PID of your current backend, you can use our new citus_backend_gpid function:

SELECT citus_backend_gpid();
citus_backend_gpid
--------------------
        110000000123

We tried to make GPIDs human readable, so they consist of the Postgres PID of your current backend and the node id of your current node. The lowest 10 digits of the global PID are the Postgres process id (which you can find with the pg_backend_pid function) and the rest of the digits are the node id, which you can find in the pg_dist_node table.

Figure 1: Global PID, the new identifier number for queries in Citus database clusters, consists of the Citus node id (of the node the query started in) followed by the Postgres PID of the backend of the query.

Global PIDs are unique in a Citus cluster. Also, remember that a distributed query might need to run some queries on the shards. Those shard query executions also get the same GPID. In other words, all the activity of a distributed query can be traced via the same GPID. For example, if you run a SELECT query that has the GPID 110000000123, the queries that will SELECT from the shards will also have 110000000123 as global PID.

Note that global PIDs are big integers, whereas Postgres PIDs are 4-byte integers.
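The layout described above can be checked with plain integer arithmetic in any Postgres session. This is only an illustration of the "lowest 10 digits" rule from the text, using the example GPID 110000000123. It is not how Citus itself computes the value.

```sql
-- Node id: everything above the lowest 10 digits.
SELECT 110000000123 / 10000000000 AS node_id;        -- 11

-- Postgres PID: the lowest 10 digits.
SELECT 110000000123 % 10000000000 AS postgres_pid;   -- 123

-- Putting the two parts back together.
SELECT 11 * 10000000000 + 123 AS global_pid;         -- 110000000123

-- The literal does not fit in a 4-byte integer, so Postgres types it as bigint.
SELECT pg_typeof(110000000123);                      -- bigint
```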

New citus_stat_activity view to give you pg_stat_activity views across a Citus cluster

To find the Citus global PIDs and more information about your Postgres queries in a Citus cluster, you can use our new citus_stat_activity view. citus_stat_activity is a collection of pg_stat_activity views from all nodes in the Citus cluster. When you query citus_stat_activity, it goes to every node and gathers the pg_stat_activity views. citus_stat_activity includes all the columns from pg_stat_activity, and we added three extra columns:

global_pid: Citus global process id associated with the query.
nodeid: Citus node id of the node the citus_stat_activity row comes from
is_worker_query: Boolean value, showing if the row is from one of the queries that run on the shards.

Let’s say you have a distributed table tbl with 4 shards and you run an update query on it in a node:

BEGIN;
UPDATE tbl SET b = 100 WHERE a = 0;

You can connect to any node and use citus_stat_activity to find info about the UPDATE query.

SELECT global_pid, nodeid, is_worker_query, query
FROM citus_stat_activity
WHERE global_pid = 110000000123;
-[ RECORD 1 ]---+----------------------------------------------------------------------------
global_pid      | 110000000123
nodeid          | 11
is_worker_query | f
query           | UPDATE tbl SET b = 100 WHERE a = 0;
-[ RECORD 2 ]---+----------------------------------------------------------------------------
global_pid      | 110000000123
nodeid          | 2
is_worker_query | t
query           | UPDATE public.tbl_102009 tbl SET b = 100 WHERE (a OPERATOR(pg_catalog.=) 0)

In the output above, you can see:

record 1 is the original query that we ran on node with nodeid 11
record 2 is the query that runs on the shard from node with nodeid 2
both records have the same global_pid
is_worker_query column is true for record 2 and false for the original query, record 1.

Don’t forget, citus_stat_activity includes all the columns of pg_stat_activity, not just the ones we filter for demonstrating here. So, you can find much more information in citus_stat_activity view.

Use citus_dist_stat_activity view to get summarized info on your queries

If you are not interested in every single query from all the nodes in the Citus cluster and only care about the original distributed queries, you can use the citus_dist_stat_activity view.

citus_dist_stat_activity hides the queries that run on the shards from the citus_stat_activity view, so you can find some high level information about your Postgres queries.

SELECT global_pid, nodeid, is_worker_query, query
FROM citus_dist_stat_activity
WHERE global_pid = 110000000123;
-[ RECORD 1 ]---+--------------------------
global_pid      | 110000000123
nodeid          | 11
is_worker_query | f
query           | UPDATE tbl SET b = 100 WHERE a = 0;

As you might have guessed, citus_dist_stat_activity is citus_stat_activity filtered with is_worker_query = false. We created citus_dist_stat_activity because when you are interested in the process as a whole and not each and every process on the shards, the general information that citus_dist_stat_activity provides about the initial queries should be enough.
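Going by that description, the same rows could be fetched from citus_stat_activity with an explicit filter. This is a sketch of the equivalence stated above, not the view's actual definition:

```sql
-- Only the original distributed queries, without the shard-level queries.
SELECT global_pid, nodeid, is_worker_query, query
FROM citus_stat_activity
WHERE is_worker_query = false
  AND global_pid = 110000000123;
```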

Find blocking processes with citus_lock_waits view

You need monitoring the most when something in your Postgres database is blocked. Citus 11 has you covered when your cluster is blocked too. The newly updated citus_lock_waits view shows the queries in your cluster that are waiting for a lock held by another query.

Let’s say you run a DELETE query on the tbl that will be blocked on the previous UPDATE query:

DELETE FROM tbl WHERE a = 0;

You can connect to any node and use citus_lock_waits to find out which query is blocking your new query:

SELECT * FROM citus_lock_waits;
-[ RECORD 1 ]-------------------------+-----------------------------
waiting_gpid                          | 20000000345
blocking_gpid                         | 110000000123
blocked_statement                     | DELETE FROM tbl WHERE a = 0;
current_statement_in_blocking_process | UPDATE tbl SET b = 100 WHERE a = 0;
waiting_nodeid                        | 2
blocking_nodeid                       | 11

The result above shows the UPDATE statement is blocking the DELETE statement.

Once you find the blocking queries you can use citus_stat_activity and citus_dist_stat_activity with the global PIDs from citus_lock_waits to gather more insight.
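For example, a join like the following pulls session details for each blocking query. It is a sketch that relies on the citus_lock_waits columns shown above and on the pg_stat_activity columns (usename, state, query_start) that the activity views are described as including.

```sql
-- For every waiting query, show who is blocking it and since when.
SELECT w.waiting_gpid,
       w.blocking_gpid,
       a.nodeid,
       a.usename,
       a.state,
       a.query_start,
       a.query
FROM citus_lock_waits w
JOIN citus_dist_stat_activity a
  ON a.global_pid = w.blocking_gpid;
```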

Cancel a Postgres query from any Citus node with pg_cancel_backend

After you find out and get more information about the blocking and blocked queries you might decide you need to cancel one of them. Before Citus 11 you needed to go to the node that the query is being run on, and then use pg_cancel_backend with the process id to cancel.

Now in Citus 11 we override the pg_cancel_backend function to accept global PIDs too.

So, good news: things are now easier, and you can cancel queries on your Citus clusters from any node:

SELECT pg_cancel_backend(20000000345);

will cause:

DELETE FROM tbl WHERE a = 0;
ERROR:  canceling statement due to user request

Remember that global PIDs are always big integers and Postgres PIDs are 4-byte integers. The difference in size is how pg_cancel_backend differentiates between a PID and a GPID.

Also, like pg_cancel_backend, Citus 11 overrides pg_terminate_backend to accept global PIDs too. So, you can also terminate queries from different nodes using global PIDs.
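A minimal sketch, reusing the waiting GPID from the lock example above. As in plain Postgres, terminating ends the whole backend session, while canceling only stops the statement that is currently running.

```sql
-- Terminate the backend behind a global PID, from any node.
SELECT pg_terminate_backend(20000000345);
```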

More helper functions

In addition to all the Citus activity and lock views mentioned, we added a few smaller functions to help you monitor your database cluster. The new functions make it easier to get information that can be useful when writing monitoring queries, including:

Get nodename and nodeport information

You can use citus_nodename_for_nodeid and citus_nodeport_for_nodeid to get info about the node with a node id:

SELECT citus_nodename_for_nodeid(11);
 citus_nodename_for_nodeid
---------------------------
 localhost

SELECT citus_nodeport_for_nodeid(11);
 citus_nodeport_for_nodeid
---------------------------
                      9701

You can find both pieces of information in the pg_dist_node table too.
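The equivalent lookup against pg_dist_node, using the same example node id, would look roughly like this:

```sql
-- Name and port of the node with node id 11, straight from the metadata table.
SELECT nodeid, nodename, nodeport
FROM pg_dist_node
WHERE nodeid = 11;
```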

Parse the GPID into nodeid and Postgres pid components

You can use citus_nodeid_for_gpid and citus_pid_for_gpid to parse a GPID.

SELECT citus_nodeid_for_gpid(110000000123);
 citus_nodeid_for_gpid
-----------------------
                     11

SELECT citus_pid_for_gpid(110000000123);
 citus_pid_for_gpid
--------------------
                123

As I mentioned earlier, we tried to make GPIDs human readable, and with the two functions above they are easily machine readable too.

With the functions above you can find out about a node from a GPID like this:

SELECT citus_nodename_for_nodeid(citus_nodeid_for_gpid(110000000123)),
citus_nodeport_for_nodeid(citus_nodeid_for_gpid(110000000123));
citus_nodename_for_nodeid | citus_nodeport_for_nodeid
---------------------------+---------------------------
localhost                 |                      9701
(1 row)

Use citus_coordinator_nodeid to find the coordinator’s node id

Finally, you can use citus_coordinator_nodeid to find the node id of the coordinator node.

SELECT citus_coordinator_nodeid();
 citus_coordinator_nodeid
--------------------------
                        3

With Citus 11 you can monitor from any node in a Citus database cluster

With Citus 11 you can query your distributed Postgres tables from any node by default. And with the tools you learned about in this blog post, you know how to monitor and manage your Citus cluster from any node, like you would do with a single Postgres instance.

If you’re interested in all that changed in Citus 11 check out:

Marco’s Citus 11 blog post, titled: Citus 11 for Postgres goes fully open source, with query from any node
Citus 11.0 updates page
Recording of the Citus 11 release party, our first livestream release party with demos of several of the new features

And if you want to download Citus you can always find the latest download instructions on the website; we’ve put together lots of useful resources on the Getting Started page; you can file issues in the GitHub repo; and if you have questions please join us (and other users in the community) in the Citus Public Slack.

Citus Tips for Postgres: How to alter distribution key, shard count, & more

Halil Ozan Akgul — 2021-05-03T16:22:00+00:00

Citus is an extension to Postgres that lets you distribute your application’s workload across multiple nodes. Whether you are using Citus open source or using Citus as part of a managed Postgres service in the cloud, one of the first things you do when you start using Citus is to distribute your tables. While distributing your Postgres tables you need to decide on some properties such as distribution column, shard count, colocation. And even before you decide on your distribution column (sometimes called a distribution key, or a sharding key), when you create a Postgres table, your table is created with an access method.

Previously you had to decide on these table properties up front, and then you went with your decision. Or if you really wanted to change your decision, you needed to start over. The good news is that in Citus 10, we introduced 2 new user-defined functions (UDFs) to make it easier for you to make changes to your distributed Postgres tables.

Before Citus 9.5, if you wanted to change any of the properties of the distributed table, you would have to create a new table with the desired properties and move everything to this new table. But in Citus 9.5 we introduced a new function, undistribute_table. With the undistribute_table function you can convert your distributed Citus tables back to local Postgres tables and then distribute them again with the properties you wish. But undistributing and then distributing again is… 2 steps. In addition to the inconvenience of having to write 2 commands, undistributing and then distributing again has some more problems:

Moving the data of a big table can take a long time, and undistributing and distributing both require moving all the data of the table. So you must move the data twice, which takes much longer.
Undistributing moves all the data of a table to the Citus coordinator node. If your coordinator node isn’t big enough, and coordinator nodes typically don’t have to be, you might not be able to fit the table into your coordinator node.
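For contrast, the two-step approach described above looks roughly like this. The events table and its columns are hypothetical and only serve as an illustration.

```sql
-- Hypothetical table, initially distributed by event_id.
CREATE TABLE events (event_id BIGINT, tenant_id BIGINT, payload TEXT);
SELECT create_distributed_table('events', 'event_id');

-- Step 1: move all the data back to the coordinator as a local Postgres table.
SELECT undistribute_table('events');

-- Step 2: distribute again, this time by tenant_id, moving all the data a second time.
SELECT create_distributed_table('events', 'tenant_id');
```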

So, in Citus 10, we introduced 2 new functions to reduce the steps you need to make changes to your tables:

alter_distributed_table
alter_table_set_access_method

In this post you’ll find some tips about how to use the alter_distributed_table function to change the shard count, distribution column, and the colocation of a distributed Citus table. And we’ll show how to use the alter_table_set_access_method function to change, well, the access method. An important note: you may not ever need to change your Citus table properties. We just want you to know, if you ever do, you have the flexibility. And with these Citus tips, you will know how to make the changes.

Altering the properties of distributed Postgres tables in Citus

When you distribute a Postgres table with the create_distributed_table function, you must pick a distribution column and set the distribution_column parameter. During the distribution, Citus uses a configuration variable called citus.shard_count for deciding the shard count of the table. You can also provide the colocate_with parameter to pick a table to colocate with (or colocation will be done automatically, if possible).

However, after the distribution if you decide you need to have a different configuration, starting from Citus 10, you can use the alter_distributed_table function.

alter_distributed_table has three parameters you can change:

distribution column
shard count
colocation properties

How to change the distribution column (aka the sharding key)

Citus divides your table into shards based on the distribution column you select while distributing. Picking the right distribution column is crucial for having a good distributed database experience. A good distribution column will help you parallelize your data and workload better by dividing your data evenly and keeping related data points close to each other. However, choosing the distribution column might be a bit tricky when you’re first getting started. Or perhaps later as you make changes in your application, you might need to pick another distribution column.

With the distribution_column parameter of the new alter_distributed_table function, Citus 10 gives you an easy way to change the distribution column.

Let’s say you have customers and orders that your customers make. You will create some Postgres tables like these:

CREATE TABLE customers (customer_id BIGINT, name TEXT, address TEXT);
CREATE TABLE orders (order_id BIGINT, customer_id BIGINT, products BIGINT[]);

When first distributing your Postgres tables with Citus, let’s say that you decided to distribute the customers table on customer_id and the orders table on order_id.

SELECT create_distributed_table ('customers', 'customer_id');
SELECT create_distributed_table ('orders', 'order_id');

Later you might realize distributing the orders table by the order_id column might not be the best idea. Even though order_id could be a good column to evenly distribute your data, it is not a good choice if you frequently need to join the orders table with the customers table on the customer_id. When both tables are distributed by customer_id you can use colocated joins, which are very efficient compared to joins on other columns.

So, if you decide to change the distribution column of the orders table to customer_id, here is how you do it:

SELECT alter_distributed_table ('orders', distribution_column := 'customer_id');

Now the orders table is distributed by customer_id. So, the customers and the orders of the customers are in the same node and close to each other, and you can have fast joins and foreign keys that include the customer_id.

You can see the new distribution column on the citus_tables view:

SELECT distribution_column FROM citus_tables WHERE table_name::text = 'orders';

How to increase (or decrease) the shard count in Citus

The shard count of a distributed Citus table is the number of pieces the distributed table is divided into. Choosing the shard count is a balance between the flexibility of having more shards, and the overhead for query planning and execution across the shards. Like the distribution column, the shard count is also set while distributing the table. If you want to pick a different shard count than the default for a table, during the distribution process you can use the citus.shard_count configuration variable, like this:

CREATE TABLE products (id BIGINT, name TEXT);
SET citus.shard_count TO 20;
SELECT create_distributed_table ('products', 'id');

After distributing your table, you might decide the shard count you set was not the best option. Or your first decision on the shard count might be good for a while but your application might grow in time, you might add new nodes to your Citus cluster, and you might need more shards. The alter_distributed_table function has you covered in the cases that you want to change the shard count too.

To change the shard count you just use the shard_count parameter:

SELECT alter_distributed_table ('products', shard_count := 30);

After the query above, your table will have 30 shards. You can see your table’s shard count on the citus_tables view:

SELECT shard_count FROM citus_tables WHERE table_name::text = 'products';

How to colocate with a different Citus distributed table

When two Postgres tables are colocated in Citus, the rows of the tables that have the same value in the distribution column will be on the same Citus node. Colocating the right tables will help you with better relational operations. Like the shard count and the distribution column, the colocation is also set while distributing your tables. You can use the colocate_with parameter to change the colocation.

SELECT alter_distributed_table ('products', colocate_with := 'customers');

Again, like the distribution column and shard count, you can find information about your tables’ colocation groups on the citus_tables view:

SELECT colocation_id FROM citus_tables WHERE table_name IN ('products', 'customers');

You can also use the default and none keywords with the colocate_with parameter to change the colocation group of the table to default, or to break any colocation your table has.
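A short sketch of the two keywords, applied to the products table from the earlier examples:

```sql
-- Break any colocation the products table currently has.
SELECT alter_distributed_table('products', colocate_with := 'none');

-- Move the products table into the default colocation group.
SELECT alter_distributed_table('products', colocate_with := 'default');
```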

To colocate distributed Citus tables, the distributed tables need to have the same shard counts. But if the tables you want to colocate don’t have the same shard count, worry not, because alter_distributed_table will automatically understand this. Then your table’s shard count will also be updated to match the new colocation group’s shard count.

How to change more than one Citus table property at a time

Here is a tip! If you want to change multiple properties of your distributed Citus tables at the same time, you can simply use multiple parameters of the alter_distributed_table function.

For example, if you want to change both the shard count and the distribution column of a table here’s how you do it:

SELECT alter_distributed_table ('products', distribution_column := 'name', shard_count := 35);
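
To confirm that both changes took effect, you could read the two properties back from the citus_tables view in a single query. A small sketch, reusing the products table from above:

```sql
-- Read back the distribution column and shard count after the change
SELECT table_name, distribution_column, shard_count
FROM citus_tables
WHERE table_name = 'products'::regclass;
```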

How to alter the Citus colocation group

If your Postgres table is colocated with some other tables and you want to change the shard count of all of the tables to keep the colocation, you might be wondering if you have to alter them one by one… which is multiple steps.

You can see a pattern here: the Citus tip is that you can use the alter_distributed_table function to change the properties of the whole colocation group.

To apply a change to every table colocated with the table you are altering, set the cascade_to_colocated parameter to true:

SET citus.shard_count TO 10;
SELECT create_distributed_table ('customers', 'customer_id');
SELECT create_distributed_table ('orders', 'customer_id', colocate_with := 'customers');
-- when you decide to change the shard count
-- of all of the colocation group
SELECT alter_distributed_table ('customers', shard_count := 20, cascade_to_colocated := true);

You can see the updated shard count of both tables on the citus_tables view:

SELECT shard_count FROM citus_tables WHERE table_name IN ('customers', 'orders');

How to change your Postgres table’s access method in Citus

Another amazing feature introduced in Citus 10 is columnar storage. This Citus 10 columnar blog post walks you through how it works and how to use columnar tables (or partitions) with Citus—complete with a Quickstart. Oh, and Jeff made a short video demo about the new Citus 10 columnar functionality too—it’s worth the 13 minutes to watch IMHO.

With Citus columnar, you can optionally choose to store your tables grouped by columns, which gives you the benefits of compression, too. Of course, you don’t have to use the new columnar access method: the default access method is "heap", and if you don’t specify an access method, your tables will be row-based tables (with the heap access method).

It would not be fair to introduce this cool new Citus columnar access method without also giving you a way to convert your tables to columnar. So Citus 10 also introduced a way to change the access method of tables.

SELECT alter_table_set_access_method('orders', 'columnar');

You can use alter_table_set_access_method to convert your table to any other access method too, such as heap, Postgres’s default access method. Your table doesn’t even need to be a distributed Citus table: alter_table_set_access_method works with Citus reference tables as well as regular Postgres tables. You can even change the access method of a single Postgres partition.
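
Here is a sketch of those variations. The orders table comes from the example above and is assumed not to be partitioned. The events_2020 partition is a hypothetical name used only for illustration. The final query reads the access method from the standard Postgres catalogs, so it works for any kind of table.

```sql
-- Convert the orders table back to row-based storage
SELECT alter_table_set_access_method('orders', 'heap');

-- Convert a single partition to columnar
-- (events_2020 is a hypothetical partition of a partitioned table)
SELECT alter_table_set_access_method('events_2020', 'columnar');

-- Check the current access method of both tables
SELECT c.relname, am.amname AS access_method
FROM pg_class c
JOIN pg_am am ON am.oid = c.relam
WHERE c.oid IN ('orders'::regclass, 'events_2020'::regclass);
```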

Under the hood: How do these new Citus functions work?

If you’ve read the blog post about undistribute_table, the function Citus 9.5 introduced for turning distributed Citus tables back to local Postgres tables, you mostly know how the alter_distributed_table and alter_table_set_access_method functions work, because they use the same underlying methodology as the undistribute_table function, with some improvements.

The alter_distributed_table and alter_table_set_access_method functions:

Create a new table in the way you want (with the new shard count or access method etc.)
Move everything from your old table to the new table
Drop the old table and rename the new one

Dropping a table for the purpose of re-creating the same table with different properties is not a simple task. Dropping the table will also drop many things that depend on the table.

Just like the undistribute_table function, the alter_distributed_table and alter_table_set_access_method functions do a lot to preserve the properties of the table you didn’t want to change. The functions will handle indexes, sequences, views, constraints, table owner, partitions and more—just like undistribute_table.

alter_distributed_table and alter_table_set_access_method will also recreate the foreign keys on your tables whenever possible. For example, if you change the shard count of a table with the alter_distributed_table function and use cascade_to_colocated := true to change the shard count of all the colocated tables, then foreign keys within the colocation group and foreign keys from the distributed tables of the colocation group to Citus reference tables will be recreated.
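
If you want to check for yourself that the foreign keys survived, one approach is to list them from the Postgres catalog before and after the change and compare the results. A sketch, assuming the customers and orders tables from the earlier cascade_to_colocated example:

```sql
-- List foreign keys that involve the customers or orders tables;
-- run this before and after alter_distributed_table and compare
SELECT conname,
       conrelid::regclass  AS referencing_table,
       confrelid::regclass AS referenced_table
FROM pg_constraint
WHERE contype = 'f'
  AND (conrelid IN ('customers'::regclass, 'orders'::regclass)
       OR confrelid IN ('customers'::regclass, 'orders'::regclass))
ORDER BY conname;
```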

Making it easier to experiment with Citus—and to adapt as your needs change

If you want to learn more about the previous work that the alter_distributed_table and alter_table_set_access_method functions build on, go check out our blog post on undistribute_table.

In Citus 10 we worked to give you more tools and more capabilities for making changes to your distributed database. When you’re just starting to use Citus, the new alter_distributed_table and alter_table_set_access_method functions—along with the undistribute_table function—are all here to help you experiment and find the database configuration that works the best for your application. And in the future, if and when your application evolves, these three Citus functions will be ready to help you evolve your Citus database, too.

This article was originally published on citusdata.com.

Citus Tips: How to undistribute a distributed Postgres table

Halil Ozan Akgul — 2021-02-06T17:35:00+00:00

Once you start using the Citus extension to distribute your Postgres database, you may never want to go back. But what if you just want to experiment with Citus and want to have the comfort of knowing you can go back? Well, as of Citus 9.5, now there is a new undistribute_table() function to make it easy for you to, well, to revert a distributed table back to being a regular Postgres table.

If you are familiar with Citus, you know that Citus is an open source extension to Postgres that distributes your data (and queries) to multiple machines in a cluster—thereby parallelizing your workload and scaling your Postgres database horizontally. When you start using Citus—whether you’re using Citus open source or whether you’re using Citus as part of a managed service in the cloud—usually the first thing you need to do is distribute your Postgres tables across the cluster.

What is undistribute_table()?

To distribute your Postgres tables with the create_distributed_table() function of Citus, you first need to make some decisions, such as: which column to choose as the distribution column, how many shards you need, and which Postgres tables you need to distribute.

If you just want to try different settings and go back when you want to, you're now in luck. Our Citus team introduced the undistribute_table() function in the Citus 9.5 release—enabling you to turn distributed Citus tables back into regular Postgres tables.

If you are one of the Citus users who has asked for the ability to undistribute your Citus tables—like in the request below from Matt Watson of Stackify—we hope this new feature will help you.

"Also, is there a way to convert a distributed table to not being distributed? I could then change it back to distributed and fix my colocate… without having to drop the table."

The new undistribute_table() function will:

return all the data of a distributed table from the Citus worker nodes back to the Citus coordinator node,
remove all the shards of the distributed table from the Citus workers,
make the previously distributed table a local Postgres table on the Citus coordinator node

Here is the simplest code example of going distributed with Citus and coming back:

-- First distribute your table
SELECT create_distributed_table ('my_table', 'id');
-- Now your table has shards on the worker nodes and any data that was in the table is distributed to those shards.

-- To go back to local, just call the undistribute_table function with your table as parameter
SELECT undistribute_table('my_table');
-- Now your table is only on the coordinator node just like before you distributed.

Undistributing a Citus table is as simple as the one line of SQL code in the code block above.

Note that when you distribute a Postgres table with Citus you need to pass the distribution column into the create_distributed_table() function—but when undistributing, the only parameter you need to pass into the undistribute_table() function is the table name itself.

After undistributing, the distribution column becomes a regular column. If in the future, you want to distribute your Postgres table again, you can just pick another distribution column (or use the same one).
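
As a sketch of that round trip, you might undistribute the table, confirm it is no longer in the Citus metadata, and then distribute it again on a different column. The other_id column is a hypothetical name, not part of the earlier example.

```sql
-- Turn the distributed table back into a local Postgres table
SELECT undistribute_table('my_table');

-- pg_dist_partition lists a table only while it is distributed,
-- so no row is expected here once the table is local
SELECT logicalrelid
FROM pg_dist_partition
WHERE logicalrelid = 'my_table'::regclass;

-- Distribute again, this time on another column
-- (other_id is a hypothetical column of my_table)
SELECT create_distributed_table('my_table', 'other_id');
```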

In the past, before we introduced the undistribute_table() function in Citus 9.5, if you wanted to turn a distributed table back into a local table, you would have had to create a new Postgres table on your coordinator node. Then, you would have needed to move all the data from the distributed table to this new local table. However, Citus did not have an easy way to move data from distributed Citus tables to local Postgres tables so you would have had to do some workarounds. Let me explain:

The usefulness of INSERT INTO local SELECT .. FROM distributed

To undistribute a table, distributed data needs to be moved back to the Citus coordinator from all the shards in the cluster. But prior to the Citus 9.4 release, Citus did not support queries that SELECT from distributed tables and INSERT into local tables. So, there was a need to implement support for:

INSERT INTO local_table SELECT * FROM distributed_table;

In fact, the INSERT INTO local SELECT .. FROM distributed feature was introduced in Citus 9.4 to make the undistribute_table() function possible.

Other than being necessary for undistributing tables, inserting distributed data into local tables has other useful applications.

Rollup Tables

A rollup table in Postgres is a table that you pre-aggregate your data into. Before we introduced INSERT INTO local SELECT .. FROM distributed in Citus 9.4, you could still have rollup tables. (And many of you did!) But your rollup tables had to be distributed tables, which may not have been the best option in every case, especially if your rollup table was very small.

Let me give you an example.

Let's say you have a distributed table and a graph that shows some daily statistics of the data on that table. Instead of calculating the statistics from scratch every time you open the graph, you can now create a local Postgres table on the Citus coordinator that you will roll up into. Every night, you can calculate the statistics value for the day and insert the result of the calculations into the rollup table. When you open the graph, the data will be readily available.

-- Every midnight, roll up the day that just ended
INSERT INTO rollup_table SELECT your_analysis_function(statistics_column) FROM distributed_table WHERE date = CURRENT_DATE - 1;

-- When you need the graph
SELECT * FROM rollup_table;
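
To make the idea more concrete, here is one way the rollup could look with an explicit table definition. This is only a sketch: the daily_stats table and its columns are hypothetical, statistics_column is assumed to be numeric, and the job is assumed to run shortly after midnight so that it aggregates the previous day.

```sql
-- Local rollup table on the coordinator (hypothetical name and columns)
CREATE TABLE daily_stats (
    day       date PRIMARY KEY,
    row_count bigint,
    avg_value numeric
);

-- Nightly job: aggregate yesterday's rows from the distributed table
INSERT INTO daily_stats (day, row_count, avg_value)
SELECT date, count(*), avg(statistics_column)
FROM distributed_table
WHERE date = CURRENT_DATE - 1
GROUP BY date;

-- When you need the graph: one small local row per day
SELECT day, row_count, avg_value
FROM daily_stats
ORDER BY day;
```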

ETL in the Database

ETL (Extract, Transform, Load) is the process of gathering data from a data source, transforming the data into a more meaningful form, and then storing the transformed data. Imagine running an online store, and imagine you have a distributed table for customer data and another distributed table for purchases the customers made. What if you need to find the best 100 customers and send them e-mails about a special discount for the top customers?

With the new INSERT INTO local SELECT .. FROM distributed feature and the ETL logic, you can create a local Postgres table for your best customers.

-- Create the table for the top customers
CREATE TEMP TABLE top_customers (customer_id bigint primary key, email text, total_purchase money);

-- Find the best customers and put their data into the top_customers table
INSERT INTO top_customers
SELECT customer_id, email, total_purchase
FROM customers JOIN
(
  SELECT sum(amount) AS total_purchase, customer_id
  FROM purchases
  GROUP BY customer_id
) total_purchases ON customers.id = total_purchases.customer_id
ORDER BY total_purchase DESC
LIMIT 100;

-- Load the top customer IDs back into the distributed table
UPDATE customers SET is_top_customer = true WHERE id IN (SELECT customer_id FROM top_customers);

Increased support for INSERT SELECT in Citus

As of Citus 9.4 any INSERT SELECT command works!

The logic for INSERT INTO local SELECT .. FROM distributed queries is quite similar to the logic for SELECT .. FROM distributed. When you just want to get the distributed data with SELECT, Citus will:

gather data from the Citus distributed worker nodes
combine the data, on the Citus coordinator node
return the combined data back to you

If you want to INSERT INTO local SELECT .. FROM distributed, Citus does all the steps the same way, except for the last one. In the last step, instead of returning the combined data to you, Citus inserts the data into the local Postgres table on the Citus coordinator node.
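
One way to look at the difference yourself is to compare the query plans of the two statements with EXPLAIN. This sketch reuses the local_table and distributed_table names from the example above. The exact plan text depends on your Citus version, so no output is shown here.

```sql
-- Plan for reading the distributed data back to the client
EXPLAIN SELECT * FROM distributed_table;

-- Plan for writing the same data into a local table on the coordinator
EXPLAIN INSERT INTO local_table SELECT * FROM distributed_table;
```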

After all the engineering effort, it would be selfish to keep the INSERT INTO local SELECT .. FROM distributed feature just for internal use. So, we added support for the feature in Citus 9.4.

What does the Citus undistribute_table() function do, under the hood?

So as of Citus 9.4, with help from the new INSERT INTO local SELECT .. FROM distributed feature, you could undistribute your tables manually, if you needed to revert. To undistribute Citus tables manually, you used to have to:

Create a new Postgres table
Use INSERT .. SELECT to copy everything from your old, distributed table into the new Postgres table
Drop the old table and rename the new table
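
Written out as SQL, the manual steps could look roughly like the sketch below. The my_table_local name is hypothetical, and this is an illustration of the manual workaround, not what undistribute_table() runs internally.

```sql
-- 1. Create a new local Postgres table with the same columns
CREATE TABLE my_table_local (LIKE my_table);

-- 2. Copy everything from the distributed table into the local table
INSERT INTO my_table_local SELECT * FROM my_table;

-- 3. Drop the old distributed table and take over its name
DROP TABLE my_table;
ALTER TABLE my_table_local RENAME TO my_table;
```

Note that LIKE without any options copies only the column definitions and NOT NULL constraints, and that DROP TABLE without CASCADE fails if other objects, such as views, depend on the table.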

That might seem easy enough, but that's not all. Some of the things you might have also had to deal with:

keep the Postgres indexes you had on the old distributed Citus table in mind…
create partitions again, if your table was a partitioned table…
deal with the fact that while dropping your table, you also dropped any views that depended on your distributed table—and the views that depended on those views, too.

The good news is that as of Citus 9.5, you can use the new undistribute_table() function and let Citus handle everything. Specifically, when you use the undistribute_table() function, Citus automatically:

recreates the indexes you had on the distributed table,
handles sequences owned by the table so they continue from where you left off,
recursively finds the views that directly or indirectly depend on your table and moves them to the new Postgres table,
preserves constraints, and the table owner,
if your table was a partitioned table, does all these steps for the partitions,
and more…
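
If you would like to try this out, a small experiment along the following lines may help. It assumes my_table is currently distributed and has id and value columns. The value column, the index name, and the view name are all hypothetical.

```sql
-- Add an index and a view on the distributed table
CREATE INDEX my_table_value_idx ON my_table (value);
CREATE VIEW my_table_view AS SELECT id, value FROM my_table;

-- Undistribute the table
SELECT undistribute_table('my_table');

-- Check that the index and the view are still there afterwards
SELECT indexname FROM pg_indexes WHERE tablename = 'my_table';
SELECT viewname FROM pg_views WHERE viewname = 'my_table_view';
```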

Bottom line: undistribute_table() makes it easier to experiment with Citus distributed tables

Hopefully it's interesting to know a bit more about why our Citus team introduced the INSERT INTO local SELECT .. FROM distributed feature in Citus 9.4—and the undistribute_table() function in Citus 9.5.

The most important thing to know is that distributing a Postgres table with Citus is not a one-way street. It's easy to go back and to undistribute a Citus table. So if you want to get started with Citus, it's now easier to experiment—as long as you're running Citus 9.5 or later. After downloading the Citus open source packages—or provisioning a Hyperscale (Citus) server group on Azure—you can distribute your tables or make your tables reference tables and then undistribute back to local Postgres tables—and find what data model works best for you and your application. And if you change your mind later, you can just undistribute again.

This article was originally published on citusdata.com.