POSETTE 2024 is a wrap! 💯 Thanks for joining the fun! Missed it? Watch all 42 talks online 🍿
POSETTE 2024 is a wrap! 💯 Thanks for joining the fun! Missed it? Watch all 42 talks online 🍿
Written by Halil Ozan Akgul
February 6, 2021
This post by Halil Ozan Akgul about Citus was originally published on the Azure Database for PostgreSQL Blog on Microsoft TechCommunity.
Update in October 2022: Citus has a new home on Azure! The Citus database is now available as a managed service in the cloud as Azure Cosmos DB for PostgreSQL. Azure documentation links have been updated throughout the post, to point to the new Azure docs.
Once you start using the Citus extension to distribute your Postgres database, you may never want to go back. But what if you just want to experiment with Citus and want to have the comfort of knowing you can go back? Well, as of Citus 9.5, now there is a new undistribute_table()
function to make it easy for you to, well, to revert a distributed table back to being a regular Postgres table.
If you are familiar with Citus, you know that Citus is an open source extension to Postgres that distributes your data (and queries) to multiple machines in a cluster—thereby parallelizing your workload and scaling your Postgres database horizontally. When you start using Citus—whether you’re using Citus open source or whether you’re using Citus as part of a managed service in the cloud—usually the first thing you need to do is distribute your Postgres tables across the cluster.
To distribute your Postgres tables with the create_distributed_table() function of Citus, you first need to make some decisions, such as: which column to choose as the distribution column, how many shards you need, and which Postgres tables you need to distribute.
If you just want to try different settings and go back when you want to, you're now in luck. Our Citus team introduced the undistribute_table()
function in the Citus 9.5 release—enabling you to turn distributed Citus tables back into regular Postgres tables.
If you are one of the Citus users who has asked for the ability to undistribute your Citus tables—like in the request below from Matt Watson of Stackify—we hope this new feature will help you.
Also, is there a way to convert a distributed table to not being distributed? I could then change it back to distributed and fix my colocate… without having to drop the table.
The new undistribute_table()
function will:
Here is the simplest code example of going distributed with Citus and coming back:
-- First distribute your table
SELECT create_distributed_table ('my_table', 'id');
-- Now your table has shards on the worker nodes and any data that was in the table is distributed to those shards.
-- To go back to local, just call the undistribute_table function with your table as parameter
SELECT undistribute_table('my_table');
-- Now your table is only on the coordinator node just like before you distributed.
Undistributing a Citus table is as simple as the one line of SQL code in the code block above.
Note that when you distribute a Postgres table with Citus you need to pass the distribution column into the create_distributed_table()
function—but when undistributing, the only parameter you need to pass into the undistribute_table()
function is the table name itself.
After undistributing, the distribution column becomes a regular column. If in the future, you want to distribute your Postgres table again, you can just pick another distribution column (or use the same one).
In the past, before we introduced the undistribute_table() function in Citus 9.5, if you wanted to turn a distributed table back into a local table, you would have had to create a new Postgres table on your coordinator node. Then, you would have needed to move all the data from the distributed table to this new local table. However, Citus did not have an easy way to move data from distributed Citus tables to local Postgres tables so you would have had to do some workarounds. Let me explain:
To undistribute a table, distributed data needs to be moved back to the Citus coordinator from all the shards in the cluster. But prior to the Citus 9.4 release, Citus did not support queries that SELECT from distributed tables and INSERT into local tables. So, there was a need to implement support for:
INSERT INTO local_table SELECT * FROM distributed_table;
In fact, the INSERT INTO local SELECT .. FROM distributed
feature was introduced in Citus 9.4 to make the undistribute_table()
function possible.
Other than being necessary for undistributing tables, inserting distributed data into local tables has some more beneficial use cases.
A rollup table in Postgres is a table that you pre-aggregate your data into. Before we introduced INSERT INTO local SELECT .. FROM distributed
in Citus 9.4, you could still have rollup tables. (And many of you did!) But your rollup tables had to be distributed tables, which may not have been the best option in every case. Especially if your rollup table was a very small table.
Let me give you an example.
Let's say you have a distributed table and a graph that shows some daily statistics of the data on that table. Instead of calculating the statistics from scratch every time you open the graph, you can now create a local Postgres table on the Citus coordinator that you will rollup into. Every night, you can calculate the statistics value for the day and insert the result of the calculations to the rollup table. When you open the graph, the data will be readily available.
-- Every midnight
INSERT INTO rollup_table SELECT your_analysis_function(statistics_column) FROM distributed_table WHERE date = CURRENT_DATE;
-- When you need the graph
SELECT * FROM rollup_table;
ETL (Extract, Transform, Load) is the process of gathering data from a data source, transforming the data into a more meaningful form, and then storing the transformed data. Imagine running an online store, and imagine you have a distributed table for customer data and another distributed table for purchases the customers made. What if you need to find the best 100 customers and send them e-mails about a special discount for the top customers?
With the new INSERT INTO local SELECT .. FROM distributed
feature and the ETL logic, you can create a local Postgres table for your best customers.
-- Create the table for the top customers
CREATE TEMP TABLE top_customers (customer_id bigint primary key, email text, total_purchase money);
-- Find the best customers and put their data into the top_customers table
INSERT INTO top_customers
SELECT customer_id, email, total_purchase
FROM customers JOIN
(
SELECT sum(amount) AS total_purchase, customer_id
FROM purchases
GROUP BY customer_id
) total_purchases ON customers.id = total_purchases.customer_id
ORDER BY total_purchase DESC
LIMIT 100;
-- Load the top customer IDs back into the distributed table
UPDATE customers SET is_top_customer = true WHERE id IN (SELECT customer_id FROM top_customers);
As of Citus 9.4 any INSERT SELECT command works!
The logic for INSERT INTO local SELECT .. FROM distributed
queries is quite similar to the logic for SELECT .. FROM distributed
. When you just want to get the distributed data with SELECT, Citus will:
If you want to INSERT INTO local SELECT .. FROM distributed
, Citus does all the steps the same way, except for the last one. In the last step, instead of returning the combined data to you, Citus inserts the data to the local Postgres table on the Citus coordinator node.
After all the engineering effort, it would be selfish to keep the INSERT INTO local SELECT .. FROM distributed
feature just for internal use. So, we added support for the feature in Citus 9.4.
So as of Citus 9.4, with help from the new INSERT INTO local SELECT .. FROM distributed
feature, you could undistribute your tables manually, if you needed to revert. To undistribute Citus tables manually, you used to have to:
That might seem easy enough, but that's not all. Some of the things you might have also had to deal with:
The good news is that as of Citus 9.5 or later—you can now use the new undistribute_table()
function and let Citus seamlessly handle everything. Specifically, when you use the undistribute_table()
function, Citus automatically:
Hopefully it's interesting to know a bit more about why our Citus team introduced the INSERT INTO local SELECT .. FROM distributed
feature in Citus 9.4—and the undistribute_table() function in Citus 9.5.
The most important thing to know is that distributing a Postgres table with Citus is not a one-way street. It's easy to go back and to undistribute a Citus table. So if you want to get started with Citus, it's now easier to experiment—as long as you're running Citus 9.5 or later. After downloading the Citus open source packages—or provisioning a Hyperscale (Citus) server group on Azure—you can distribute your tables or make your tables reference tables and then undistribute back to local Postgres tables—and find what data model works best for you and your application. And if you change your mind later, you can just undistribute again.