Using custom types with Citus & Postgres, from popular hack to transparent feature

Written by Nils Dijk
June 25, 2020

Custom types—called user-defined types in the PostgreSQL docs—are a powerful Postgres capability that, just like Postgres extensions, were envisioned from Day One in the original design of Postgres. Published in 1985, the Design of Postgres paper stated the 2nd design goal as: “provide user extendibility for data types, operators and access methods.”

It’s kind of cool that the creators of Postgres laid the foundation for the powerful Postgres extensions of today (like PostGIS for geospatial use cases, Citus for scaling out Postgres horizontally, pg_partman for time-based partitioning, and so many more Postgres extensions) way back in 1985 when the design of Postgres paper was first published.

Design of Postgres paper
Page 1 of the original Design of Postgres paper, ~1985, by Michael Stonebraker and Lawrence Rowe

This post is not supposed to be a history lesson in Postgres though, rather it’s about how:

  • custom types (aka user-defined types), and
  • Postgres extensions
  • Postgres functions & procedures

work with Citus—and more generally, how Citus handles dependency propagation in a distributed Postgres database cluster.

For those of you who are new to the Citus blog, Citus is an open source extension to Postgres that transforms Postgres into a distributed database. Developers use Citus to scale out Postgres horizontally when they need more performance or more scale than they can get from a single server—for use cases like HTAP workloads, real-time analytics dashboards, multi-tenant SaaS SaaS applications, time series workloads, and high-throughput transactional applications.

Why custom types matter

If you have already read The Design of Postgres paper and are looking for more background on how Postgres came to be—and specifically why things like custom types have helped fuel the usefulness of Postgres—you might want to listen to Michael Stonebraker’s acceptance speech for his Turing award, where he explains why custom types were such an important design consideration for Postgres. The bit about the genesis of custom types starts around 15:30, but honestly the whole talk is good.

Or as Postgres committer Thomas Munro puts it:

“The idea that custom types can be fully integrated with the Postgres database, used with indexes, understood by the PostgreSQL query planner—it’s pretty special. Most other databases don’t have something as good as this.”

When you think about the benefits of custom types in Postgres, it’s useful to note that custom types are not just about data types. What I mean is, custom types also are a key enabler to Postgres extensions. Think about it: Postgres extensions like PostGIS and HyperLogLog (a super useful sketch algorithm, often referred to as HLL) allow you to use powerful, specialized algorithms in Postgres—and in turn, PostGIS and HLL use custom types to define the data structures on which those algorithms operate. So in many ways, you can't have powerful Postgres extensions without also having custom types.

About composite types in Postgres

In your application, defining a custom type can be useful to capture a bunch of things that belong together. For example, in Postgres you can combine latitude and longitude to create a composite type of a point.

Similarly, a common composite type in Postgres is stdaddr, which consists of the element of an address. At first glance you might think that’s just a street name, number, city, country, and postcode but of course you’ll quickly realize there are more components than that.

Why ENUM types are so useful in Postgres

If you have a list of defined/acceptable values for a thing, then you can create a custom type in Postgres that is an enumerated type, called an ENUM type. With an ENUM data type, you need to define the type first, and then you can use it when you go to create your tables.

Let’s use an example of data centers.

Let’s say your application needs to track (among other things) which data center your servers are located in, and you want to make sure that the data center information is correct (not misspelled, not incorrectly entered, etc.)

One way to impose that is by using database constraints; another way is to create a custom data type in Postgres, with type ENUM.

Samay Sharma leads our solutions engineering team and often recommends custom ENUM types to Postgres users—because in addition to ensuring that the field’s values are correct—the ENUM custom types can often use less storage space. For example, let’s say the names of your data centers is a ~30 character text field and your database has 100 million rows—storing this 30 character field in 100 million rows would take up 2.88 GB of disk space.

  • 30 character string is 31 bytes per row
  • for 100M rows, 30 characters: 31100106/10243 = 2.88GB

But with an ENUM type, Postgres will store the 3 datacenters as 1, 2, or 3 in each row of the table—which takes up only 0.373 GB of storage space.

  • enum with 3 values is 4 bytes per row
  • for 100M rows w/enum w/3 values: 4100106/10243 = 0.373 GB

Side-note on calculating the sizes of the 30 character strings, and ENUM types

Here is the SQL I used to calculate the size of a 30 character string, it’s a bit of a neat trick, I thought it might be useful to share:

SELECT pg_column_size('the text is 30 characters long');
┌────────────────┐
 pg_column_size 
├────────────────┤
             31 
└────────────────┘
(1 row)

Time: 8.758 ms

And here is the SQL I used to calculate the storage space needed by the ENUM custom type with 3 values:

CREATE TYPE myenum AS ENUM ('foo', 'bar', 'baz');
CREATE TYPE
Time: 33.877 ms

SELECT pg_column_size('foo'::myenum);
┌────────────────┐
 pg_column_size 
├────────────────┤
              4 
└────────────────┘
(1 row)

Time: 0.804 ms

run_command_on_workers was a necessary friend prior to Citus 9.0

The fact that every node in a Citus database cluster is a fully capable PostgreSQL server means you can do almost anything you can do with Postgres, with a little extra effort.

For a long time, Citus users have been able to use custom types—and Postgres extensions—in their distributed tables on Citus by manually creating the custom types (or the Postgres extensions) on each of the Citus worker nodes, using the Citus run_command_on_workers command.

Prior to Citus 9.0, you would have to run the run_command_on_workers command—as in the SQL example below—on the Citus coordinator node, in order to manually propagate custom Postgres types to all of the worker nodes in your cluster:

SELECT run_command_on_workers($$
    CREATE TYPE datacenter AS ENUM ('datacenter-east-americas-DCA1', 
    'datacenter-west-americas-SJC1', 'datacenter-central-europe-AMS1');
$$);

If you're wondering what the $$ characters are doing in the SQL above, $$ is a marker that Postgres uses to find the begin and end of the string.

However if you need to include a string literal in the SQL command, you will need to use backquotes to escape special characters such as the ' (single quote) characters, which would make the SQL really hard to read. By using the $$ markers, you can more easily read the SQL you are passing in if the SQL contains a string, as the example does above.

The general syntax is $bringyourownmarker$ ... $bringyourownmarker$, it just so happens that your marker can be an empty string, meaning $$ ... $$ is also valid.

And—prior to Citus 9.0—if you didn’t manually propagate the custom Postgres type onto the worker nodes in your Citus cluster, you would get an error when you tried to use that custom type:

SELECT create_distributed_table('servers','server_id');
ERROR:  type "public.datacenter" does not exist
CONTEXT:  while executing command on localhost:9701
Time: 39.375 ms

And even after you got your custom types (and/or Postgres extension) working on all the nodes in the Citus cluster thanks to run_command_on_workers (example above), things could still get a little tricky later, if you ever wanted to add a node to the Citus cluster.

When you try to add a node, if you were using reference tables, the new Citus worker node could not be successfully added until reference table was copied over. But if the reference table relied on a custom type in Postgres, you would have a bit of a “catch-22”, because Citus couldn’t create the reference table copy on the new worker node, because the new node didn’t have the custom type. The solution was to file a support ticket, so we could help to propagate the user-defined types to the new worker manually.

Now in Citus 9.x, custom types and Postgres extensions are automatically propagated

The good news is that with Citus 9.0 and 9.1, custom types and Postgres extensions (and Postgres functions and procedures, too) are automatically propagated to new and existing worker nodes—without any special steps on your part. It just works. You no longer need to run_command_on_worker. You no longer will encounter ‘type does not exist’ ERRORs. You will no longer have to get help to manually propagate your custom types and your Postgres extensions to the rest of the Citus database cluster. These headaches are now gone.

In short this new dependency propagation feature in Citus (which applies to custom types, Postgres extensions, functions, and procedures) makes operating your Citus cluster oh so much easier.

Of course, the pain of encountering ‘type does not exist’ ERRORs and having to run_command_on_workers in order to manually propagate these custom types was in no way as bad as trying to shard Postgres yourself.

With or without this automated dependency propagation feature added in Citus 9.0 and Citus 9.1, Citus takes away a lot of your headaches when it comes to scaling Postgres. This Windows telemetry petabyte-scale story is an awesome example of how teams are using Citus and Postgres to scale to sizes unimaginable a decade ago.

Bottom line, in our recent Citus releases such as Citus 9.2, focused on HTAP workloads as well as Citus 9.3, we continue to pay attention to operational issues to make it easier for you to scale out Postgres horizontally with Citus.

If you’re interested in HOW we implemented dependency propagation in a distributed PostgreSQL cluster, be sure to let us know @citusdata on Twitter. I’m thinking about writing a follow-on post that explores how we designed and implemented Citus custom type propagation.

Nils Dijk

Written by Nils Dijk

Senior Postgres and Citus software engineer at Microsoft. Systems to provide social insights at Bottlenose. Dispatching stack at Uber focused on distributed computing. Enjoys making cocktails. Clarified Whisky Sour anyone?

@ThaNODnl thanodnl