Citus and pg_partman: Creating a scalable time series database on Postgres

Years ago, Citus had multiple methods for distributing data across many nodes: hash-based partitioning and time-based partitioning (we actually still support both today). Over time we found big benefits in further enhancing hash-based partitioning, which enabled us to add richer SQL support, transactions, foreign keys, and more. Thus in recent years we put less energy into time-based partitioning. But… no one stopped asking us about time partitioning, especially for fast data expiration. All that time we were listening; we just thought it best to align our product with the path of core Postgres rather than branch away from it.

Postgres has had some form of time-based partitioning for years. Though for many years it was a bit kludgy and wasn’t part of core Postgres. With Postgres 10 came native time partitioning, and because Citus is an extension to Postgres that means anyone using Citus gets to take advantage of time-based partitioning as well. You can now create tables that are distributed across nodes by ID and partitioned by time on disk.

We have found a few Postgres extensions that make partitioning much easier to use. The best in class for improving time partitioning is pg_partman, and today we’ll dig into getting time partitioning set up with your Citus database cluster using pg_partman.

Time-based partitioning enables fast data expiration and smaller indexes

One of the most important benefits of time-based partitioning is that it becomes very efficient to drop old data. Citus can efficiently parallelise delete operations and subsequent vacuums and can therefore be orders of magnitude faster than Postgres, but it still needs to read all the data it deletes. Dropping a partition is the equivalent of simply removing the files that contain the data, which is a fast operation, independent of the data size.

The other benefit comes when your queries primarily access recent data, but you still want to keep an archive. In that case, partitioning can have several performance benefits over keeping all data in one table. In particular, because each partition is indexed separately, queries only have a small index to traverse and the index is more likely to be in the cache.

Setting up sharding by ID and partitioning by time

With Postgres 10 you have a new set of commands to help you set up time partitioning. The key to get it started is to specify what column you’ll partition your table on when you set it up. Let’s create a table for data from the GitHub archive.

CREATE SCHEMA github; -- using an explicit schema is required by pg_partman

CREATE TABLE github.events (
  event_id bigint,
  event_type text,
  event_public boolean,
  repo_id bigint,
  payload jsonb,
  repo jsonb,
  actor jsonb,
  org jsonb,
  created_at timestamp
) PARTITION BY RANGE (created_at);

You’ll notice PARTITION BY RANGE (created_at): this tells Postgres that we’re going to partition the table by ranges of the created_at column.

Next we’re going to tell Citus to shard our table by repo_id. Each shard will contain a subset of all GitHub repositories.

SELECT create_distributed_table('github.events', 'repo_id');

Now we set up a distributed partitioned table.

A partitioned table cannot contain data itself; it is more like a view across the partitions. You need to manually create the partitions and specify their time ranges, after which you can insert data that falls within a partition’s range, e.g.:

CREATE TABLE github.events_2016 PARTITION OF github.events FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');

Note that the upper bound of a range partition is exclusive, so this partition covers all of 2016.

We made sure that Citus neatly propagates all partitioning-related commands to the shards of distributed tables. When you create a new partition in a partitioned table, Citus actually creates a new distributed table with its own shards, and each shard will follow the same partitioning hierarchy.

A better time partitioning user experience: pg_partman

Native partitioning is useful, but using it becomes much more pleasant by leveraging the pg_partman extension. The core functionality of pg_partman works out-of-the-box with Citus 7.2+ when using it for native partitioning.

To install pg_partman you’ll need to build it, then run CREATE SCHEMA partman; followed by CREATE EXTENSION pg_partman WITH SCHEMA partman;. Once you’ve installed pg_partman, you’ll have new functions that you can use to help with managing your time partitioning.
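For reference, with the extension built and available, those two setup commands are:

```sql
-- Keep pg_partman's objects in their own schema
CREATE SCHEMA partman;
CREATE EXTENSION pg_partman WITH SCHEMA partman;
```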

Now we’re going to tell Postgres what interval we want to create our partitions at and tell it to create our initial empty partitions:

-- Partition the table into hourly ranges of created_at
SELECT partman.create_parent('github.events', 'created_at', 'native', 'hourly');
UPDATE partman.part_config SET infinite_time_partitions = true
WHERE parent_table = 'github.events';

By default create_parent creates 4 partitions in the past, 4 in the future, and 1 for the present, all based on system time. If you need to backfill older data, you can specify a p_start_partition parameter in the call to create_parent, or p_premake to make more partitions for the future. See the pg_partman documentation for details.
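As a sketch (parameter names per the pg_partman documentation; the start date here is illustrative), a backfilling call might look like:

```sql
-- Create hourly partitions reaching back to an explicit date in the past
SELECT partman.create_parent('github.events', 'created_at', 'native', 'hourly',
                             p_start_partition := '2018-01-01');
```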

You can now see all the partitions using \d+ github.events.

citus=> \d+ github.events
                                                Table "github.events"
    Column    |            Type             | Collation | Nullable | Default | Storage  | Stats target | Description
--------------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
 event_id     | bigint                      |           |          |         | plain    |              |
 event_type   | text                        |           |          |         | extended |              |
 event_public | boolean                     |           |          |         | plain    |              |
 repo_id      | bigint                      |           |          |         | plain    |              |
 payload      | jsonb                       |           |          |         | extended |              |
 repo         | jsonb                       |           |          |         | extended |              |
 actor        | jsonb                       |           |          |         | extended |              |
 org          | jsonb                       |           |          |         | extended |              |
 created_at   | timestamp without time zone |           |          |         | plain    |              |
Partition key: RANGE (created_at)
Partitions: github.events_p2018_01_15 FOR VALUES FROM ('2018-01-15 00:00:00') TO ('2018-01-16 00:00:00'),
            github.events_p2018_01_16 FOR VALUES FROM ('2018-01-16 00:00:00') TO ('2018-01-17 00:00:00'),
            github.events_p2018_01_17 FOR VALUES FROM ('2018-01-17 00:00:00') TO ('2018-01-18 00:00:00'),
            github.events_p2018_01_18 FOR VALUES FROM ('2018-01-18 00:00:00') TO ('2018-01-19 00:00:00'),
            github.events_p2018_01_19 FOR VALUES FROM ('2018-01-19 00:00:00') TO ('2018-01-20 00:00:00'),
            github.events_p2018_01_20 FOR VALUES FROM ('2018-01-20 00:00:00') TO ('2018-01-21 00:00:00'),
            github.events_p2018_01_21 FOR VALUES FROM ('2018-01-21 00:00:00') TO ('2018-01-22 00:00:00'),
            github.events_p2018_01_22 FOR VALUES FROM ('2018-01-22 00:00:00') TO ('2018-01-23 00:00:00'),
            github.events_p2018_01_23 FOR VALUES FROM ('2018-01-23 00:00:00') TO ('2018-01-24 00:00:00')

Automate your table maintenance

As time progresses, pg_partman will need to do some maintenance to create new partitions and drop old ones. All this can be done through a single function call:

SELECT partman.run_maintenance(p_analyze := false);

-- note: disabling analyze is recommended for native partitioning due to aggressive locks

You will want to set up a periodic job to run the maintenance function. You can do this from your app such that you can easily put alarms on errors, or use pg_cron to schedule it from the database itself:

SELECT cron.schedule('@hourly', $$SELECT partman.run_maintenance(p_analyze := false)$$);

Once periodic maintenance is set up, partitioning becomes very much a zero-touch experience.

Expiring old data by automatically dropping partitions

To configure pg_partman to drop old partitions, you can update the partman.part_config table:

UPDATE partman.part_config SET retention_keep_table = false, retention = '1 month' 
WHERE parent_table = 'github.events';

Now whenever maintenance runs, partitions older than a month are automatically dropped.
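You can also drop a partition by hand at any time, outside of the retention policy. Since dropping removes the underlying files, it completes almost instantly regardless of data size. Using one of the partition names from the \d+ output above:

```sql
-- Dropping a partition removes its files; no row-by-row delete or vacuum needed
DROP TABLE github.events_p2018_01_15;
```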

When does time partitioning make sense?

Most applications have some usage of time-based data. It could be as simple as the created_at on a record, or it could be event/log data. The default approach of many is to jump to partitioning by time in all cases, but just because you have time series data doesn’t mean it’s the best fit.

Time partitioning makes the most sense when:

  1. You most frequently query a small subset of your most recent data
  2. You actively delete or drop older data

Having many partitions that are read frequently can cause more overhead than it takes away, but if, for example, you need to keep a year of time series data and only regularly query the last week, using partitioning can make a lot of sense.
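For example, a dashboard query restricted to recent data only needs to touch the newest partitions. With a literal timestamp bound, Postgres can exclude all other partitions from the plan entirely (the query below is illustrative):

```sql
-- Only partitions overlapping this time range are scanned
SELECT event_type, count(*)
FROM github.events
WHERE created_at >= '2018-01-23 00:00:00'
GROUP BY event_type;
```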

Be aware that native partitioning in Postgres is still quite new and has a few quirks. For example, you cannot directly create an index on a partitioned table. Instead, pg_partman lets you create a template table to define indexes for new partitions. Maintenance operations on partitioned tables will also acquire aggressive locks that can briefly stall queries. There is currently a lot of work going on within the Postgres community to resolve these issues, so expect time partitioning in Postgres to only get better.

A distributed relational time series database: Postgres with Citus and pg_partman

Postgres with Citus is already a great database for time series data, especially for use cases such as dashboards for real-time analytics and monitoring. You can use high performance data ingestion, create aggregation tables in parallel, and run advanced SQL queries in parallel across all your data.

Partitioning your distributed time series tables by time with pg_partman provides further optimisation to reduce the cost of queries on the most recent data, time-ordered writes, and data expiration. As with any optimisation, partitioning should not be applied prematurely, but pg_partman makes it very easy and the combination of scale out and logical partitioning on top of a relational database is an immensely powerful tool for dealing with large volumes of time series data.

Have any questions about whether time partitioning with Citus scale-out capabilities could be valuable for you? Drop us a note.