Citus 7.2: CTEs, complex subqueries, set operations, and more

Written by Ozgun Erdogan
January 26, 2018

Today, we’re excited to announce our latest release of our distributed database—Citus 7.2. With this release, we’re making Citus more of a drop-in replacement for your single-node Postgres database, so you don’t need to adapt your SQL for a distributed system.

For multi-tenant applications where the single-tenant queries were scoped to a single machine, Citus already provided full SQL support. . The improvements in Citus 7.2 take our support for distributed SQL one big step further. With Citus database version 7.2, we now extend our distributed SQL support to queries that run on data spread across a cluster of machines. This becomes particularly important for real-time analytics workloads, where even the most complex SELECT queries need to be parallelized across machines.

If you’re into bulleted lists, here’s the quick overview of what’s new in Citus database version 7.2 for distributed queries that span across machines. For an overview of other recent Citus features check out these blogs about distributed transactions and Citus 7.1.

What’s New in Citus 7.2

  • Common Table Expressions (CTEs)
  • Complex subqueries
  • Set operations (UNION, INTERSECT, …)
  • Joins between distributed and local tables through CTEs
  • Joins that include non-equality clauses
  • Partition management automation with pg_partman

For all these new features in Citus 7.2, you can try them today by downloading Citus open source on your local machine or on our fully-managed database service in the cloud. Or keep reading to learn more about new Citus database features.

Common Table Expressions (CTEs)

Common Table Expressions (CTEs) provide a way to write a series of statements for use in a larger query. Each statement can be thought of as a temporary table that exists only during that transaction. You can create multiple of these statements, which then allow for clear building blocks and make it easy to follow what you’re doing.

CTEs are also known as WITH queries. You can read more about them in the Postgres documentation or find more examples in Craig’s blog post.

Citus 7.2 now provides coverage for all SELECT statements that can appear within a CTE. For example, let’s say you copied all GitHub event data into Citus. With the following query, you can find the latest commits for the top 5 contributors to a particular GitHub repo. This CTE first finds all commits in a repo, picks authors from those commits, and orders the authors by their commit counts.

-- get the latest commits from the top 5 contributors in a particular repo
PREPARE latest_commits(text) AS
WITH postgres_commits AS (
  SELECT                                                                                                                                                      
    created_at, jsonb_array_elements(payload->'commits') AS comm 
  FROM
    github.events
  WHERE
    repo->>'name' = $1 AND payload->>'ref' = 'refs/heads/master'
),
commits_by_author AS (
  SELECT
    created_at, comm->'author'->>'name' AS author, comm->>'message' AS message
  FROM
    postgres_commits
),
top_contributors AS (                                                                                                                                         
  SELECT                                                                                                                                                      
    author, count(*)
  FROM
    commits_by_author
  GROUP BY 1 ORDER BY 2 DESC LIMIT 5
)
SELECT
  author, created_at, string_agg(message,'\n') AS latest_messages
FROM
  commits_by_author c JOIN top_contributors USING (author)
WHERE
  created_at = (SELECT max(created_at) FROM commits_by_author WHERE author = c.author)
GROUP BY
  author, created_at, top_contributors.count
ORDER BY
  top_contributors.count DESC;

EXECUTE latest_commits ('postgres/postgres');
EXECUTE latest_commits ('citusdata/citus');

Complex subquery support

Citus 7.2 further extends coverage for queries that include complex subqueries. In particular, Citus 7.2 adds support for two common subquery types. First, you can now join subqueries with tables or other subqueries, including joins between non co-located subqueries.

Second, Citus now provides support for all subqueries in the WHERE clause. You frequently asked to get better insights from the underlying data. We listened. For example, you might want to find which users causes events of types A and B (funnel queries). Or, you might want to find the most visited 25 pages for your webpage and count the number of distinct sessions viewing those webpages. With Citus 7.2, you can now easily answer those questions at scale.

-- How many distinct sessions viewed the most visited 25 web pages?

SELECT page_id, count(distinct session_id)
FROM visits
WHERE page_id in (
   SELECT page_id
   FROM visits
   GROUP BY page_id
   ORDER BY count(*) desc
   LIMIT 25
)
GROUP BY page_id;

Set operations (UNION, INTERSECT, EXCEPT)

In PostgreSQL, users can combine the results of two queries using the set operations union, intersection, and difference. The syntax for set operations is simple; and includes two queries.

UNION appends the results of a second query to the results of the first one. INTERSECT returns all rows that are in the results of both the first and second queries. EXCEPT returns all rows that are in the result of the first query, but not in the result of the second query.

Citus database 7.2 allows you to freely use set operations in your queries. Restrictions in previous versions of Citus have been removed.

Joins between local and distributed tables (through CTEs)

One of the benefits of using Citus is that you can also use it as regular Postgres. In fact, some of our users follow a hybrid approach. They shard their larger tables, make tables that join with these large tables as reference tables, and keep remaining tables as Postgres tables.

When you follow this hybrid approach, however, you may run into scenarios where you need to join your Citus and Postgres tables. With Citus 7.2, you can now join local and distributed tables by wrapping the query touching the local table in a CTE statement or subquery block. You no longer need to create a temporary table on the coordinator node for the join.

Joins that include non-equality clauses

Citus 7.2 also adds support for joins that include non-equality clauses. Joins with non-equality clauses usually come up in three different contexts:

  • Geo-spatial joins in PostGIS
  • Range operators, particularly when comparing time ranges
  • Comparing for inequality between column values

For example, the following query finds all accounts that were created before a particular time interval. For that, the query joins a distributed and a reference table, and includes a compound join statement that compares an account’s creation time with a time range.

SELECT b.created_at AS dt, b.fk_account_id, tool,
       case when pm.fk_plan_id is null then 1
       else pm.fk_plan_id end AS fk_plan_id, b.duration_ms
FROM
       ls.account_broadcasts b
       JOIN ls.account_plan_membership pm
       ON b.fk_account_id = pm.fk_account_id and b.created_at <@ pm.during;

Partition management automation with pg_partman

One of the most exciting features in PostgreSQL 10 is declarative partitioning. When we announced Citus 7, we added native support for partitioning. With these changes, you could shard your tables on their user or device id columns, and further partition those sharded tables on time.

This way, you could build your scalable time-series database on Citus and leverage native partitioning’s benefits. In particular, you could expire your old data faster, and get higher ingest ratios when your indexes don’t fit in memory.

Still, native partitioning has the difficulty that it takes effort to manage partitions. pg_partman helps with partition management: it’s a widely used extension to create and manage time-based partition sets. pg_partman also applies a retention policy to optionally drop old partitions and includes a background worker process to run partition maintenance without the need for an external scheduler.

With Citus 7.2, we’re excited to add support for pg_partman through broadening our transaction support. You can read more about how to use Citus and pg_partman to create a scalable time-series database.

Full list of changes in Citus 7.2

This blog post covers the highlights in Citus 7.2. There are many more changes in this release and you can read the complete list in the Citus GitHub repo. Three additional features in Citus 7.2 worth mentioning are:

  • Auto-recovery for failed 2PC transactions
  • Support for count(distinct) in queries within a subquery
  • Improvements to COPY performance

Let us know what you think about Citus 7.2

We’re excited to release Citus 7.2, helping you to never have to worry about scaling your database again. As always, the Citus extension to Postgres is available for open source downloads, as enterprise on-prem software, and in the cloud as a managed database service.

If you give Citus a spin, we’d also love your feedback. And if you have feature requests, we’re all ears. Feel free to join the conversation in our slack channel and let us know what you think!

Ozgun Erdogan

Written by Ozgun Erdogan

Co-founder & CTO of Citus Data. Former Postgres engineering team director at Microsoft. Worked on distributed systems at Amazon. Speaker at PGCon, XLDB Conf, DataEngConf, PostgresOpen, & QCon. Dad.