POSETTE 2024 is a wrap! 💯 Thanks for joining the fun! Missed it? Watch all 42 talks online 🍿
POSETTE 2024 is a wrap! 💯 Thanks for joining the fun! Missed it? Watch all 42 talks online 🍿
Written by Ozgun Erdogan
January 26, 2018
Today, we’re excited to announce our latest release of our distributed database—Citus 7.2. With this release, we’re making Citus more of a drop-in replacement for your single-node Postgres database, so you don’t need to adapt your SQL for a distributed system.
For multi-tenant applications where the single-tenant queries were scoped to a single machine, Citus already provided full SQL support. . The improvements in Citus 7.2 take our support for distributed SQL one big step further. With Citus database version 7.2, we now extend our distributed SQL support to queries that run on data spread across a cluster of machines. This becomes particularly important for real-time analytics workloads, where even the most complex SELECT queries need to be parallelized across machines.
If you’re into bulleted lists, here’s the quick overview of what’s new in Citus database version 7.2 for distributed queries that span across machines. For an overview of other recent Citus features check out these blogs about distributed transactions and Citus 7.1.
For all these new features in Citus 7.2, you can try them today by downloading Citus open source on your local machine or on our fully-managed database service in the cloud. Or keep reading to learn more about new Citus database features.
Common Table Expressions (CTEs) provide a way to write a series of statements for use in a larger query. Each statement can be thought of as a temporary table that exists only during that transaction. You can create multiple of these statements, which then allow for clear building blocks and make it easy to follow what you’re doing.
CTEs are also known as WITH queries. You can read more about them in the Postgres documentation or find more examples in Craig’s blog post.
Citus 7.2 now provides coverage for all SELECT statements that can appear within a CTE. For example, let’s say you copied all GitHub event data into Citus. With the following query, you can find the latest commits for the top 5 contributors to a particular GitHub repo. This CTE first finds all commits in a repo, picks authors from those commits, and orders the authors by their commit counts.
-- get the latest commits from the top 5 contributors in a particular repo
PREPARE latest_commits(text) AS
WITH postgres_commits AS (
SELECT
created_at, jsonb_array_elements(payload->'commits') AS comm
FROM
github.events
WHERE
repo->>'name' = $1 AND payload->>'ref' = 'refs/heads/master'
),
commits_by_author AS (
SELECT
created_at, comm->'author'->>'name' AS author, comm->>'message' AS message
FROM
postgres_commits
),
top_contributors AS (
SELECT
author, count(*)
FROM
commits_by_author
GROUP BY 1 ORDER BY 2 DESC LIMIT 5
)
SELECT
author, created_at, string_agg(message,'\n') AS latest_messages
FROM
commits_by_author c JOIN top_contributors USING (author)
WHERE
created_at = (SELECT max(created_at) FROM commits_by_author WHERE author = c.author)
GROUP BY
author, created_at, top_contributors.count
ORDER BY
top_contributors.count DESC;
EXECUTE latest_commits ('postgres/postgres');
EXECUTE latest_commits ('citusdata/citus');
Citus 7.2 further extends coverage for queries that include complex subqueries. In particular, Citus 7.2 adds support for two common subquery types. First, you can now join subqueries with tables or other subqueries, including joins between non co-located subqueries.
Second, Citus now provides support for all subqueries in the WHERE clause. You frequently asked to get better insights from the underlying data. We listened. For example, you might want to find which users causes events of types A and B (funnel queries). Or, you might want to find the most visited 25 pages for your webpage and count the number of distinct sessions viewing those webpages. With Citus 7.2, you can now easily answer those questions at scale.
-- How many distinct sessions viewed the most visited 25 web pages?
SELECT page_id, count(distinct session_id)
FROM visits
WHERE page_id in (
SELECT page_id
FROM visits
GROUP BY page_id
ORDER BY count(*) desc
LIMIT 25
)
GROUP BY page_id;
In PostgreSQL, users can combine the results of two queries using the set operations union, intersection, and difference. The syntax for set operations is simple; and includes two queries.
UNION
appends the results of a second query to the results of the first one. INTERSECT
returns all rows that are in the results of both the first and second queries. EXCEPT returns all rows that are in the result of the first query, but not in the result of the second query.
Citus database 7.2 allows you to freely use set operations in your queries. Restrictions in previous versions of Citus have been removed.
One of the benefits of using Citus is that you can also use it as regular Postgres. In fact, some of our users follow a hybrid approach. They shard their larger tables, make tables that join with these large tables as reference tables, and keep remaining tables as Postgres tables.
When you follow this hybrid approach, however, you may run into scenarios where you need to join your Citus and Postgres tables. With Citus 7.2, you can now join local and distributed tables by wrapping the query touching the local table in a CTE statement or subquery block. You no longer need to create a temporary table on the coordinator node for the join.
Citus 7.2 also adds support for joins that include non-equality clauses. Joins with non-equality clauses usually come up in three different contexts:
For example, the following query finds all accounts that were created before a particular time interval. For that, the query joins a distributed and a reference table, and includes a compound join statement that compares an account’s creation time with a time range.
SELECT b.created_at AS dt, b.fk_account_id, tool,
case when pm.fk_plan_id is null then 1
else pm.fk_plan_id end AS fk_plan_id, b.duration_ms
FROM
ls.account_broadcasts b
JOIN ls.account_plan_membership pm
ON b.fk_account_id = pm.fk_account_id and b.created_at <@ pm.during;
One of the most exciting features in PostgreSQL 10 is declarative partitioning. When we announced Citus 7, we added native support for partitioning. With these changes, you could shard your tables on their user or device id columns, and further partition those sharded tables on time.
This way, you could build your scalable time-series database on Citus and leverage native partitioning’s benefits. In particular, you could expire your old data faster, and get higher ingest ratios when your indexes don’t fit in memory.
Still, native partitioning has the difficulty that it takes effort to manage partitions. pg_partman helps with partition management: it’s a widely used extension to create and manage time-based partition sets. pg_partman also applies a retention policy to optionally drop old partitions and includes a background worker process to run partition maintenance without the need for an external scheduler.
With Citus 7.2, we’re excited to add support for pg_partman through broadening our transaction support. You can read more about how to use Citus and pg_partman to create a scalable time-series database.
This blog post covers the highlights in Citus 7.2. There are many more changes in this release and you can read the complete list in the Citus GitHub repo. Three additional features in Citus 7.2 worth mentioning are:
We’re excited to release Citus 7.2, helping you to never have to worry about scaling your database again. As always, the Citus extension to Postgres is available for open source downloads, as enterprise on-prem software, and in the cloud as a managed database service.
If you give Citus a spin, we’d also love your feedback. And if you have feature requests, we’re all ears. Feel free to join the conversation in our slack channel and let us know what you think!