Hyperscale (Citus) is now GA as a built-in deployment option in Azure Database for PostgreSQL. Read Craig’s blog
Our beloved Structured Query Language may be the lingua franca for relational databases—but like many languages, SQL is in a state of constant evolution. The first releases of SQL didn’t even have a notation for joins. At the time, SQL only supported inner joins.
Cross Joins and Where Filters
As a result, back in early eighties, the only way to express a join condition between tables would be in the WHERE clause.
select companies.name, campaigns.name, campaigns.monthly_budget from companies, campaigns where companies.id = campaigns.company_id and company_id = 5;
In that query, the PostgreSQL planner realizes that we are doing an INNER JOIN in between the companies table and the campaigns table, even though what we are actually writing is a CROSS JOIN in between the tables. In PostgreSQL documentation chapter Table Expressions we can read more about the CROSS JOIN operation:
For every possible combination of rows from T1 and T2 (i.e., a Cartesian product), the joined table will contain a row consisting of all columns in T1 followed by all columns in T2. If the tables have N and M rows respectively, the joined table will have N * M rows.
FROM T1 CROSS JOIN T2 is equivalent to FROM T1 INNER JOIN T2 ON TRUE (see below). It is also equivalent to FROM T1, T2.
In my opinion, it is so rare that what you want to achieve actually is a cross join that cross joins should fail your SQL unit testing. After all, code should always express the intentions of the programmer, and if your intention is not to implement a CROSS JOIN then using a comma separated list of tables in your FROM clause looks deadly suspicious.
Inner And Outer Joins in SQL
The previous SQL query can be written properly with an inner join syntax:
select companies.name, campaigns.name, campaigns.monthly_budget from companies inner join campaigns on companies.id = campaigns.company_id where company_id = 5;
Lucky for us, the PostgreSQL planner is smart enough to realize that both queries are specifying the same result set and accordingly executes the same plan to fetch it.
Now, if a company didn’t program any advertising campaigns yet, then the company won’t be listed at all in the output. What if you want to list all companies, whether or not they have campaigns attached to them? In that case we would use an SQL outer join:
select companies.name, campaigns.name, campaigns.monthly_budget from companies left outer join campaigns on companies.id = campaigns.company_id where company_id = 5;
The keywords INNER and OUTER in the SQL syntax above are noise words that the SQL standard allows. These noise words are not mandatory. After all, a LEFT JOIN is always an OUTER join, there’s no choice here. So we can rewrite the query that way:
select companies.name, campaigns.name, campaigns.monthly_budget from companies left join campaigns on companies.id = campaigns.company_id where company_id = 5;
Planning for Joins with the PostgreSQL database
We said that PostgreSQL is smart enough to realize that there’s more than one way to specify the same result set with SQL queries written in different ways.
So it follows that one of the tasks that the PostgreSQL planner must implement consists of finding the most efficient way to execute the JOIN specified in the query.
After all JOIN(A, B, C) gives the same results as JOIN(JOIN(A, B), C) and the same result as JOIN(A, JOIN(B, C)). So how does PostgreSQL decide which order to prefer?
It turns out this problem was solved by a pioneer in relational databases
Patricia Selinger co-published Access path selection in a relational database management system in the Proceedings of the 1979 ACM SIGMOD international conference on Management of data, pp 23-34, where she exposes the algorithm she invented to plan for JOIN operations in SQL. Here’s the summary of the paper:
In a high level query and data manipulation language such as SQL, requests are stated non-procedurally, without reference to access paths. This paper describes how System R chooses access paths for both simple (single relation) and complex queries (such as joins), given a user specification of desired data as a boolean expression of predicates. System R is an experimental database management system developed to carry out research on the relational model of data. System R was designed and built by members of the IBM San Jose Research Laboratory.
The Selinger Algorithm is still in use as of today in PostgreSQL source code, and Daniel Gustafsson did an awesome job at presenting the Selinger algorithm in his presentation titled Through The Joining Glass that he delivered at PGConf.EU 2017, last year. If you want to learn more about how PostgreSQL implements JOINs, read Daniel’s slides!
Planning for Distributed Joins with PostgreSQL and Citus
When using Citus to scale out PostgreSQL horizontally (often to reap the performance benefits of a scale-out database architecture)—the same query as above might be fetching data from distributed tables. Because Citus is a PostgreSQL extension, Citus benefits from all the smarts that PostgreSQL provides when planning the query. A nice thing about this step in the planner is the Equivalence Class processing, which allows Citus to leverage the fact that PostgreSQL knows how to read the join condition and where clauses and inject the filters everywhere needed.
When PostgreSQL is done planning the query, Citus knows that there is a join
in between the companies and campaigns table, and also that both
campaigns.company_id = 5 and
company_id = 5.
If you’re developing a multi-tenant application (like a SaaS app) this means you can use Citus to scale out your data and your queries across multiple nodes. You can shard by company id and Citus will know to route the query down to a single shard (on a single node.) This Citus multi-tenant app tutorial does a good job guiding you through the process of using the Citus database to scale out a multi-tenant application.