CitusDB partitions SQL queries into smaller commutative ones and helps you get linear performance as you scale out and, in some cases, it may help you get even more

One of our attentive readers called out on a performance "discrepancy" we had in our earlier **benchmark results**. We noted that the CitusDB worker node had its data partitioned into 4 shards to make use of all 4 CPU cores. Apart from that, the worker node was a vanilla PostgreSQL database. Then, we showed performance numbers for TPC-H query #1, where query run times improved by 6x. If we had 4 CPU cores, we'd expect at most 4x. What gives?

The answer to that questions has two pieces. First, CitusDB re-plans each incoming query for distributed execution before partitioning it. In the process, the database translates the entire query into smaller *commutative* queries. For example, an avg() becomes a sum() and count(), and gets executed that way on each shard. Then, the results are post-processed behind the covers, and get returned to the user.

That's where things get interesting. It turns out that PostgreSQL executes sum() and count() functions much faster than it does avg()s on bigint and numeric types. Specifically, you can issue a sum/count for a numeric type on PostgreSQL, and get your results about 60% faster than an avg. As an example, here are two queries on PostgreSQL 9.1 going over 10 GB worth of data using one CPU:

postgres=# SELECT avg(l_quantity) FROM lineitem; avg --------------------- 25.50156219649 (1 row) Time: 44832.810 ms postgres=# SELECT sum(l_quantity)/count(l_quantity) AS avg FROM lineitem; avg --------------------- 25.50156219649 (1 row) Time: 28684.747 ms

Of course, this performance difference becomes visible only when your working set fits into memory. Still, we were curious as to the root cause of this, and ran GProf on PostgreSQL to take a deeper look. Here are the summarized profile results for avg() and sum()/count() calls.

postgres=# SELECT avg(l_quantity) FROM lineitem; ----------------------------------------------- [12] 33.21 60012150 do_numeric_avg_accum [12] 15.18 60012150/60012160 numeric_add [13] 9.86 60012150/60013124 numeric_inc [15] 4.30 60012150/60012150 construct_array [23] 1.48 60012150/60012160 deconstruct_array [35] ----------------------------------------------- postgres=# SELECT sum(l_quantity)/count(l_quantity) AS avg FROM lineitem; ----------------------------------------------- [11] 19.50 120024300 advance_transition_function [11] 14.90 60012140/60012140 numeric_add [12] 1.37 60012150/60012170 datumCopy [30] 0.00 60012150/60012150 int8inc [40] -----------------------------------------------

From this and the source code, we find out two reasons for the performance difference. First, the numeric average function uses an array of two numerics to keep track of the current sum and count. And on every call to the average function (called for every row), the function needs to deserialize this array's members, calculate the new sum and count, and then serialize these two numerics back into the array. This is costly.

Second, the average function uses numeric_inc to increase the current count. This function ends up being notably more expensive than the one that increments plain integers. And these two performance overheads make sum/counts go around 60% faster than an avg for bigint and numeric types.

In summary, CitusDB partitions SQL queries into smaller commutative ones, and helps you get linear performance as you scale out. And in some cases, it may help you get even more. :)