Performance Boosting

Citus DB partitions an incoming query into smaller queries, and sends these smaller queries to worker nodes for parallel processing. The worker nodes then apply PostgreSQL's standard planning and execution logic for these smaller SQL queries. Therefore, any optimization that helps PostgreSQL also helps Citus DB.

The first set of such optimizations relates to configuration settings for the database. PostgreSQL by default comes with conservative resource settings; and among these settings, shared_buffers is probably the most important one. This setting defines the amount of memory allocated to the database, and defaults to 128 MB. On worker nodes with 2 GB or more RAM, this value should be set to at least 512 MB.

all-nodes# sudo sysctl -w kernel.shmmax=629145600
all-nodes# emacs -nw /opt/citusdb/4.0/data/postgresql.conf

shared_buffers = 512MB

Note that in addition to the PostgreSQL config, you also need to change the shared memory kernel parameter here. Further, this kernel parameter doesn't get persisted across reboots, and you need to edit the /etc/sysctl.conf file to permanently save this parameter's value.

Apart from shared_buffers, several other configuration settings directly impact query performance. These settings are temp_buffers, work_mem, effective_cache_size, and default_statistics_target; and increasing their values generally results in higher performance. In addition to these, PostgreSQL also allows configuring dozens of other configuration entries. These settings are covered in more detail in the PostgreSQL manual and are also discussed in the PostgreSQL 9.0 High Performance book.

When configuring these settings, you also want to ensure that either the autovacuum daemon is enabled, or that you run the ANALYZE command regularly. The PostgreSQL query planner relies on statistical information about the contents of tables in order to generate good plans for queries; and these statistics are gathered when ANALYZE is run, either manually or as part of the autovacuum daemon.

Lastly, data's physical layout on disk significantly impacts performance. PostgreSQL stores rows on disk according to their insertion order, and relies on indexes to answer queries that involve selecting a subset of data. This however causes unnecessary disk I/O when table rows are notably smaller in size than the database blocks holding them. For example, if one database block holds eight table rows and the database needs to find five particular rows, then the database ends up fetching five blocks (forty rows).

If the database table has a natural dimension on which most queries are issued, this data layout (or lack thereof) hurts performance. PostgreSQL allows reordering of data's physical layout on disk through the CLUSTER command. This command creates a temporary table that has its rows ordered according to an existing index's sort order, and swaps the actual table with the temporary one once the command is done. Later insertions into the table therefore aren't clustered.

Citus DB saves cluster commands issued against tables, but doesn't run the command after data uploads. You therefore need to manually log in to worker nodes and issue the CLUSTER command to cluster all shards in a database.