It’s a common misconception that high-volume read-write workloads in PostgreSQL inevitably cause database inefficiency. We’ve heard of cases where users encounter slowdowns doing only a few hundred writes per second and turn to systems like Dynamo or Cassandra out of frustration. However, PostgreSQL can handle these workloads without a problem as long as it is configured correctly.
The problems stem from what's known as "bloat," a phenomenon of PostgreSQL and other MVCC databases which causes increased space usage and decreased performance. We'll see how autovacuum, a tool to combat bloat, is typically misunderstood and misconfigured. By taking a low-level look at PostgreSQL internals we'll arrive at a better autovacuum configuration. Finally we’ll consider how distributing data across a PostgreSQL cluster like Citus can also fight bloat.
Here's where the trouble begins: database designers want to allow read-only transactions in a database to return results without blocking on concurrent updates. Doing so reduces the latency of requests in read-heavy workloads like those common to web applications.
However, allowing reads to proceed without pause requires maintaining a different snapshot of the world for some requests, and ultimately reconciling the differences. These white lies incur a space penalty familiar to all liars -- you need a good memory to keep your story straight!
PostgreSQL and other modern relational databases use a technique called Multi-Version Concurrency Control (MVCC) to keep track of snapshots for each transaction, and the space penalty of MVCC is called bloat. PostgreSQL is a bloat machine and will create it without compunction. PostgreSQL needs help from an external tool called "vacuum" to have a chance of cleaning up.
For reasons we'll get into later, bloated tables and indices not only waste space but slow queries down. So it's not a matter of getting a huge hard drive and forgetting about bloat. Where there are updates there is bloat, and it's up to you to vacuum.
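When autovacuum can’t keep up, cleanup can also be triggered by hand. A minimal sketch (the table name is illustrative; plain VACUUM reclaims dead-row space for reuse, while VACUUM FULL rewrites the table entirely):

```sql
-- reclaim dead-row space for reuse within the table
VACUUM foo;

-- rewrite the table to return space to the operating system;
-- takes an exclusive lock that blocks reads and writes, so use sparingly
-- VACUUM FULL foo;
```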
It's not as bad as it used to be. In the distant past (pre PostgreSQL 8) DBAs had to vacuum manually. They had to balance the resource consumption of vacuuming against the existing db load to decide when to do it, and potentially when to interrupt it. Nowadays we can configure the "autovacuum" daemon to clean up for us at the most opportune times.
Autovacuum works well when configured correctly. However its default configuration is appropriate for databases of a few hundred megabytes in size, and is not aggressive enough for larger databases. In production environments it starts to fall behind.
Once vacuum has fallen behind it consumes more resources when it does run and it interferes with normal query operation. This can lead to a vicious cycle where database administrators mistakenly reconfigure the "resource hog" autovacuum to make it run less frequently or not at all. Autovacuum is not the enemy, and turning it off is disastrous.
PostgreSQL numbers every new transaction with an increasing identifier (txid). All rows in a table also have hidden columns (xmin, xmax) recording the minimum and maximum transaction ids which are permitted to see the row. You can imagine SELECT statements implicitly including

WHERE xmin <= txid_current() AND (xmax = 0 OR txid_current() < xmax)

Rows that no active or future transaction could ever see are considered "dead": no active transaction has a txid satisfying xmin ≤ txid < xmax, and no future one ever will.
Newly inserted or updated rows use the txid of the transaction that created them for their xmin, and deleted rows set xmax with the txid that deleted them.
begin;
select txid_current(); -- suppose it returns 1
create table foo (bar integer);
insert into foo (bar) values (100);
select xmin, xmax from foo;
commit;
┌──────┬──────┐
│ xmin │ xmax │
├──────┼──────┤
│    1 │    0 │
└──────┴──────┘
If we update the row then xmin advances:
begin;
update foo set bar = 200;
select xmin, xmax from foo;
commit;
┌──────┬──────┐
│ xmin │ xmax │
├──────┼──────┤
│    2 │    0 │
└──────┴──────┘
What this doesn't show is that there is now a dead row in the table. Updating a row actually deletes it and inserts a new one with the changed values. The row we are seeing is newly inserted (by txid 2) and the original row is hanging out on disk with xmin=1, xmax=2. We can confirm by asking for information about the tuples (rows) in this table.
create extension pgstattuple;
select tuple_count, dead_tuple_count from pgstattuple('public.foo');
┌─────────────┬──────────────────┐
│ tuple_count │ dead_tuple_count │
├─────────────┼──────────────────┤
│           1 │                1 │
└─────────────┴──────────────────┘
PostgreSQL also provides a low-level API to view information about physical database pages (the chunks of table space stored on disk). This API lets us see the xmin and xmax of every row, including dead ones, although for security reasons it requires superuser privileges.
create extension pageinspect;
select t_xmin, t_xmax from heap_page_items(get_raw_page('foo', 0));
┌────────┬────────┐
│ t_xmin │ t_xmax │
├────────┼────────┤
│      1 │      2 │
│      2 │      0 │
└────────┴────────┘
At this point you can already see a way to generate bloat: just continuously update lots of rows in a table. If autovacuum were turned off, the table size would continue to grow even though the number of visible rows stayed the same. Another way to make bloat is to insert a large number of rows in a transaction and then roll back rather than commit.
If autovacuum is running it can clean the dead rows, unless… the deleted rows are prevented from dying! In this horror-movie scenario one transaction stays running for a long time (such as an analytical query) and its txid prevents rows from being marked dead, even if they are deleted by another statement. The long-running query doesn't even have to consult the deleted rows; their mere presence when the query started ensures they cannot be removed. Mixing OLTP and long-running analytics queries is a dangerous cocktail.
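To spot this scenario on a live system, one place to look is pg_stat_activity. A sketch (the one-hour cutoff is an arbitrary choice for illustration):

```sql
-- transactions open for over an hour; any of these can pin
-- dead rows and keep vacuum from reclaiming them
SELECT pid, state, xact_start, query
FROM pg_stat_activity
WHERE xact_start < now() - interval '1 hour'
ORDER BY xact_start;
```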
Outside of the intractable zombie apocalypse above, autovacuum can keep things under control with proper configuration. Let's look at some of the consequences of bloat before considering autovacuum.
Aside from simply being a waste of space, bloat hurts query speed. Every table and index is stored as an array of fixed-size pages (usually 8KB). When a query requests rows, the database loads these pages into memory. The more dead rows per page the more wasted I/O during loading. For instance a sequential scan must load and pass over all the dead rows.
Bloat also makes it less likely that the active tuples for a query will fit in memory at once. Bloat makes live tuples sparser per physical page hence more pages are required in memory for the same number of live rows. This causes swapping and makes certain query plans and algorithms ineligible for execution.
One nasty case of table bloat is PostgreSQL's own system catalogs. The catalogs can bloat because they are tables too. One way to cause this is to cycle through temporary tables, constantly creating and destroying them, which causes constant updates to the catalog tables. When the catalog is bloated, administrative functions slow down, and even things like running \d in psql become slow.
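The statistics views make it easy to check whether the catalogs are suffering. A sketch using pg_stat_sys_tables (the counters are estimates; pg_attribute is usually the first catalog to churn when temp tables come and go):

```sql
-- dead-tuple counts for the system catalogs
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_sys_tables
WHERE schemaname = 'pg_catalog'
ORDER BY n_dead_tup DESC
LIMIT 5;
```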
Indices get bloated too. An index is a mapping from data key values to tuple identifiers. These identifiers name a page of the "heap" (aka table storage) and offset within that page. Each tuple is an independent object that needs its own index entry. A row update always creates all-new index entries for the row.
The performance degradation of indices is less severe than that of tables for a few reasons. Index items pointing to a dead tuple can be marked as dead. That leaves the index bloated size-wise, but doesn't lead to unnecessary heap lookups. Also, updates to tuples in the heap that do not affect the indexed column(s) use a technique called HOT to provide pointers from the dead tuple to its replacement. This allows queries to reuse old index entries by following pointers across the heap.
The size considerations of index bloat are still significant. For instance a btree index consists of a balanced tree of pages (the same sized pages as you find holding tuples in the heap). The leaf pages contain values and tuple identifiers. Uniform random table updates tend to keep a btree index in pretty good shape because it can reuse pages. However, lopsided inserts/updates affecting one side of the tree while preserving a few straggling entries can lead to lots of mostly empty pages.
To see whether a btree index is efficiently using its page space you can ask pgstatindex. The average leaf density is the percentage of index leaf page usage:
SELECT avg_leaf_density FROM pgstatindex('btree_index_name');
Autovacuum keeps the database trim and fast. It begins work when certain configurable conditions are met, and takes a break when it detects it is being too intrusive toward regular queries.
For every database in a cluster, autovacuum attempts to start a new worker once every autovacuum_naptime (default 1 minute). It will run at most autovacuum_max_workers (default 3) at a time.
Each worker looks for a table that needs help. The worker searches for tables where PostgreSQL's statistics records indicate a large enough number of rows have changed relative to the table size. In particular the worker looks for a table where [estimated rows invalidated] ≥ autovacuum_vacuum_scale_factor * [estimated table size] + autovacuum_vacuum_threshold.
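The same arithmetic can be run from SQL to see which tables are currently past the trigger point. A sketch assuming the global settings apply everywhere (it ignores per-table reloptions overrides, and reltuples is only an estimate):

```sql
-- tables whose dead-tuple count exceeds the autovacuum trigger
SELECT s.relname,
       s.n_dead_tup,
       current_setting('autovacuum_vacuum_threshold')::int
         + current_setting('autovacuum_vacuum_scale_factor')::float
         * c.reltuples AS vacuum_trigger_point
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE s.n_dead_tup >
      current_setting('autovacuum_vacuum_threshold')::int
        + current_setting('autovacuum_vacuum_scale_factor')::float
        * c.reltuples;
```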
The worker starts removing dead tuples from the table and compacting pages. As all workers proceed, autovacuum keeps a counter of the I/O "credits" they are consuming. Different kinds of actions count for varying credits (the values are configurable). When the credits used exceed autovacuum_vacuum_cost_limit, autovacuum pauses all workers for autovacuum_vacuum_cost_delay milliseconds.
Vacuuming is a race against time. When compacting pages, a vacuum worker scans through the heap looking for dead rows and adding them to a list. It uses the list to first remove index entries pointing at those rows, then removes the rows from the heap. If there are many rows to clean and maintenance_work_mem is limited, the worker cannot process as many dead rows in each pass and will waste time repeating this process more frequently.
This explains one way that autovacuum can fall behind: when many dead rows have accumulated, autovacuum doesn't have enough maintenance work memory to remove them quickly, and furthermore gets throttled by autovacuum_vacuum_cost_limit. This can be pronounced on disproportionately large tables in a database. The default value of autovacuum_vacuum_scale_factor = 0.2 may be appropriate for small tables, but it's too large for a big table. You can configure the parameter per table:
ALTER TABLE <tablename> SET (autovacuum_vacuum_scale_factor = 0.01);
For a million-row table this means autovacuum would start after ten thousand rows are invalidated rather than two hundred thousand. It helps stop bloat from getting out of control.
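You can confirm a per-table setting took effect by checking pg_class.reloptions:

```sql
-- list tables carrying storage-parameter overrides
SELECT relname, reloptions
FROM pg_class
WHERE reloptions IS NOT NULL;
```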
Autovacuum can also fall behind when there are more bloated tables to clean than autovacuum_max_workers and all the tables are continuing to bloat. Workers cannot get around to all the tables.
Here are sensible autovacuum adjustments. They won't work for all databases of course, but will move you in the right direction.
| Setting | Default | Suggestion |
|---|---|---|
| autovacuum_max_workers | 3 | 5 or 6 |
| maintenance_work_mem | 64MB | system RAM × 3 / (8 × autovacuum_max_workers) |
| autovacuum_vacuum_scale_factor | 0.2 | Smaller for big tables, try 0.01 |
| autovacuum_vacuum_threshold | 50 | Could be larger for small tables |
| autovacuum_vacuum_cost_limit | 200 | Probably leave it alone |
| autovacuum_vacuum_cost_delay | 20ms | Can turn it down if OK with more vacuum I/O load |
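In postgresql.conf form, the suggestions might look like this for a hypothetical machine with 16GB of RAM running 6 workers (the values are illustrative, not a recommendation for your hardware):

```
autovacuum_max_workers = 6
maintenance_work_mem = 1GB     # 16GB * 3 / (8 * 6) = 1GB
autovacuum_vacuum_scale_factor = 0.01
```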
After tuning autovacuum you can wait and observe how the database responds. In fact you may want to observe the database over time prior to tuning autovacuum to avoid premature optimization. You're looking for the rate of change and percentage of bloat in tables and indices.
Use these scripts to collect measurements: pgexperts/pgx_scripts. Execute them in a cron job to follow the progress week by week.
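For a quick in-database check between cron runs, the statistics views give a rough dead-tuple ratio per table (an approximation, since the counters are estimates):

```sql
-- rough bloat signal: dead tuples as a percentage of all tuples
SELECT relname,
       n_dead_tup,
       round(100.0 * n_dead_tup
             / greatest(n_live_tup + n_dead_tup, 1), 1) AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```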
Big busy tables have great potential to bloat, both from lower sensitivity to the vacuum scale factor and from the sheer extent of their row churn. Horizontally splitting a large table into smaller tables can be useful, especially if there are a large number of vacuum workers, since only one worker can vacuum a single table at a time. Even so, running more workers requires a larger amount of maintenance work memory. One solution which both splits large tables and increases the capacity for running vacuum workers is a distributed database composed of multiple physical PostgreSQL servers and sharded tables.
It’s not just user queries which scale in a distributed database, it’s vacuuming as well. To be fair, if queries are scaling fine on a single PostgreSQL instance and bloat is the only problem then switching to a distributed system is overkill; there are other ways to aggressively fix acute bloat. However the extra vacuuming power is a pleasant side effect of going distributed. It’s easier than ever to distribute a PostgreSQL database using open-source tools like Citus Community Edition.
Alternately you can go a step further and forget about configuring autovacuum yourself by using a managed Postgres database service in the cloud. Update October 2022: My team's Citus database (which scales out Postgres) is available as a managed service in the cloud as Azure Cosmos DB for PostgreSQL.