Introducing pg_auto_failover: A high availability and automated failover Postgres extension

As part of the Citus team (Citus scales out Postgres horizontally, but that’s not all we work on), I've been working on pg_auto_failover for quite some time now and I'm excited that we have now introduced pg_auto_failover as Open Source, to give you automated failover and high availability!

When designing pg_auto_failover, our goal was this: to provide an easy to set up Business Continuity solution for Postgres that implements fault tolerance of any one node in the system. The documentation chapter about the pg_auto_failover architecture includes the following:

It is important to understand that pg_auto_failover is optimized for Business Continuity. In the event of losing a single node, then pg_auto_failover is capable of continuing the PostgreSQL service, and prevents any data loss when doing so, thanks to PostgreSQL Synchronous Replication.

Introduction to pg_auto_failover

The pg_auto_failover solution for Postgres is meant to provide an easy to setup and reliable automated failover solution. This solution includes software driven decision making for when to implement failover in production.

The most important part of any automated failover system is the decision making policy, and we have a whole documentation chapter online about pg_auto_failover fault tolerance mechanisms.

When using pg_auto_failover, multiple active agents are deployed to keep track of your production Postgres setup properties:

the monitor, a Postgres database itself equipped with the pg_auto_failover extension, registers and checks the health of the active Postgres nodes.
each Postgres node that is registered in the pg_auto_failover monitor must also run a local agent, the pg_autoctl run service.
each Postgres service that is managed has two Postgres nodes set up together in the same group. A single monitor setup may manage as many Postgres groups as needed.

With such a deployment, the monitor connects to every registered node on a regular schedule (20s by default) and registers success or failure in its pgautofailover.node table.

In addition to that, the pg_autoctl run service on each Postgres node checks that Postgres is running and watches the pg_stat_replication statistics for the other node. This Postgres system view allows our local agent to discover the network connectivity between primary and standby node. The local agent reports the state of each node to the monitor on a regular schedule, every 5s, unless a transition is needed, and then without delay.

The pg_auto_failover monitor takes decisions depending on the known state of both nodes in a cluster, and only follows a Finite State Machine that we carefully designed to ensure convergence of the nodes. In particular, the FSM only makes progress once the pg_autoctl agents have reported success to implement the decided transition to a new state. The architecture documentation section about failover logic contains images of the FSM we use to ensure automated failover decision making in pg_auto_failover.

Quick Start with pg_auto_failover

Once more, please refer to the Quickstart documentation section of pg_auto_failover for more details. When trying out the project for the first time, the easiest way forward is to create a monitor and then register a primary Postgres instance, then a secondary Postgres instance.

Here's a list of shell commands that implement a simple deployment, all on localhost, for project discovery purposes.

The Monitor

In a first terminal, terminal tab, screen or tmux window, run the following commands to create a monitor, including initializing a Postgres cluster with initdb, installing our pg_auto_failover extension, and opening connection privileges in the HBA file.

First, we prepare the environment in the terminal:

$ mkdir /tmp/pg_auto_failover/test
$ export PGDATA=/tmp/pg_auto_failover/test/monitor

Then we can create the monitor Postgres instance, here on localhost on the port 6000, using the PGDATA environment setting we just prepared:

$ pg_autoctl create monitor --nodename localhost --pgport 6000
12:12:53 INFO  Initialising a PostgreSQL cluster at "/tmp/pg_auto_failover/test/monitor"
12:12:53 INFO  Now using absolute pgdata value "/private/tmp/pg_auto_failover/test/monitor" in the configuration
12:12:53 INFO   /Applications/Postgres.app/Contents/Versions/10/bin/pg_ctl --pgdata /tmp/pg_auto_failover/test/monitor --options "-p 6000" --options "-h *" --wait start
12:12:53 INFO  Granting connection privileges on 192.168.1.0/24
12:12:53 INFO  Your pg_auto_failover monitor instance is now ready on port 6000.
12:12:53 INFO  pg_auto_failover monitor is ready at postgres://autoctl_node@localhost:6000/pg_auto_failover
12:12:53 INFO  Monitor has been successfully initialized.

Now we can re-display the connection string to the monitor:

$ pg_autoctl show uri
postgres://autoctl_node@localhost:6000/pg_auto_failover

The Postgres primary node

In another terminal (tab, window, do it your usual way), now create a primary PostgreSQL instance:

$ export PGDATA=/tmp/pg_auto_failover/test/node_a
$ pg_autoctl create postgres --nodename localhost --pgport 6001 --dbname test --monitor postgres://autoctl_node@localhost:6000/pg_auto_failover
12:15:27 INFO  Registered node localhost:6001 with id 1 in formation "default", group 0.
12:15:27 INFO  Writing keeper init state file at "/Users/dim/.local/share/pg_autoctl/tmp/pg_auto_failover/test/node_a/pg_autoctl.init"
12:15:27 INFO  Successfully registered as "single" to the monitor.
12:15:28 INFO  Initialising a PostgreSQL cluster at "/tmp/pg_auto_failover/test/node_a"
12:15:28 INFO  Now using absolute pgdata value "/private/tmp/pg_auto_failover/test/node_a" in the configuration
12:15:28 INFO  Postgres is not running, starting postgres
12:15:28 INFO   /Applications/Postgres.app/Contents/Versions/10/bin/pg_ctl --pgdata /private/tmp/pg_auto_failover/test/node_a --options "-p 6001" --options "-h *" --wait start
12:15:28 INFO  CREATE DATABASE test;
12:15:29 INFO  FSM transition from "init" to "single": Start as a single node
12:15:29 INFO  Initialising postgres as a primary
12:15:29 INFO  Transition complete: current state is now "single"
12:15:29 INFO  Keeper has been successfully initialized.

This commands registers a PostgreSQL instance to the monitor, creates the instance with pg_ctl initdb, prepares some connection privileges for the monitor health check, creates a database named test for you. Then the first transition ordered by the monitor is implemented, reaching the state SINGLE from the state INIT.

And now we start the pg_autoctl run service, interactively in the terminal, because we're still testing. For a production setup this would go in a system service that is taken care of a boot time, such as systemd.

$ pg_autoctl run
12:17:07 INFO  Managing PostgreSQL installation at "/tmp/pg_auto_failover/test/node_a"
12:17:07 INFO  pg_autoctl service is starting
12:17:07 INFO  Calling node_active for node default/1/0 with current state: single, PostgreSQL is running, sync_state is "", WAL delta is -1.

The last line is going to be repeated every 5s, showing that the primary node is healthy and can connect to the monitor all right, as expected. Also, it's in the SINGLE state now, which is going to change as soon as a new Postgres node joins the group.

The Postgres secondary node

It's now time to create a secondary Postgres instance in yet another terminal:

$ export PGDATA=/tmp/pg_auto_failover/test/node_b
$ pg_autoctl create postgres --nodename localhost --pgport 6002 --dbname test --monitor postgres://autoctl_node@localhost:6000/pg_auto_failover
12:21:08 INFO  Registered node localhost:6002 with id 5 in formation "default", group 0.
12:21:09 INFO  Writing keeper init state file at "/Users/dim/.local/share/pg_autoctl/tmp/pg_auto_failover/test/node_b/pg_autoctl.init"
12:21:09 INFO  Successfully registered as "wait_standby" to the monitor.
12:21:09 INFO  FSM transition from "init" to "wait_standby": Start following a primary
12:21:09 INFO  Transition complete: current state is now "wait_standby"
12:21:14 INFO  FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
12:21:14 INFO  The primary node returned by the monitor is localhost:6001
12:21:14 INFO  Initialising PostgreSQL as a hot standby
12:21:14 INFO  Running /Applications/Postgres.app/Contents/Versions/10/bin/pg_basebackup -w -h localhost -p 6001 --pgdata /tmp/pg_auto_failover/test/backup -U pgautofailover_replicator --write-recovery-conf --max-rate 100M --wal-method=stream --slot pgautofailover_standby ...
12:21:14 INFO  pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/2000028 on timeline 1
pg_basebackup: starting background WAL receiver
32041/32041 kB (100%), 1/1 tablespace                                        
pg_basebackup: write-ahead log end point: 0/20000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: base backup completed

12:21:14 INFO  Postgres is not running, starting postgres
12:21:14 INFO   /Applications/Postgres.app/Contents/Versions/10/bin/pg_ctl --pgdata /tmp/pg_auto_failover/test/node_b --options "-p 6002" --options "-h *" --wait start
12:21:15 INFO  PostgreSQL started on port 6002
12:21:15 WARN  Contents of "/tmp/pg_auto_failover/test/node_b/postgresql-auto-failover.conf" have changed, overwriting
12:21:15 INFO  Transition complete: current state is now "catchingup"
12:21:15 INFO  Now using absolute pgdata value "/private/tmp/pg_auto_failover/test/node_b" in the configuration
12:21:15 INFO  Keeper has been successfully initialized.

This time the registration with the monitor returns the state WAIT_STANDBY, which drives pg_autoctl to create a secondary node. That's because a server already exists in the group and is currently SINGLE. In parallel to that, the monitor assigns the goal state WAIT_PRIMARY to the primary node, where the local pg_autoctl agent is going to retrieve the node name and port of the new node from the monitor database and open pg_hba.conf for replication. When that's done, then the secondary node continues with pg_basebackup, installs a recovery.conf file, starts the local Postgres service and informs the monitor about reaching the goal state.

We're still CATCHING_UP though. This means that automatic failover is not possible yet. To be able to orchestrate a failover, we need to run our local service on the new node, monitoring Postgres health and replication status and reporting to the monitor every 5s:

$ pg_autoctl run
12:26:26 INFO  Calling node_active for node default/5/0 with current state: catchingup, PostgreSQL is running, sync_state is "", WAL delta is -1.
12:26:26 INFO  FSM transition from "catchingup" to "secondary": Convinced the monitor that I'm up and running, and eligible for promotion again
12:26:26 INFO  Transition complete: current state is now "secondary"
12:26:26 INFO  Calling node_active for node default/5/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.

Now the new node is in the SECONDARY state and keeps reporting to the monitor, ready to promote the local Postgres instance when the monitor takes the decision.

Automated and Manual Failover with pg_auto_failover

All it takes with pg_auto_failover to provision a PostgreSQL cluster with automated failover is two commands per node: first use pg_autoctl create ... to create the node, and then pg_autoctl run for the local service to implement the transitions decided by the monitor.

To witness a failover, the easiest way is to stop the pg_autoctl run service (using either ^C in terminal where it runs, or pg_autoctl stop --pgdata ... from anywhere else; and then stop the Postgres instance too, using pg_ctl -D ... stop.

When stopping only Postgres, the pg_autoctl run service would detect that situation as abnormal and first try to restart Postgres, of course. A failover would only be deemed appropriate when failing to start Postgres 3 times in a row with the default pg_auto_failover parameters.

Another way to inject a failover condition is to politely ask the monitor to please orchestrate one for you:

$ psql postgres://autoctl_node@localhost:6000/pg_auto_failover
> select pgautofailover.perform_failover();

Connection String for the Application and Clients

This whole setup is running as a formation in pg_auto_failover terms. The default formation is named default, and contains a single group of two Postgres instances. The idea is that there's a single entry point for connecting applications to any given formation. To get the connection string to our pg_auto_failover managed Postgres service, issue the following command, e.g. on the monitor terminal:

$ pg_autoctl show uri --formation default
postgres://localhost:6002,localhost:6001/test?target_session_attrs=read-write

We are using the multiple hosts facility of libpq here. This works with any modern Postgres driver when based off on libpq (that's most of them), and other native drivers are known to implement the same facility, such as the JDBC Postgres driver.

Of course, if works with psql:

$ psql postgres://localhost:6002,localhost:6001/test?target_session_attrs=read-write
psql (12devel, server 10.7)
Type "help" for help.

test# select pg_is_in_recovery();
 pg_is_in_recovery
═══════════════════
 f
(1 row)

When using such a connection string, the connection driver connects to the first host and checks whether it accepts writes, and if that's not the case, then it connects to the second host and checks again. That's because we said that we want target_session_attrs to be read-write.

Using this feature of core Postgres, we have implemented client side High-Availability: in case of a failover, our node_b is going to be the primary and we need the application to now target node_b for writes, and that's done automatically at the connection driver level.

High-Availability, Fault Tolerance, and Business Continuity

So pg_auto_failover is all about Business Continuity and works with a single standby server for each primary Postgres server.

In the classic HA setup for Postgres, we rely on synchronous replication with two standby servers for each primary server. That's the expected architecture when you want to achieve a zero or close to zero RTO and RPO objective.

Also, the idea of using two standby nodes per primary is that you can lose any standby server and still happily accept writes knowing that the data is going to be available at two different places still. This is a very good property to have in many production setup, and it's the target of other existing Postgres HA tools.

In cases though, the best production setup trade-off is different than what current Postgres HA tools support. Sometimes it's ok to face with service interruption while a disaster recovery procedure needs to be performed, because the evaluation of the risk of that situation being necessary is compliant with either the production budget, the expected SLA, or a combination thereof.

Not all projects need more than 99.95% availability, even without going to the last mile with a goal of 99.999% that is sometimes required. Moreover, while IOT and some other use cases such as giant user bases need HA solutions that scales with Terabytes to Petabytes of data, lots of projects are intended to smaller audiences and data sets. When you have Gigabytes of data, or even tens of Gigabytes of data, disaster recovery timings are not impossible to swallow anymore, depending on your SLA terms.

Availability of the Data

pg_auto_failover uses PostgreSQL synchronous replication to guarantee that no data is lost at the time of the failover operation. The sync rep Postgre feature guarantees that when the client application has received the COMMIT message back from Postgres, the data has made it to our secondary node.

pg_auto_failover works correctly when facing the loss of any ONE node in the system. When losing the primary and then also the secondary, there's nothing left short of backups. When using pg_auto_failover, you still have to set up a proper Disaster Recovery solution for the cases when you're losing more than one server at a time. Yes, this happens.

Also watch out for the infamous _file system is full_ which likes to strike both the primary and its secondary at about the same time, given our habits to deploy servers of similar specifications...

Conclusion

Pretty much the entire Citus team here at Microsoft is excited about the open source release of the pg_auto_failover extension. We released pg_auto_failover under the Postgres open source license, so that you can enjoy our contribution in the same exact capacity as when deploying Postgres. The project is fully open and every one is welcome to participate and contribute to it on our GitHub repo at https://github.com/citusdata/pg_auto_failover. We are following the Microsoft Open Source Code of Conduct and will make sure that all of you feel welcome and are listened to.

My hope is that many of you will now be able to deploy Postgres in production with an an automated failover solution, thanks to pg_auto_failover.!

Written by Dimitri Fontaine

PostgreSQL major contributor & author of “The Art of PostgreSQL”. Contributed extension facility & event triggers feature in Postgres. Maintains pg_auto_failover. Speaker at so many conferences.