Citus Cloud Retrospective for 12/14/2017

On Thursday December 14th we experienced a service outage across all of Citus Cloud, our fully managed database as a service. This was our first system-wide outage since we launched the service in April of 2016.

We know that Citus Cloud customers trust us with their data, and its availability is critical. In this case we missed the mark and we’re sorry. We’ve worked over the last few days to thoroughly understand what went wrong and to identify steps we can take to limit the likelihood of this problem occurring again. In addition, we found improvements we can make to our architecture and incident response processes that will allow us to better respond to similar types of problems in the future.

What happened

Citus Cloud is composed of three parts:

  1. The underlying instances that run your database
  2. A state machine that continuously monitors your systems to ensure availability
  3. A console that you use to interact with and administer your cluster

Two days before the outage (December 12th) we performed maintenance on the systems which power our state machine to upgrade the Postgres instance that supports it. To do this we captured a backup of our existing Postgres instance, made a fresh Postgres 10.1 database, then restored our data into the new Postgres 10.1 database. With the restore complete we pushed a new release of our state machine.
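
The dump-and-restore upgrade path described above can be sketched as follows. This is a minimal, hypothetical illustration of the invocations involved; the connection strings, paths, and flags are assumptions, not Citus Cloud's actual tooling, and a real upgrade would also verify the restore before cutting over.

```python
# Hypothetical sketch of a dump-and-restore Postgres upgrade. The DSNs and
# dump path below are illustrative only. This builds the commands rather than
# executing them, so they can be reviewed before a run.

def build_upgrade_commands(old_dsn, new_dsn, dump_path):
    """Return the pg_dump/pg_restore invocations for a dump-and-restore upgrade."""
    return [
        # 1. Capture a custom-format backup of the existing instance.
        ["pg_dump", "--format=custom", "--file", dump_path, "--dbname", old_dsn],
        # 2. Restore it into the freshly provisioned Postgres 10.1 instance.
        ["pg_restore", "--no-owner", "--dbname", new_dsn, dump_path],
    ]

commands = build_upgrade_commands(
    "postgres://state@old-host/statedb",   # hypothetical old 9.6 instance
    "postgres://state@new-host/statedb",   # hypothetical new 10.1 instance
    "/tmp/statedb.dump",
)
for cmd in commands:
    print(" ".join(cmd))
```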

During the final phase of the release we received an email alert that part of the release phase failed, but upon initial inspection everything seemed to perform as expected. With everything seemingly stable we left the systems as is and resumed other tasks.

Incident Timeline on Thursday

[2:10 PM PST] On Thursday we returned to the state machine systems to clean up what we believed to be the legacy Postgres 9.6 database. When we removed this database, our state machine began failing to establish connections to the underlying instances that power Citus clusters.

For security purposes we rotate our Postgres superuser credentials every 24 hours. Because the final stage of our release process failed, we didn’t notice that the old database was still attached as the primary, which means our data skew began two days earlier. With 48 hours having elapsed since the upgrade, the superuser passwords were out of sync, and our now-primary database was using incorrect credentials when connecting to an instance to check availability.
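
To make the failure mode above concrete, here is a toy model of the credential skew. The 24-hour rotation and the stale-primary scenario come from the incident; the class names, passwords, and data structures are illustrative inventions, not our actual state machine.

```python
# Toy model of credential skew between two state machine databases. Only the
# rotation interval and the stale-primary scenario reflect the real incident;
# everything else here is an illustrative assumption.

class StateDB:
    """Stores the superuser credential the state machine uses to reach instances."""
    def __init__(self):
        self.superuser_password = "pw-v1"

def rotate_credentials(state_db, instance_passwords, instance, new_pw):
    """Rotation updates the instance *and* whichever state DB is attached."""
    instance_passwords[instance] = new_pw
    state_db.superuser_password = new_pw

def can_connect(state_db, instance_passwords, instance):
    """An availability check succeeds only if the stored credential matches."""
    return instance_passwords[instance] == state_db.superuser_password

old_primary = StateDB()   # pre-upgrade database, still attached by mistake
new_db = StateDB()        # freshly restored Postgres 10.1 database
instances = {"node-1": "pw-v1"}

# Days 1 and 2 after the upgrade: rotation keeps writing to the old database,
# because it is still attached as the primary.
rotate_credentials(old_primary, instances, "node-1", "pw-v2")
rotate_credentials(old_primary, instances, "node-1", "pw-v3")

# After the old database is removed, the new database becomes primary with a
# credential that is two rotations stale, so availability checks fail.
assert can_connect(old_primary, instances, "node-1")
assert not can_connect(new_db, instances, "node-1")
```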

[2:25 PM] As the system began detecting failures it viewed the underlying instances as offline and began rebooting them.

[2:30 PM] As our systems alerted us to the failures, we stopped our state machine from rebooting servers. Our troubleshooting uncovered the out-of-date data in our state machine database, and we opted to keep our state machine disabled so all instances could restart and stabilize. As instances came online, connections to the coordinator were succeeding, but connections to the Citus data nodes were still failing.

[2:55 PM] From here we connected to the instances and discovered that pgbouncer, the connection pooling software that manages outbound connections to our data nodes, had failed to come back online after the restarts. As we restarted pgbouncer, cluster availability was restored. We then scripted the pgbouncer restart process and peer reviewed it before running it against all Citus clusters. Once all pgbouncers were restarted, availability was restored to all Citus clusters.
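
A sketch of what scripting that restart might look like. The host names and the `systemctl` service-manager invocation are assumptions rather than our actual tooling; the point is that the script generates reviewable commands per host rather than acting blindly, which is what made peer review possible.

```python
# Illustrative sketch of scripting a fleet-wide pgbouncer restart. Host names
# and the systemctl invocation are assumptions, not the actual Citus Cloud
# tooling. Generating the commands first allows peer review before execution.

def restart_commands(hosts):
    """Build one idempotent restart command per data-node host."""
    return [f"ssh {host} 'sudo systemctl restart pgbouncer'" for host in hosts]

cluster_hosts = ["node-1.example.com", "node-2.example.com"]
for cmd in restart_commands(cluster_hosts):
    print(cmd)  # review the generated commands, then execute them
```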

After availability was restored to all instances, we turned our attention to the out-of-date state machine database and monitoring. As soon as we had discovered the inconsistent database, we had initiated a point-in-time recovery of our old Postgres 9.6 database. Once this was available to us, we created a diff of any new entries between the two databases and recovered them into our new Postgres 10 database. With the database state now consistent, we re-enabled monitoring and reporting to the Citus console.
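
The diff-and-recover step can be sketched roughly as below. This is a minimal illustration assuming rows carry a unique id; the table contents and shapes are hypothetical, not the actual state machine schema.

```python
# Minimal sketch of the "diff and recover" step: find entries present in the
# point-in-time-recovered old database but absent from the new one, and replay
# them. Row ids and contents here are hypothetical.

def missing_rows(recovered, current):
    """Rows present in the recovered DB but absent from the current one."""
    return {rid: row for rid, row in recovered.items() if rid not in current}

# The old 9.6 database kept receiving writes for two days after the upgrade,
# so it holds entries the new Postgres 10 database never saw.
old_96 = {1: "event-a", 2: "event-b", 3: "event-written-after-upgrade"}
new_10 = {1: "event-a", 2: "event-b"}

diff = missing_rows(old_96, new_10)
new_10.update(diff)   # replay the missing entries into the new database
assert new_10[3] == "event-written-after-upgrade"
```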

What we’re doing about it

There are a number of learnings and changes we’re taking away from this incident:

  1. We’re making changes to our release management infrastructure. For config changes specifically, we’re going to rely less on automation, which currently groups many steps together. Instead we will perform config changes manually. In many cases we prefer to rely on automation over error-prone manual processes, but for infrequent config changes such as these we feel safer treating them separately. Config changes such as Postgres version upgrades happen less than once a year for us. For common procedures we would typically put a playbook/checklist in place, but in the case of upgrades such as this one, the playbook risks being stale by the next time we use it and causing damage of its own.

  2. All config changes will be peer reviewed. All code changes are peer reviewed prior to deployment, but config changes did not receive the same treatment prior to this event. Going forward we will have a process in place to review and approve all config changes before they occur.

    We mistakenly believed the alert we received on the bad release was related to something much less significant, such as asset precompilation. Because all subsequent deploys succeeded, we viewed it as an intermittent error. By both performing these major config changes manually and peer reviewing config changes, we aim to catch the issues we missed in this case.

  3. We will be implementing a circuit breaker in our state machine to prevent catastrophic, sweeping issues that affect all clusters at once. In those cases we want our systems to alert an operator before taking further action.

  4. When past incidents occurred, it was easy for us to directly communicate with all affected users. Citus has grown and we realize that approach hasn’t scaled with us. As such, we’re making improvements to our incident communications. In particular you can now subscribe to our status site and get regular updates from there for future incidents.
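
To illustrate the circuit breaker idea from point 3 above: once failures affect more than a threshold number of clusters in a short window, the state machine should stop taking automatic action and page a human. This is a minimal sketch under assumed names and thresholds, not our actual design.

```python
# Hedged sketch of a circuit breaker for the state machine's reboot action.
# The class name, threshold, and interface are illustrative assumptions.

class RebootCircuitBreaker:
    def __init__(self, max_failing_clusters=3):
        self.max_failing_clusters = max_failing_clusters
        self.failing = set()
        self.tripped = False

    def record_failure(self, cluster_id):
        """Track which clusters look unhealthy; trip on a fleet-wide pattern."""
        self.failing.add(cluster_id)
        if len(self.failing) > self.max_failing_clusters:
            self.tripped = True  # likely a systemic problem, not a bad node

    def may_reboot(self, cluster_id):
        """Once tripped, refuse automatic reboots and defer to an operator."""
        return not self.tripped

breaker = RebootCircuitBreaker(max_failing_clusters=2)
for cluster in ["c1", "c2", "c3"]:
    breaker.record_failure(cluster)

assert breaker.tripped
assert not breaker.may_reboot("c4")  # alert an operator instead of rebooting
```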

In conclusion

For those of you who trust us with your data, our goal is to make it so you can sleep easy at night. We lost some of your trust with this incident, and we will be working hard to restore it. If you have questions about any details of the incident or the steps we’re taking to improve, please feel free to contact me directly at [email protected]