Daniel Farina

Former Principal Engineer at Citus Data & Microsoft. Managed thousands of Postgres databases in the cloud at Citus and Heroku. Creator & maintainer of WAL-E. History lover.

How we implement Disaster Recovery and High Availability with Postgres on Citus Cloud

Written by Daniel Farina | March 23, 2017

AWS is the leader when it comes to the cloud, and for good reason: it is well ahead in both the quality and the breadth of the services it offers.

However, when a service runs at the scale of AWS, it is natural to expect some failures to occur. According to AWS, EBS is designed for 99.999% availability.
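To put that figure in perspective, here is a minimal sketch that converts an availability percentage into allowed downtime. It assumes the target is measured over a 365-day year, which is our assumption for illustration and not something the AWS figure specifies; the function name is likewise hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def downtime_minutes_per_year(availability: float) -> float:
    """Return the minutes of downtime per year implied by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# 99.999% availability leaves only a few minutes of downtime per year.
print(f"{downtime_minutes_per_year(0.99999):.2f} minutes per year")
```

A design target like this describes the service as a whole; it says nothing about how long any single incident may last.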

The annual failure rate (AFR) is 0.1% to 0.2%, where failure means a complete or partial failure of a volume. For example, if you had 1,000 EBS volumes, you should expect 1 or 2 of them to fail each year. In our experience, partial failure is significantly more common than complete loss. Even so, a partial loss can take a long time to resolve and can still be debilitating to a business.
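The arithmetic behind that estimate can be sketched in a few lines of Python. The function names are hypothetical, and the second function assumes volumes fail independently of one another, which is a simplification on our part; AWS publishes only the AFR range.

```python
def expected_failures(volumes: int, afr: float) -> float:
    """Expected number of volumes that suffer a failure in one year."""
    return volumes * afr


def prob_at_least_one(volumes: int, afr: float) -> float:
    """Chance that at least one volume fails in a year.

    Assumes failures are independent, which real incidents often violate.
    """
    return 1.0 - (1.0 - afr) ** volumes


# Evaluate both ends of the published AFR range for a fleet of 1,000 volumes.
for afr in (0.001, 0.002):
    print(
        f"AFR {afr:.1%}: expect {expected_failures(1000, afr):.0f} failures, "
        f"P(at least one) = {prob_at_least_one(1000, afr):.0%}"
    )
```

Under these assumptions, a fleet of 1,000 volumes is more likely than not to see at least one failure in any given year, so the question is less whether a failure will happen and more how it is handled.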

Over the years, some AWS failures have made news headlines because of the havoc they caused for both companies and their users. These incidents put a spotlight on AWS’ imperfections.
