AWS: hedging against AZ failure

After a lot of reading about AWS and the failures that have happened over the years, I’ve come to the conclusion that to be truly resilient against a complete AZ failure, you need enough capacity running in both AZs to handle the entire load of your application.

Take, for example, the standard dual-AZ design for a web service. You could build an ASG that is split across the AZs: half of your servers in one AZ, half in the other, and all of them behind an ELB. One day, AZ1 comes crashing down. “No problem,” you say, “let me just spin up a few more web servers in AZ2.”

Uh, yeah. You and about 100,000 other organizations are all trying to do the same thing. Do you think AWS has the kind of excess capacity to handle that onslaught?

Hmmm, maybe.

If you’re running in us-east-1, AWS has 5 AZs. Assuming all AZs have the same capacity, and assuming load is spread evenly enough (AWS randomizes the mapping of AZ names to physical AZs per account for exactly this reason), the loss of 1 AZ means everything running there has to be absorbed by the other 4 AZs. Each AZ would therefore have to be running at no more than 80% capacity so that it can absorb its 1/4 share of a failed AZ’s load.
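
The constraint behind that number is simple enough to state directly. A minimal sketch (assuming n equal-capacity AZs, one failure, and the failed load spread evenly across the survivors):

```python
# Each of n equal-capacity AZs runs at utilization u. If one AZ fails and
# its load is spread evenly across the n-1 survivors, each survivor takes
# on an extra u/(n-1):
#   u + u/(n-1) <= 1  =>  u <= (n-1)/n
n = 5  # e.g., us-east-1 as described above
max_utilization = (n - 1) / n
print(f"max safe utilization per AZ: {max_utilization:.0%}")  # 80%
```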

It is probably safe to assume that not everything will migrate to another AZ, for a variety of reasons:

  • customer implementations that only take advantage of a single AZ for some or all of their environment
  • test instances, forgotten instances (you know you have a few!)
  • sysadmin inattentiveness
  • services that are already fully 2x provisioned, and so already have 100% capacity running in the second AZ

So let’s say 1/3 of the running instances fall into this category (this is a wild-ass guess). Now each AZ can run at about 85% of capacity and still absorb its 1/4 share of the load coming from the failed AZ. Does AWS have that kind of capacity?
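
Plugging that guess into the same constraint, with f as the fraction of the failed AZ’s load that actually migrates (a sketch, using the 2/3 wild-ass guess):

```python
# Only a fraction f of the failed AZ's load migrates to the survivors:
#   u + f*u/(n-1) <= 1  =>  u <= (n-1)/(n-1+f)
n, f = 5, 2 / 3
max_utilization = (n - 1) / (n - 1 + f)
print(f"max safe utilization per AZ: {max_utilization:.1%}")  # ~85.7%
```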

They might. I don’t think that 15% headroom would be unreasonable, and AWS could reclaim some capacity by shutting down spot instances and selling that capacity at on-demand prices. So maybe there’s enough room for everybody on the ark.

And what if two AZs go down? Or if you’re in a region with fewer AZs? Using the assumption above that only 2/3 of the load needs to migrate, here is how much headroom AWS would need to reserve in each AZ to handle a given number of simultaneous AZ failures.
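
A short sketch that computes those headroom numbers from the model above (k simultaneous failures out of n AZs, with 2/3 of the failed load migrating):

```python
def headroom(n: int, k: int, f: float = 2 / 3) -> float:
    """Headroom each surviving AZ must reserve when k of n AZs fail
    and a fraction f of the failed load migrates to the survivors."""
    # u + f*k*u/(n-k) <= 1  =>  u <= (n-k)/(n-k+f*k); headroom = 1 - u
    return f * k / (n - k + f * k)

for n in (2, 3, 4, 5, 6):
    for k in (1, 2):
        if k < n:
            print(f"{n} AZs, {k} failed: {headroom(n, k):.0%} headroom")
```

With 2 AZs and 1 failure, that works out to 40% headroom; with 5 AZs and 1 failure, about 14%; with 5 AZs and 2 failures, about 31%.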

Note that the model is a little simplistic for the multi-AZ failure scenarios: it doesn’t account for environments that would simply be wiped off the map (dual-AZ designs where both of their AZs have gone down).

I kind of doubt that AWS actually has 40% headroom in its ca-central-1 availability zones (a two-AZ region under this model). That wouldn’t be very cost-effective. Your odds get better the more AZs there are in a region, but with multiple AZs lost, I’m not sure any of the regions could handle the workload from the failed AZs.

So let’s assume you pick a region like us-east-1 with 5 AZs. You should be in pretty good shape. But wait, there’s one more problem to worry about…

What happens when the AWS UI or the core AWS API servers can’t respond to the flood of requests? Can you even get your requests processed to spin up your instances? This has been a common side effect of AWS failures — the platform itself starts to wilt. When that happens, you won’t be able to manage your infrastructure, either manually or via AWS API calls. One would hope that AWS-internal mechanisms like autoscaling groups would still function, but who knows?
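
There’s no real defense against an overloaded control plane, but if you expect to be fighting a thundering herd, you can at least configure your API clients to back off and retry aggressively. A minimal boto3 sketch (the retry settings here are assumptions; tune them for your workload):

```python
import boto3
from botocore.config import Config

# Retry hard, with adaptive client-side rate limiting, so throttled API
# calls have a chance of eventually getting through during an event.
config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
ec2 = boto3.client("ec2", region_name="us-east-1", config=config)

# Any call made through this client is subject to the retry policy:
response = ec2.describe_instances(MaxResults=5)
```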

All this is to say that you might want to consider carefully how you architect your environment.

  • If you use a pilot light design, where you have the bare minimum running in AZ2 (maybe just enough to keep the DB synchronized), you might not be able to spin up the rest of your AZ2 environment if AZ1 goes down completely.
  • If you use a low-capacity standby, where you have smaller versions of your services running in AZ2, you could have trouble bringing up enough resources to go from low capacity to full capacity.
  • If you use an active/active configuration, where you split your servers across the two AZs, you may have only half capacity in each AZ. If you lose an AZ, you might not be able to get the surviving AZ up to full capacity.

Running both AZs at full capacity obviously comes at a cost. But if you’re using a dual-AZ design, the main (only?) reason for doing so is to hedge against losing an entire AZ. You’ve already committed to extra spend to minimize the impact of an AZ failure. So if that AZ does fail, do you want your application running at full capacity, or at a greatly reduced capacity (which for many applications is almost the same as being down completely)?
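
In concrete terms, that means sizing the fleet at twice what the application needs, so that either AZ alone can carry the full load. A hedged boto3 sketch (the group name, launch template, subnets, and fleet size are all placeholders):

```python
import boto3

FULL_LOAD_INSTANCES = 10  # instances needed to serve 100% of traffic (placeholder)

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Run 2x the fleet, balanced across two AZs, so that losing either AZ
# still leaves a full-capacity fleet behind the load balancer.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-dual-az-full-capacity",  # placeholder name
    LaunchTemplate={"LaunchTemplateName": "web-server", "Version": "$Latest"},
    MinSize=2 * FULL_LOAD_INSTANCES,
    MaxSize=2 * FULL_LOAD_INSTANCES,
    DesiredCapacity=2 * FULL_LOAD_INSTANCES,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",  # one subnet per AZ
)
```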

Also, if you look at your monthly invoices, you may find that your primary cost is bandwidth, in which case the additional EC2 cycles required to maintain full capacity in the second AZ don’t really matter in the grand scheme of things.
