There was another regional disruption at a big provider this month. A subset of services in one region degraded for a while, recovered, and got the usual write-up. I wasn't badly hit this time, but it sent me back to a question I should ask more often than I do: what, exactly, do I think happens when a region goes away, and is that belief based on anything I've actually tested?
Because here's the thing about cloud redundancy. There's the redundancy you designed, and there's the redundancy you assumed, and they feel identical right up until an outage tells you which one you've got. Most of mine, when I looked honestly, was the assumed kind.
the comforting lies we tell ourselves
I've caught myself believing all of these at one time or another, and every one of them is a trap.
"It's multi-AZ, so it's fine." Multiple availability zones protect you against one zone failing. They do nothing for a regional control-plane problem, which is precisely the failure mode that makes the news. If your whole stack is in one region across three zones, a regional event takes all three. You have redundancy against the failure that rarely happens and none against the one that just did.
"It's managed, so they handle failover." Sometimes. Read the actual SLA and the actual failover semantics, because "managed" covers everything from genuine cross-region replication with automatic promotion, to a single instance with a nice dashboard. A managed database in one region is still a single point of failure wearing a smart suit.
"We have backups." Backups are not availability. A backup tells you that, given a few hours and a steady hand, you can rebuild. It says nothing about staying up during the event. Confusing your recovery story with your availability story is how you end up explaining to people why "we have backups" didn't keep the site online.
"We tested failover once." When? On which version of the stack? Failover that worked in a calm rehearsal eighteen months ago, against an architecture you've since changed twice, is a hope, not a control.
what's actually worth doing
Not "go multi-region for everything". That's expensive, it's complex, and for a lot of services an honest cost-benefit analysis says a few hours of regional downtime a year is cheaper than the engineering to avoid it. Multi-region is a real commitment with a real ongoing tax, and pretending otherwise is how you end up with a half-built failover that's worse than none.
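To make that cost-benefit point concrete, here's a rough back-of-envelope sketch. Every number and function name in it is a hypothetical placeholder, not a measurement from any real service; the point is only that the comparison can be written down and argued with.

```python
# All figures below are hypothetical placeholders, not measurements.

def annual_downtime_cost(hours_down_per_year: float, cost_per_hour: float) -> float:
    """Expected yearly cost of riding out regional outages in a single region."""
    return hours_down_per_year * cost_per_hour


def annual_multi_region_cost(extra_infra_per_year: float,
                             engineer_hours_per_year: float,
                             engineer_hourly_rate: float) -> float:
    """Yearly cost of running and maintaining a second region."""
    return extra_infra_per_year + engineer_hours_per_year * engineer_hourly_rate


def multi_region_pays_off(downtime_cost: float, resilience_cost: float) -> bool:
    """True only when expected downtime costs more than the resilience does."""
    return downtime_cost > resilience_cost


# Hypothetical internal tool: 4 hours down a year, costing 500 per hour.
downtime = annual_downtime_cost(4, 500)                  # 4 * 500 = 2000
resilience = annual_multi_region_cost(12_000, 80, 100)   # 12000 + 8000 = 20000
print(multi_region_pays_off(downtime, resilience))       # False
```

With these made-up inputs the second region costs ten times what the downtime does. Change the cost per hour to something customer-facing and the answer can flip, which is the reason to do the sum per service rather than once.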
What's worth doing is being deliberate, per service, about a small set of questions:
- What's the blast radius? If this region vanishes for two hours, what stops working, and who notices? Write it down. The act of writing it down is where most of the surprises surface.
- What's the actual recovery story? Not the aspirational one. The one you could execute right now, with the access you currently have and the runbook that currently exists, at 3am, half asleep. If that story has gaps, those gaps are your real exposure.
- Have you tested it since the last big change? A failover plan is code, and untested code is broken code. If you've never failed over on purpose, assume you can't.
- Is the cost of being down honestly worth the cost of being resilient? For some services, multi-region is essential. For my personal blog, it absolutely is not, and pretending otherwise would be cosplay. Match the engineering to the consequences.
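One way to keep those four answers honest is to store them as data next to the service instead of in your head. The following is a minimal sketch, assuming Python 3.9 or later; the `ServiceReview` record, its field names, and the example dates are all hypothetical, invented here for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ServiceReview:
    """One service's answers to the four questions (hypothetical structure)."""
    name: str
    blast_radius: str                                      # what stops working, who notices
    runbook_gaps: list[str] = field(default_factory=list)  # steps you could not run at 3am
    last_failover_test: Optional[date] = None
    last_big_change: Optional[date] = None
    resilience_worth_cost: bool = False


def failover_is_stale(review: ServiceReview) -> bool:
    """Never tested, or tested before the last big change, counts as untested."""
    if review.last_failover_test is None:
        return True
    if review.last_big_change is None:
        return False
    return review.last_failover_test < review.last_big_change


def exposures(review: ServiceReview) -> list[str]:
    """List the gaps that this review admits to."""
    findings = []
    if not review.blast_radius.strip():
        findings.append("blast radius not written down")
    findings.extend(f"runbook gap: {gap}" for gap in review.runbook_gaps)
    if failover_is_stale(review):
        findings.append("failover never tested, or not since the last big change")
    return findings


# Hypothetical example: tested once, then the architecture changed.
blog = ServiceReview(
    name="blog",
    blast_radius="site is unreachable; readers notice",
    last_failover_test=date(2023, 1, 10),
    last_big_change=date(2024, 6, 1),
)
for finding in exposures(blog):
    print(f"{blog.name}: {finding}")
```

The design choice worth noting is that a missing test date is treated the same as an old one. That mirrors the rule in the list: if you've never failed over on purpose, assume you can't.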
the part I keep relearning
Every one of these regional events is the same lesson in a slightly new hat, and the lesson is that redundancy you haven't tested is just a feeling. The provider will keep having bad days; that's baked into the model, and choosing the cloud means accepting it. What you actually control is whether your design has a real, rehearsed answer for those days or merely an assumed one.
So this week I'm doing the unglamorous thing. I'm taking the two services of mine that genuinely matter and writing down, honestly, what happens to each when its region degrades, and where the recovery plan is hand-wavy. I expect to find that one of them is in much better shape than I feared and the other is held together with assumptions and good luck. That's usually how it goes, and finding out on a quiet Friday is enormously preferable to finding out during the next outage.
The cloud doesn't owe you availability you didn't design and test for. It just rents you the building blocks. The redundancy is still your job, and an outage is simply the audit you didn't schedule.