Another week, another cloud provider having a bad afternoon and taking a chunk of the internet down with it. The details differ each time: a region degrades, a control plane gets wedged, an internal dependency turns out to be a single point of failure that nobody drew on the architecture diagram. The shape is always the same: lots of people who paid for resilience discovering they bought it on paper only.
I'm not writing this to dunk on whichever provider it was this time. I run things on the same platforms and I've been bitten the same way. The point I want to make is duller and more useful: most "multi-region" setups are a comforting story we tell ourselves, not a tested fact.
the failover you never ran is the failover that doesn't work
Here's the uncomfortable test. When did you last actually fail over? Not in a design review. Not in a diagram. In production, on purpose, with traffic, watching what breaks.
If the honest answer is "never", then you don't have a second region. You have a second region's worth of bill and a hopeful sentence in a runbook.
The things that quietly defeat regional failover, in my experience:
- A control or auth dependency that only lives in one place, so when that place hurts, everything hurts regardless of where your compute is.
- DNS TTLs measured in hours, so clients keep cheerfully connecting to the dead region long after you've moved on.
- Data that can't actually serve from the standby because replication lag was fine until the exact moment it mattered.
- Capacity in the standby region that was never sized for 100% of traffic, only the comfortable 30% it normally takes.
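To make that list a bit more concrete, here is a rough sketch of the four checks as a preflight function. Everything in it is hypothetical: the `StandbyRegion` fields, the `failover_blockers` name, and the thresholds (a 300 second TTL ceiling, 5 seconds of tolerated lag) are my own placeholders, not numbers from any provider. Yours will differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StandbyRegion:
    """Hypothetical snapshot of a standby region; the fields are illustrative."""
    auth_is_regional: bool          # can auth/control requests be served locally?
    dns_ttl_seconds: int            # TTL on the records clients resolve
    replication_lag_seconds: float  # how far the standby data trails the primary
    capacity_rps: float             # requests per second the standby can absorb


def failover_blockers(
    standby: StandbyRegion,
    peak_total_rps: float,
    max_ttl_seconds: int = 300,    # assumed threshold, tune for your clients
    max_lag_seconds: float = 5.0,  # assumed threshold, tune for your data
) -> List[str]:
    """Return the reasons a failover would likely go badly (empty if none found)."""
    problems = []
    if not standby.auth_is_regional:
        problems.append("auth/control dependency lives outside the standby region")
    if standby.dns_ttl_seconds > max_ttl_seconds:
        problems.append("DNS TTL is too long for clients to move quickly")
    if standby.replication_lag_seconds > max_lag_seconds:
        problems.append("replication lag is too high to serve from the standby")
    if standby.capacity_rps < peak_total_rps:
        problems.append("standby is not sized for 100% of peak traffic")
    return problems


if __name__ == "__main__":
    standby = StandbyRegion(
        auth_is_regional=False,
        dns_ttl_seconds=3600,
        replication_lag_seconds=1.0,
        capacity_rps=300.0,
    )
    for problem in failover_blockers(standby, peak_total_rps=1000.0):
        print("BLOCKER:", problem)
```

A check like this only tells you what the numbers say today. It is no substitute for actually failing over, but it is a cheap way to find out you would fail before you try.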
what I'd actually do
Stop treating the provider as infallible, and stop treating multi-region as a checkbox. Pick the failures you genuinely care about and rehearse them. A scheduled game day where you drain a region on a Tuesday afternoon will teach you more than any vendor reliability whitepaper.
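A game day doesn't have to be all or nothing. One way to run a drain is in steps, with an automatic rollback if errors climb. The sketch below assumes you have some way to set a region's traffic weight and to read an error rate; `set_weight` and `error_rate` are hypothetical callbacks you would wire up to your own load balancer and metrics, and the 1% abort threshold is an arbitrary placeholder.

```python
from typing import Callable, List, Tuple


def drain_plan(steps: int) -> List[int]:
    """Percent of traffic left in the drained region after each step, ending at 0."""
    return [round(100 * (steps - i) / steps) for i in range(1, steps + 1)]


def run_game_day(
    steps: int,
    set_weight: Callable[[int], None],  # hypothetical: push a weight to your LB
    error_rate: Callable[[], float],    # hypothetical: read current error rate
    max_error_rate: float = 0.01,       # assumed abort threshold
) -> Tuple[bool, List[int]]:
    """Drain a region step by step; restore full weight if errors exceed the limit.

    Returns (completed, weights_applied).
    """
    applied: List[int] = []
    for pct in drain_plan(steps):
        set_weight(pct)
        applied.append(pct)
        if error_rate() > max_error_rate:
            # Abort: send traffic back to the region we were draining.
            set_weight(100)
            applied.append(100)
            return False, applied
    return True, applied


if __name__ == "__main__":
    history: List[int] = []
    ok, applied = run_game_day(4, history.append, lambda: 0.0)
    print(ok, applied)
```

In a real run you would wait between steps long enough for DNS caches and connection pools to catch up before reading the error rate. The sketch leaves that out to stay short.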
And be honest about the trade. Real geographic redundancy is expensive, slow to build, and a permanent tax on every change you ship afterwards. For a lot of systems the right answer is to accept that a few hours of provider downtime a year is cheaper than the engineering to survive it, and to say so out loud rather than pretending otherwise.
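The arithmetic behind that trade is simple enough to write down. This is a deliberately crude sketch: the function names and every figure in the usage example are made up for illustration, and it assumes, generously, that redundancy eliminates all of the provider downtime. It also ignores the ongoing drag on every change you ship, which is real but harder to price.

```python
def expected_outage_cost(hours_per_year: float, cost_per_hour: float) -> float:
    """Yearly cost of simply accepting provider downtime."""
    return hours_per_year * cost_per_hour


def redundancy_pays_off(
    hours_per_year: float,
    cost_per_hour: float,
    build_cost: float,
    yearly_run_cost: float,
    horizon_years: int = 3,  # assumed planning horizon
) -> bool:
    """True if avoided downtime over the horizon exceeds the cost of redundancy.

    Assumes redundancy removes all of the downtime, which is optimistic.
    """
    avoided = expected_outage_cost(hours_per_year, cost_per_hour) * horizon_years
    spent = build_cost + yearly_run_cost * horizon_years
    return avoided > spent


if __name__ == "__main__":
    # Illustrative numbers only: 4 hours of downtime a year at 10,000 per hour,
    # against 300,000 to build and 100,000 a year to run.
    print(redundancy_pays_off(4, 10_000, 300_000, 100_000))
```

With those made-up numbers the avoided cost is 120,000 over three years against 600,000 spent, so the honest answer is to accept the downtime. Raise the hourly cost of an outage far enough and the answer flips.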
The teams I trust most aren't the ones with the prettiest diagrams. They're the ones who can tell me, without flinching, exactly what happens when their primary region goes dark, because they've watched it happen on purpose. Everyone else is just waiting to find out live, usually on the same afternoon as everybody else.