another region falls over, and the lesson we keep not learning

There's been another cloud outage in the last couple of weeks, the sort where a region has a bad few hours and a long tail of people discover their "highly available" setup was highly available right up until it wasn't. I won't pretend to know the precise root cause from the outside; the post-incident write-up will say it properly in due course. But the shape is always the same. A shared dependency in one location degrades, and everything that quietly assumed that location would always be there degrades with it.

The reaction I find tiresome is the one that goes "this is why the cloud is a mistake". It isn't. The provider's own infrastructure is almost certainly run by people more disciplined about failure than most of us manage in our own racks. The mistake is upstream of that, in the story we told ourselves when we migrated.

The story was: we moved to the cloud so we don't have to worry about hardware failing. That was never the deal. The deal was that you stop worrying about this specific disk and start worrying about this specific region. The failure domain moved and got bigger. You traded a problem you could see and touch for one you can only design around, and designing around it is work you still have to do. A single region is a single point of failure with very good uptime numbers and excellent marketing. Those are not the same as no single point of failure.

So what should we actually take from it? Mostly that the boring advice was right.

Know your blast radius. Sit down and ask, honestly, what happens if one availability zone vanishes. Then ask the harder question: what happens if a whole region does. If the answer to the second is "we're down", that might be a perfectly reasonable business decision; plenty of services don't justify multi-region spend. But it should be a decision someone made on purpose, with a number attached, not a surprise you find out about during the incident.
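If it helps to make that exercise concrete, here is a minimal sketch of the question in Python. Everything in it is made up for illustration: the service names, the regions and zones, and the dependency map are hypothetical, and a real inventory would come from whatever you use to track deployments. The point is only that a service which looks multi-region can still go down with a single zone if something it depends on lives there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    region: str
    zone: str


# Hypothetical inventory: service name -> where its instances run.
INVENTORY = {
    "checkout": [Placement("region-a", "a1"), Placement("region-b", "b1")],
    "search": [Placement("region-a", "a1"), Placement("region-a", "a2")],
    "auth": [Placement("region-a", "a1")],
}

# Hypothetical hard dependencies: service -> services it cannot run without.
DEPENDS_ON = {
    "checkout": ["auth"],
    "search": [],
    "auth": [],
}


def blast_radius(inventory, depends_on, lost_region, lost_zone=None):
    """Return the sorted names of services that are down if a zone
    (or, with lost_zone=None, a whole region) vanishes."""

    def survives(placement):
        if placement.region != lost_region:
            return True
        # Inside the affected region, only other zones survive a zone loss.
        return lost_zone is not None and placement.zone != lost_zone

    # Services with no surviving instance are down directly.
    down = {
        name
        for name, placements in inventory.items()
        if not any(survives(p) for p in placements)
    }

    # Services whose hard dependencies are down go down with them.
    changed = True
    while changed:
        changed = False
        for name, deps in depends_on.items():
            if name not in down and any(dep in down for dep in deps):
                down.add(name)
                changed = True
    return sorted(down)


print(blast_radius(INVENTORY, DEPENDS_ON, "region-a", "a1"))
print(blast_radius(INVENTORY, DEPENDS_ON, "region-a"))
```

In this toy inventory, checkout runs in two regions and still goes down when zone a1 does, because auth only lives there. That is the quiet assumption worth finding on paper rather than in an incident.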

Test the failure, don't just architect for it. A standby region you have never failed over to is a hypothesis, not a plan. The first time you exercise a failover should not be the time you need it, because that is when you discover that the DNS TTLs are wrong, the replica is three hours behind, and the secondary's IAM permissions were never finished. You learn all of this on a calm Tuesday game day or you learn it at the worst possible moment. There is no third option where you never learn it.
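As a rough illustration of what a game day might write down, the sketch below compares what was measured about a standby against limits agreed in advance. The field names and the default thresholds (a 60-second TTL, five minutes of replica lag) are placeholders I have picked for the example, not recommendations, and in practice the measurements would come from your own DNS, database, and access tooling rather than hard-coded values.

```python
from dataclasses import dataclass, field


@dataclass
class StandbyState:
    """What a game day measured about the standby region (hypothetical fields)."""
    dns_ttl_seconds: int
    replica_lag_seconds: float
    missing_permissions: list = field(default_factory=list)


@dataclass
class Targets:
    """Limits someone agreed on in advance. The defaults are placeholders."""
    max_dns_ttl_seconds: int = 60
    max_replica_lag_seconds: float = 300.0


def failover_findings(state, targets):
    """Return a human-readable finding for every target the standby misses."""
    findings = []
    if state.dns_ttl_seconds > targets.max_dns_ttl_seconds:
        findings.append(
            f"DNS TTL is {state.dns_ttl_seconds}s, target is "
            f"{targets.max_dns_ttl_seconds}s or less"
        )
    if state.replica_lag_seconds > targets.max_replica_lag_seconds:
        findings.append(
            f"replica is {state.replica_lag_seconds:.0f}s behind, target is "
            f"{targets.max_replica_lag_seconds:.0f}s or less"
        )
    for permission in state.missing_permissions:
        findings.append(f"standby is missing permission: {permission}")
    return findings


# Example: a neglected standby, three hours behind with unfinished IAM.
neglected = StandbyState(
    dns_ttl_seconds=3600,
    replica_lag_seconds=3 * 60 * 60,
    missing_permissions=["read-secrets (hypothetical)"],
)
for finding in failover_findings(neglected, Targets()):
    print(finding)
```

An empty list of findings does not prove the failover works; only actually failing over does. What a check like this gives you is a record of the gaps you already know about, so the game day can spend its time on the ones you don't.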

And be honest on the status page. The outages that burn trust aren't the technical failures; those are forgivable and universal. It's the green dashboards sitting smugly over a service that's plainly on fire. Customers will forgive you for being down. They take much longer to forgive being lied to about it.

None of this is new, which is rather the point. We relearn it every time a region has a bad day, write a thoughtful retrospective, and then let the discipline slide until the next one. The cloud didn't abolish operations. It just changed the unit you do operations on, and handed you a bigger bill when you forget.