Another cloud outage, and what we should learn

Another month, another status page glowing amber. There was a wobble in a major provider's infrastructure earlier this month, the sort that takes out a region and lights up Twitter for a few hours before the green ticks creep back in. I won't pretend I read the postmortem with detached professional interest. I read it because a thing I look after was, briefly, one of the things that fell over.

What strikes me, every single time, is how rarely the lesson is the one in the headline. The headline is always the proximate cause: a config push, a failed failover, a network partition that shouldn't have been possible. Useful to know. Not the point.

The point is that we keep building single-region systems and then acting surprised when a single region has a bad day. We pay for the cloud partly so we don't have to think about racks and power and cooling, and somewhere in that bargain a lot of us quietly stopped thinking about availability zones too. One region, one zone if we're honest, and a vague intention to "go multi-region later". Later never arrives, because multi-region is genuinely hard and the outages are genuinely rare, and that maths usually wins right up until the morning it doesn't.

I'm not throwing stones from higher ground here. The thing that fell over for me was in one region because making it span two would have roughly doubled the cost and the complexity for an SLA nobody had actually asked me to meet. That was a reasonable decision when I made it. It was still the reason I spent a chunk of the afternoon watching a load balancer return 503s and feeling helpless, which is a uniquely undignified state for an engineer to be in.

So what should we actually learn? Two things, and neither is "use a different provider".

First, know your blast radius before the outage, not during it. When the region went, I could not immediately tell you which of my services depended on it and which didn't, and that ignorance cost me more time than the failover itself ever would have. A simple dependency diagram, kept honest, is worth more than a lot of clever redundancy.
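If the diagram lives somewhere a script can read it, the blast-radius question becomes a few lines of graph walking. What follows is a minimal sketch rather than a tool: the service names, the region labels and both dictionaries are invented for illustration, and it assumes a dependency is simply "down" whenever its region is.

```python
from collections import deque

# Hypothetical inventory: each service lists what it depends on directly.
DEPENDS_ON = {
    "checkout": ["payments-api", "sessions"],
    "payments-api": ["orders-db"],
    "sessions": ["cache"],
    "status-page": [],
    "orders-db": [],
    "cache": [],
}

# Hypothetical placement: which region each service runs in.
REGION_OF = {
    "checkout": "eu-west",
    "payments-api": "eu-west",
    "sessions": "eu-west",
    "status-page": "us-east",
    "orders-db": "eu-west",
    "cache": "us-east",
}


def blast_radius(region, depends_on, region_of):
    """Return the services hit if `region` fails: those running in it,
    plus anything that depends on them directly or transitively."""
    # Invert the edges so we can walk from a failed service to its dependants.
    dependants = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependants.setdefault(dep, []).append(service)

    # Start from everything hosted in the failed region, then spread outwards.
    affected = {name for name, where in region_of.items() if where == region}
    queue = deque(affected)
    while queue:
        current = queue.popleft()
        for parent in dependants.get(current, []):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return sorted(affected)


if __name__ == "__main__":
    print(blast_radius("us-east", DEPENDS_ON, REGION_OF))
    print(blast_radius("eu-west", DEPENDS_ON, REGION_OF))
```

In this made-up inventory, losing us-east takes checkout down with it, by way of sessions and the cache, even though checkout itself runs in eu-west. That is exactly the kind of thing I would rather learn from a script than from a pager.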

Second, decide on purpose how much downtime you can live with, and write it down. If a few hours a year is fine, say so, and stop apologising for a single-region design that is doing exactly what you asked of it. If it isn't fine, then multi-region stops being a someday nicety and becomes a line item, with a cost you can put in front of whoever signs off the budget.
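Writing it down is easier once "a few hours a year" and "three nines" are expressed in the same units. Here is a small sketch of that arithmetic; the function names are my own, and it assumes a plain 365-day year with downtime measured against the whole calendar rather than business hours.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability_percent):
    """Hours of downtime per year that an availability target permits."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def availability_for(downtime_hours_per_year):
    """Availability percentage implied by a yearly downtime budget."""
    return 100 * (1 - downtime_hours_per_year / HOURS_PER_YEAR)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% allows about {hours:.2f} hours of downtime a year")
    print(f"4 hours a year is about {availability_for(4):.3f}% availability")
```

By this reckoning 99.9% leaves a little under nine hours a year, and 99.99% leaves under one. A single-region design that loses an afternoon once a year may already be meeting the first target, which is precisely why the number wants writing down before anyone argues about the architecture.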

The cloud is not less reliable than the cupboard of servers it replaced. It's more reliable, by a distance. We've just moved the failure from "the building lost power and you knew about it" to "a region you can't see did something you can't control", and that second kind is harder to feel responsible for. The honest work is deciding, in daylight, what you'll do when it happens again. Because it will.