another cloud outage, and the lessons we keep not learning

A city skyline standing in for the abstract cloud

Earlier this week a chunk of the internet had a wobble again. The specifics matter less than the shape, because the shape is always the same: a thing everyone depends on without quite realising they depend on it goes away, and suddenly a dozen services that have nothing to do with each other all fall over at once. This time, depending on which corner of the web you read, it was variously a routing problem, a DNS resolution issue, and a provider having a bad afternoon. I am being deliberately vague on the exact cause because by the time the real root-cause writeup appears, half of what we confidently said on the day will turn out to be wrong.

What I want to talk about isn't this outage. It's the fact that I could have written most of this post in advance.

the pattern

Pull up any large outage from the last few years and the narrative rhymes. A change goes out. It is, in isolation, small and reasonable. It interacts with something else nobody had modelled, usually at a layer of the stack that the person making the change couldn't see. The blast radius is enormous because the failing component sits underneath a thousand things that quietly assumed it would always be there. And the recovery is slow, not because the fix is hard, but because the tooling needed to apply the fix also depended on the broken thing.

That last point is the one that keeps getting me. We build control planes that route through the data plane they're meant to control. We put the runbook on the wiki that's hosted on the cluster that's down. We store the credentials needed to fix the outage behind the SSO that the outage took out.
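One way to catch this before the bad day is to write the dependencies down and walk them. What follows is a minimal sketch, not a real inventory: every name in the map (`runbook-wiki`, `sso`, `prod-cluster` and so on) is a hypothetical stand-in, and the map itself is the hard part, since it has to be filled in honestly by hand.

```python
from collections import deque

# Hypothetical dependency map: each key depends directly on the listed names.
DEPENDS_ON = {
    "runbook-wiki": ["prod-cluster"],
    "deploy-tool": ["sso", "prod-cluster"],
    "break-glass-credentials": ["sso"],
    "sso": ["prod-cluster"],
    "offline-runbook-copy": [],
    "prod-cluster": [],
}

# Hypothetical list of the things you would reach for during an incident.
RECOVERY_TOOLS = [
    "runbook-wiki",
    "deploy-tool",
    "break-glass-credentials",
    "offline-runbook-copy",
]


def transitive_deps(name, graph):
    """Return everything `name` depends on, directly or indirectly."""
    seen = set()
    queue = deque(graph.get(name, []))
    while queue:
        dep = queue.popleft()
        if dep not in seen:
            seen.add(dep)
            queue.extend(graph.get(dep, []))
    return seen


def unusable_during(failed, recovery_tools, graph):
    """List the recovery tools that are lost when `failed` is down."""
    return sorted(
        tool
        for tool in recovery_tools
        if tool == failed or failed in transitive_deps(tool, graph)
    )


print(unusable_during("prod-cluster", RECOVERY_TOOLS, DEPENDS_ON))
```

In this made-up map, losing `prod-cluster` takes out everything except the offline runbook copy, which is exactly the shape described above.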

A street-level view of a busy city, the kind of complexity we pretend we've abstracted away

the post-mortems all say the same things

I've now read enough public incident reviews to notice they converge on a small set of action items. Reduce the blast radius. Stagger the rollout. Add a circuit breaker. Make the recovery path independent of the thing that failed. Practise the failure before it happens for real.

These are all correct. They are also, mostly, the same items that appeared in the previous post-mortem, and the one before that. The reason they don't stick is not that engineers are lazy or that companies don't care. It's that every one of these items is a tax paid continuously, against a benefit that only shows up on the worst day of the year. Staggered rollouts are slower. Circuit breakers add complexity and the occasional false trip. An independent recovery path is, by definition, a second system you have to build and maintain and that does nothing useful 364 days out of 365.

So they get deprioritised. Not deliberately, just by the steady gravity of the backlog. The feature ships, the resilience work slides, and everyone is genuinely surprised when the same class of failure recurs eighteen months later with a different logo on the status page.
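To make one of those action items concrete, here is roughly what a circuit breaker looks like at its smallest. This is a sketch under my own assumptions, not any particular library's API: the class name, the `max_failures` and `reset_after` parameters and their defaults are all illustrative, and it is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying it."""


class CircuitBreaker:
    """Illustrative breaker: closed, open, then a single half-open trial."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast instead of hitting the dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens it.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the count.
        self.failures = 0
        self.opened_at = None
        return result
```

Even at this size you can see the tax: extra state to reason about, two thresholds to tune, and a dependency that gets refused for a while because of a few unlucky errors.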

what I actually try to do

I'm not going to pretend I've solved this. But there are a few habits that have earned their place on the systems I'm responsible for.

The first is dependency honesty. When someone says a service is "highly available", I ask what it depends on, and then what those things depend on, until we hit something boring like a single DNS zone or a single certificate authority or one bloke's understanding of the network. A service is, at best, only as available as its least available hard dependency, and that dependency is almost never the one on the architecture diagram.
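The arithmetic behind that is short. If every component in a chain has to be up for the service to work, and failures are independent (a generous assumption), the combined availability is the product of the individual figures, so it can never beat the weakest one. The component names and numbers below are invented for illustration, not measurements from any real system.

```python
from math import prod

# Hypothetical availability figures (fraction of time up) for hard dependencies.
AVAILABILITY = {
    "app": 0.9999,
    "load-balancer": 0.9999,
    "database": 0.9995,
    "dns-zone": 0.999,
    "certificate-authority": 0.9999,
}


def serial_availability(components):
    """Availability of a chain where every component must be up.

    Assumes failures are independent, which real outages often violate.
    """
    return prod(components.values())


weakest = min(AVAILABILITY, key=AVAILABILITY.get)
combined = serial_availability(AVAILABILITY)

print(f"weakest link: {weakest} at {AVAILABILITY[weakest]:.4%}")
print(f"combined:     {combined:.4%}")
```

With these made-up figures the combined number lands below the DNS zone's own availability, even though nothing else in the list looks alarming on its own.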

Traffic and infrastructure tangled together, much like our dependency graphs

The second is rehearsing the recovery, not the failure. Anyone can kill a node. The interesting question is whether you can bring the thing back when half your usual tooling is also down. I try to make at least one game-day per year specifically about a degraded control plane: you may not use the dashboard, the wiki, or the chat tool, because in the real thing those will be casualties too. The first time you run this it is genuinely humbling.

The third is the most boring and the most effective: keep the blast radius small enough that an outage is an inconvenience rather than a headline. This is unglamorous. It means more, smaller failure domains, more boundaries, more duplication. It is the opposite of the consolidation that every cost review pushes you towards. And it is the single biggest lever on whether a bad afternoon becomes a bad week.

the uncomfortable bit

There's a temptation, after a big provider stumbles, to conclude we should all run our own infrastructure again. I don't buy it. The teams operating these platforms are, on average, far better at this than the rest of us, and our home-grown alternatives would fall over more often and recover more slowly. The lesson isn't "centralisation bad". It's that concentration of dependency is a real risk that we keep pricing at zero because the bill only arrives occasionally.

So when the writeup for this week's incident lands, by all means read it. But read it as a mirror, not as gossip about someone else's bad day. The honest question is not "how did they let that happen", it's "which of my systems has exactly this shape, and have I done any of the boring things that would make it survivable". For most of us, on most days, the honest answer is "not yet". That, more than any individual provider's mistake, is the thing worth learning.