another cloud outage, and what we should learn

Earlier this month another big provider had a bad few hours in one of its regions, the kind where the status page stays a calm reassuring green for the first forty minutes while everyone's pager is on fire. The details don't matter much because they're always the same: a control-plane component fell over, things that depended on it queued up, retries made it worse, and eventually a graph went the wrong way for long enough that someone wrote a post-mortem about it. I watched it from the cheap seats this time, which is a nice change.

What does matter is how predictable the whole shape of it was, and how little most of us have actually changed in response to the last one. We say "design for failure" in interviews and on conference slides, and then we wire a hard synchronous dependency on someone else's regional service straight through the middle of our request path and call it done.

the dependency you forgot you had

The bit that catches people is rarely the obvious thing. Everybody knows their database can go away. Fewer people have sat down and drawn the full graph of what a single user request actually touches.

Here's a small exercise that's embarrassed me more than once. Take one important endpoint and follow a request all the way through. Not the diagram on the wiki, the real thing.

GET /checkout
  -> auth (token introspection -> external IdP)
  -> feature flags (network call, no local cache)
  -> pricing (internal service -> external FX rate API)
  -> inventory (DB)
  -> session store (managed cache)
  -> metrics push (blocking flush on shutdown, oops)

Half of those will have a timeout you've never tuned, and at least one will be a synchronous call to something outside your control that nobody listed when asked "what are our dependencies". The feature-flag call is my personal favourite. It feels free. It is not free. When the flag service is slow, every request that checks a flag is slow, and you discover that the SDK's default timeout is something cheerful like ten seconds.
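
Here is one way to put a cap on that flag call, as a rough Python sketch. Everything named here is an assumption for illustration: `fetch_flag_remote` is a made-up stand-in for whatever your flag SDK does over the network, not a real API, and the 100 ms budget is a number to tune, not a recommendation.

```python
import concurrent.futures
import time

# Hypothetical stand-in for a flag SDK's network call; not a real library API.
def fetch_flag_remote(name):
    time.sleep(0.5)  # simulate a slow flag service
    return True

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)
_last_known = {}  # flag name -> last value we successfully fetched

def get_flag(name, default, timeout_s=0.1, fetch=fetch_flag_remote):
    """Return the flag value, waiting at most timeout_s for the remote call.

    On timeout or error, fall back to the last known value, then to default.
    """
    future = _executor.submit(fetch, name)
    try:
        value = future.result(timeout=timeout_s)
    except Exception:  # covers the timeout and any failure inside fetch
        return _last_known.get(name, default)
    _last_known[name] = value
    return value
```

The caller stops waiting after the budget, but the slow call itself carries on in its worker thread, so this bounds request latency rather than load on the flag service. If the SDK exposes its own timeout setting, lowering that is the better first move.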

timeouts, retries, and the thundering herd

The outage gets genuinely interesting when the recovery makes it worse. The provider comes back, every client that was failing reconnects at once, and the freshly recovered service falls straight back over under the stampede. We did this to ourselves for years before "exponential backoff with jitter" became common wisdom, and plenty of code still doesn't do it.

A few things that are unglamorous and actually help:

Set a timeout on every network call, and set it lower than you think. A call that takes three seconds is usually a call that has already failed; you're just paying to find out.
Cap retries and add jitter. Three attempts with randomised backoff, not "retry until success", which is a denial-of-service tool you've pointed at your own supplier.
Make the call cancellable. A request the user abandoned ten seconds ago should not still be holding a connection open behind the scenes.
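
A minimal sketch of the capped, jittered retry from the list above, in Python. The function name, the defaults, and the choice of "full jitter" (sleep a random amount between zero and the current backoff cap) are my assumptions, not the only reasonable ones. The overall deadline is a crude stand-in for cancellation: it stops the loop from sleeping past the point anyone still cares.

```python
import random
import time

def call_with_retries(op, attempts=3, base_delay_s=0.1, max_delay_s=2.0,
                      deadline_s=5.0, retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep, rng=random.random):
    """Call op() up to `attempts` times with exponential backoff and jitter."""
    start = time.monotonic()
    for attempt in range(attempts):
        try:
            return op()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # The backoff cap doubles each attempt, bounded by max_delay_s.
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            delay = rng() * cap  # full jitter: anywhere in [0, cap)
            if time.monotonic() - start + delay > deadline_s:
                raise  # sleeping would overrun the overall deadline
            sleep(delay)
```

The `op` you pass in still needs its own per-call timeout; this wrapper only decides whether and when to try again. Errors outside `retry_on` propagate immediately, which is deliberate: retrying a validation failure three times just fails three times.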

None of this is new. Michael Nygard wrote most of it down in Release It! back in 2007, circuit breakers included. We just don't wire it in until after the first time it hurts.
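
For anyone who hasn't met one, a circuit breaker fits in a few dozen lines. This is a toy sketch in Python, not Nygard's code or any particular library's: the class name, thresholds, and the single-trial "half-open" behaviour are my assumptions, and it isn't thread-safe.

```python
import time

class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is refused outright."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0       # consecutive failures seen so far
        self._opened_at = None   # None means the circuit is closed

    def call(self, op):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-off has elapsed: half-open, so let this one call through.
        try:
            result = op()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures in a row, (re)opens it.
            if (self._opened_at is not None
                    or self._failures >= self.failure_threshold):
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The point is the fail-fast branch: while the circuit is open, callers get an immediate error they can degrade around, and the struggling dependency gets no traffic from you at all.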

The other half of the retry problem is idempotency, which nobody wants to think about because it's tedious. If your retry might fire after the original request actually succeeded but the response got lost, you need the server to recognise the duplicate and not charge the card twice. An idempotency key on the write path is dull plumbing right up until the afternoon it saves you from refunding a few hundred customers by hand. I've been on both sides of that and the side without the key is much worse.
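
The server side of that, reduced to a toy in Python. `PaymentService` and its fields are invented for illustration, and an in-memory dict is an assumption that would not survive a restart: a real service would store the key durably, alongside the charge, with the check and the write made atomic (a unique constraint is the usual trick).

```python
import uuid

class PaymentService:
    """Toy in-memory server that deduplicates writes by idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> receipt already returned
        self.charges = []   # every charge actually made

    def charge(self, idempotency_key, customer_id, amount_pence):
        # A repeated key means a retry: replay the stored receipt, charge nothing.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((customer_id, amount_pence))
        receipt = {"charge_id": len(self.charges), "amount_pence": amount_pence}
        self._results[idempotency_key] = receipt
        return receipt

service = PaymentService()
key = str(uuid.uuid4())                    # client makes one key per logical payment
first = service.charge(key, "cust_1", 1999)
retry = service.charge(key, "cust_1", 1999)  # response was "lost", so we retry
```

Here `retry` is the same receipt as `first` and `service.charges` holds a single entry. The client's half of the bargain is to reuse the key across retries of the same payment and never across different ones.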

"multi-region" is a verb, not an adjective

The other reflex after an outage is to declare that we'll go multi-region, as though it were a checkbox. Then you look at the bill, and the data-consistency story, and the fact that your "second region" shares a control plane or an account-level quota or a DNS zone with the first, and the whole thing turns out to be a single point of failure wearing two coats.

If you actually want regional independence you have to test it. Pull the plug, on purpose, on a schedule, when people are awake and watching. The teams who get through outage days without drama are the ones who fail over so often it's boring. The rest of us write aspirational architecture diagrams and hope.

the honest version

Here is the part nobody puts on the slide. For most businesses, most of the time, the correct response to "the cloud might have a bad afternoon once or twice a year" is not a heroic active-active build-out costing a fortune in engineering time and egress. It's a smaller, duller set of choices: sensible timeouts, a cache that can serve stale data when the source is unavailable, a graceful degraded mode that tells the user "search is having a moment" instead of returning a 500, and a status page that doesn't lie.
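
The stale-serving cache is the least obvious of those, so here is a small Python sketch of one. The class name, the TTL, and the `(value, is_stale)` return shape are my assumptions; the loader is whatever function talks to your real source.

```python
import time

class StaleOnErrorCache:
    """Serve fresh data when possible, stale data when the source is down."""

    def __init__(self, loader, ttl_s=60.0, clock=time.monotonic):
        self._loader = loader
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}  # key -> (value, time it was fetched)

    def get(self, key):
        """Return (value, is_stale). Raises only if nothing is cached at all."""
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and now - entry[1] < self._ttl_s:
            return entry[0], False  # still fresh: no call to the source
        try:
            value = self._loader(key)
        except Exception:
            if entry is not None:
                return entry[0], True  # source is down: serve the old copy
            raise  # nothing cached, so the failure has to surface
        self._entries[key] = (value, now)
        return value, False
```

Returning the staleness flag lets the caller decide what to do with it: show yesterday's FX rate with a note, or refuse to take payment on it. Which data can safely go stale is a product decision, not a caching one.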

Decide what your actual availability target is, in numbers, with money attached, and then spend in proportion. Four nines is a different company to three nines, and pretending you're the former while funding the latter is how you end up with the worst of both: the cost of complexity and none of the resilience.
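
To put numbers on the nines, the arithmetic is short enough to do in Python. The function name is mine and the 365-day year is an assumption; the rest is just multiplication.

```python
def downtime_budget_minutes(availability_percent, period_days=365):
    """Minutes of allowed downtime per period for a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)

for target in (99.9, 99.99):
    minutes = downtime_budget_minutes(target)
    print(f"{target}%: {minutes:.1f} minutes a year ({minutes / 60:.2f} hours)")
```

Three nines comes out at about 526 minutes a year, a little under nine hours. Four nines is about 53 minutes, so one bad afternoon at your provider spends several years of budget.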

The outage this month wasn't special. The next one won't be either. The only thing we genuinely control is whether our own software amplifies the failure or quietly absorbs it. Most of mine still amplifies it, and I've got a list of timeouts to go and tune, and a feature-flag SDK to put a sensible cap on before it embarrasses me on a day I'm not watching.