Ramblings of an aging IT geek
news

the cloud went down again, and we acted surprised again

A regional cloud wobble took out a clutch of services, and the only real lesson is the one nobody wants to action.


There was another provider wobble in the past couple of weeks, and it took the usual shape: a region degrades, a control plane gets unhappy, and suddenly half your Twitter feed is people discovering their multi-region setup was single-region all along. I won't pretend to know the exact root cause yet; the proper post-mortem usually lands a week or two later. But the pattern is tediously familiar.

What gets me is the reaction every single time. People treat each outage as a freak event, a one-in-a-thousand-days thing they couldn't possibly have planned for. Except your provider publishes a number for exactly this. Three nines (99.9% availability) works out to roughly nine hours a year of being down, and you signed up for it. If nine hours of unavailability would end your business, the cloud didn't fail you; your architecture did.
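For anyone who wants to check the nines arithmetic themselves, here is a minimal sketch. It assumes the availability figure is measured over a full 365-day year, which may not match how your provider actually measures it (plenty of agreements work on a monthly window, so read the terms). The function name is mine, not anything from a provider's tooling.

```python
# Rough downtime budget implied by an availability target.
# Assumption: measured over a 365-day year; real agreements may use
# a monthly window and exclude scheduled maintenance.
HOURS_PER_YEAR = 365 * 24  # 8760


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service can be down and still meet the target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(target)
    print(f"{target}% availability allows about {hours:.2f} hours down per year")
```

Three nines comes out at about 8.76 hours, which is where "roughly nine hours" comes from. Drop to two nines and the budget is more than three and a half days.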

The honest lesson is boring and nobody wants to do it. Know which dependencies are single points of failure. Test what actually happens when one of them goes away, on purpose, before it happens to you. And accept that "we'll just move to multi-region" is a line item that costs real money and real complexity, so either pay for it or stop being surprised.
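"Test what happens when it goes away" can start very small. The sketch below is a toy, not a chaos engineering setup: every name in it is hypothetical, and the "outage" is just a function that raises. The point is only the shape of the exercise, which is to break the dependency on purpose and assert that something sensible still comes back.

```python
from typing import Callable


class DependencyDown(Exception):
    """Raised by a dependency call that cannot be completed."""


def fetch_with_fallback(primary: Callable[[], str],
                        fallback: Callable[[], str]) -> str:
    """Try the primary dependency; use the fallback only if it fails."""
    try:
        return primary()
    except DependencyDown:
        return fallback()


def healthy_primary() -> str:
    return "fresh data"


def broken_primary() -> str:
    # Simulated outage: stands in for a region or service going away.
    raise DependencyDown("primary region unavailable")


def cached_copy() -> str:
    return "stale but usable data"


# Normal day: the primary answers.
assert fetch_with_fallback(healthy_primary, cached_copy) == "fresh data"

# Deliberate failure: does anything sensible come back?
assert fetch_with_fallback(broken_primary, cached_copy) == "stale but usable data"
```

If the second assertion has no fallback to pass in, you have found a single point of failure, and you found it on a quiet afternoon instead of during the outage.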

I'm not exempt. I have a box right now whose entire disaster recovery plan is "I'll be awake and I'll notice." That's a choice, not a strategy, and at least I know it.