Ramblings of an aging IT geek

another cloud wobble, and the lesson we keep not learning

A reaction to the latest run of cloud provider hiccups, and why "the cloud is someone else's computer" still hasn't sunk in.

There has been another round of provider wobbles this month, the sort that fills the dependency graphs of half the internet with red and turns your timeline into a support forum. The specifics matter less than the pattern, and the pattern is by now familiar: a single region has a bad day, and a surprising number of services that swore they were resilient turn out to have been resilient only on paper.

I don't write this to gloat. I run things in the cloud too, and I have made every one of these mistakes. But each outage is a free audit of everyone's assumptions, and it's worth reading what it tells us rather than just waiting for the status page to go green again.

"multi-region" is a verb, not an adjective

The thing that gets people is not that a region failed. Regions fail. It's that they had ticked the box marked "highly available" and assumed that meant something. Running in one availability zone with an autoscaling group is not multi-region; it isn't even multi-zone. A read replica you have never failed over to is not a disaster recovery plan, it's a hope. If you have never actually pulled the plug and watched the failover happen, you don't have a failover, you have a diagram.

the dependencies you forgot you had

The other recurring lesson is hidden coupling. Your app might be spread across three zones beautifully, and then it turns out everything authenticates against one service in one place, or pulls config from one bucket, or resolves through one set of DNS that happened to live in the affected region. The blast radius is never where you drew the box. It's wherever the thing you forgot about lives.
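One way to make the forgotten box visible is to write the dependencies down and ask the question mechanically. What follows is a rough sketch in Python, not a tool: the service names, the region names and the two dictionaries are all invented for illustration, and a real inventory would have to come from wherever you actually keep that knowledge. It marks a service as down if it runs only in the dark region, then keeps marking anything that has a hard dependency on something already down.

```python
from typing import Dict, List, Set

# Hypothetical inventory: service -> regions it runs in.
RUNS_IN: Dict[str, Set[str]] = {
    "web": {"region-a", "region-b"},
    "api": {"region-a", "region-b"},
    "auth": {"region-a"},            # the one service in one place
    "config-bucket": {"region-a"},   # the one bucket
    "search": {"region-b"},
}

# Hypothetical hard dependencies: service -> things it cannot run without.
HARD_DEPS: Dict[str, List[str]] = {
    "web": ["api"],
    "api": ["auth", "config-bucket"],
    "auth": [],
    "config-bucket": [],
    "search": [],
}


def blast_radius(
    dark_region: str,
    runs_in: Dict[str, Set[str]],
    hard_deps: Dict[str, List[str]],
) -> Set[str]:
    """Return every service that goes down if dark_region disappears."""
    # Directly hit: services with no footprint outside the dark region.
    down = {svc for svc, regions in runs_in.items() if regions <= {dark_region}}

    # Indirectly hit: keep sweeping until no new casualties appear.
    changed = True
    while changed:
        changed = False
        for svc, deps in hard_deps.items():
            if svc not in down and any(dep in down for dep in deps):
                down.add(svc)
                changed = True
    return down


if __name__ == "__main__":
    for region in ("region-a", "region-b"):
        print(region, "->", sorted(blast_radius(region, RUNS_IN, HARD_DEPS)))
```

In this made-up inventory, web and api both span two regions and still go down with region-a, because auth and the config bucket do not. The model is deliberately crude (it ignores capacity, soft dependencies and anything you haven't written down), but the exercise of filling in the two dictionaries is most of the value.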

A few habits that survive contact with these mornings:

  • Know your hard dependencies, the ones that take everything down with them, and treat them with suspicion.
  • Cache aggressively enough that a brief upstream outage degrades rather than kills.
  • Practise the failover on a calm Tuesday, not during the incident.
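The caching habit is the easiest of the three to show in code. Here is a minimal sketch in Python of a cache that serves a stale value when the upstream call fails, assuming a hypothetical `fetch` function you supply and some arbitrary time limits; the class name and parameters are mine, not any library's API. Fresh entries are returned as-is, expired ones trigger a refetch, and if the refetch throws, the old value is served as long as it is younger than `max_stale`.

```python
import time
from typing import Any, Callable, Dict, Tuple


class StaleOnErrorCache:
    """Cache that degrades to stale data when the upstream is unavailable."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        ttl: float = 30.0,          # seconds an entry counts as fresh
        max_stale: float = 3600.0,  # oldest entry we are willing to serve on error
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._max_stale = max_stale
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._store.get(key)

        # Fresh hit: no upstream call at all.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]

        try:
            value = self._fetch(key)
        except Exception:
            # Upstream is having a bad day: fall back to the stale copy
            # if we have one that is not too old, otherwise give up.
            if entry is not None and now - entry[0] < self._max_stale:
                return entry[1]
            raise

        self._store[key] = (now, value)
        return value
```

Note that a stale hit does not reset the entry's timestamp, so `max_stale` is a real ceiling on how old the data can get. Whether an hour-old value is acceptable is a per-dependency decision: fine for feature flags, probably not for account balances.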

someone else's computer, still

The phrase "the cloud is just someone else's computer" gets trotted out as a sneer, but it's the most useful framing we have. You have outsourced the operation, not the responsibility. When the provider has a bad day, your customers still ring you, not them. That's the deal, and it's a perfectly good deal, as long as you remember you signed it.

The providers will publish their post-mortems, and they're usually honest and genuinely worth reading. The harder homework is the one we keep skipping: writing our own. What would have actually happened to my service if that region had stayed dark for six hours? If the answer is "I'm not sure", that's the work, and it's a much better use of the morning than refreshing a status page.