Ramblings of an aging IT geek

watching a cloud region wobble from the cheap seats

A reflection on watching a public cloud incident play out live on the status page, and why someone else's outage still costs you a morning.

There is a particular spectator sport that engineers pretend they don't enjoy, and that is refreshing a cloud provider's status page during an incident while quietly grateful it isn't your own dashboard going red. Earlier this month I got to play that game properly, watching one of the big providers work through a regional wobble in real time, and it reminded me how strange it is that so much of the modern internet leans on the same handful of buildings.

I won't pretend to know the exact root cause as I write this. The status page did its usual dance: "we are investigating elevated error rates", then "we have identified the issue", then the long quiet stretch where you know engineers somewhere are not having a nice morning. The honest version of any status page is that the green ticks lag reality by anywhere from ten minutes to an hour, and the yellow exclamation marks lag it by longer.

What I find interesting is not the outage itself, which will be written up and post-mortemed and largely forgotten, but the blast radius. A single region having a bad hour rippled out into things that have no obvious connection to it. A SaaS tool we use for invoices got slow. A CI runner queue backed up because the artefact store it pulls from was in the affected region. None of these were my systems, and all of them were my problem for about ninety minutes.

That is the real lesson and it is an old one. You can run a tidy, well-architected service and still have your day decided by a dependency three hops away that you never chose and can't see. The cloud didn't remove single points of failure. It moved them somewhere you can't reach, gave them a nicer name, and put a status page in front of them.

I'm not making the smug on-prem argument here. I run plenty in the cloud and it is mostly excellent, which is rather the point: it is good enough that we forget it is a shared building with shared walls. The provider will recover, publish a measured post-mortem, and we'll all nod and carry on.

The thing worth actually doing, rather than tutting, is to know your own dependency graph well enough that when the next region wobbles you can answer "does this touch us?" in two minutes instead of forty. I couldn't, this time. I spent half an hour working out which region our invoice tool even lived in. That half hour is the bit I can fix before the next one, and there is always a next one.
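
For what it's worth, the two-minute answer doesn't need tooling much fancier than a hand-maintained map of what runs where and what leans on what. Below is a minimal sketch in Python, not a description of anything I actually run: every service name, region name and the `affected_by` helper are invented for illustration, and it assumes you have already done the tedious part of finding out where your third-party tools are hosted.

```python
from collections import deque

# Hypothetical inventory: where each thing is hosted and what it depends on.
# All names and regions below are made up for illustration.
DEPENDENCIES = {
    "billing-app": {"region": "region-b", "depends_on": ["invoice-saas"]},
    "invoice-saas": {"region": "region-a", "depends_on": []},
    "ci-runners": {"region": "region-b", "depends_on": ["artefact-store"]},
    "artefact-store": {"region": "region-a", "depends_on": []},
    "intranet-wiki": {"region": "region-c", "depends_on": []},
}


def affected_by(region, graph):
    """Return the sorted names of everything hosted in `region`,
    plus everything that depends on those things, directly or not."""
    # Invert the edges so we can walk from a dependency to its dependants.
    dependants = {name: [] for name in graph}
    for name, info in graph.items():
        for dep in info["depends_on"]:
            dependants.setdefault(dep, []).append(name)

    # Start with whatever lives in the wobbling region, then walk outwards.
    hit = {name for name, info in graph.items() if info["region"] == region}
    queue = deque(hit)
    while queue:
        current = queue.popleft()
        for parent in dependants.get(current, []):
            if parent not in hit:
                hit.add(parent)
                queue.append(parent)
    return sorted(hit)


if __name__ == "__main__":
    print(affected_by("region-a", DEPENDENCIES))
```

With the made-up data above, asking about `region-a` returns the invoice tool and the artefact store, plus the billing app and CI runners that sit elsewhere but depend on them. The code is the easy bit; keeping the map honest is the real work.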