Ramblings of an aging IT geek

watching an outage land while the status page stayed green

Reflections on watching a widely reported service disruption ripple outward in real time, and why the status pages of the services involved lag the actual failure.

[Image: a wall of news headlines about a tech outage]

There was a run of widely reported service disruptions this month, the sort where a single provider hiccups and you watch the blast radius spread across half the internet over the following twenty minutes. I happened to be at my desk for one of them, and there's a particular fascination in watching it happen rather than reading the post-mortem a week later.

The tell is always the same. Something you depend on starts timing out. You check its status page: green, all systems operational. You check the thing it depends on: also green, but with the first trickle of comments underneath from people saying "is anyone else seeing this?". The actual failure is already well underway by the time any official dashboard admits to it, because status pages are updated by humans who first have to notice, confirm, and decide it's bad enough to say so. The machines failed minutes ago.

What struck me, watching it, was how much of our tooling assumes the dependencies below us are honest about their own health. They're not lying, exactly. They simply don't know yet. By the time the incident banner goes up, you've already spent fifteen minutes convinced the fault is yours, restarting things that were never broken.
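For what it's worth, here is a rough sketch of not taking the status page's word for it: probe each dependency directly with a short timeout, and compare that against your own health checks before touching anything. The names `probe` and `triage` and the wording of the verdicts are my own inventions for illustration, not any tool's API, and a real setup would want retries and more than a single request.

```python
import http.client
import urllib.request
from typing import Dict


def probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers successfully within the timeout.

    This measures what we can actually observe, not what the
    provider's status page claims. (Hypothetical helper.)
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def triage(own_checks: Dict[str, bool], dependency_probes: Dict[str, bool]) -> str:
    """Suggest where to look first, given direct observations.

    own_checks: health-check results for things you run yourself.
    dependency_probes: results of probing each dependency directly.
    """
    failing_own = sorted(name for name, ok in own_checks.items() if not ok)
    failing_deps = sorted(name for name, ok in dependency_probes.items() if not ok)

    if failing_deps and not failing_own:
        return "look upstream first: " + ", ".join(failing_deps)
    if failing_own and not failing_deps:
        return "look at your own stack first: " + ", ".join(failing_own)
    if failing_own and failing_deps:
        return "both failing, check upstream before restarting: " + ", ".join(failing_deps)
    return "nothing observed failing, widen the search"
```

In use, you would fill `dependency_probes` with something like `{"payments": probe(payments_url)}`, where `payments_url` stands in for whatever endpoint you genuinely depend on, and read the verdict before restarting anything of your own.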

I don't have a tidy lesson here, only a habit it reinforced: when everything you own looks fine but nothing works, widen the search before you start taking your own stack apart. Sometimes the green light is just slow.