Ramblings of an aging IT geek

the status page was green and the service was on fire

After another round of cloud service wobbles, a look at why status pages lag reality and what to monitor instead.

It has been one of those fortnights. A widely used AI service spent part of a day producing confident nonsense before it was rolled back, and a few of the usual big platforms had their own wobbles on top. None of it was apocalyptic. What stuck with me, again, was the gap between when things broke and when the status page admitted it. For a good while the dashboards were a reassuring green while users were very much not having a reassuring time.

I am not here to dunk on anyone's status page. I have run them. They are genuinely hard, and the reason they lag is structural, not lazy.

A status page is a human-curated, externally facing statement of fact, and every one of those words slows it down. Human-curated means someone has to decide that this is real and not a blip, which takes minutes you do not have during an incident. Externally facing means there is reputational weight on flipping it red, so the bar to do so is, consciously or not, a little higher than it should be. And "statement of fact" means it usually waits for confirmation, which arrives after the thing it would confirm has already ruined your morning. The page is not lying. It is just downstream of the truth, by exactly the amount of time it takes a person to be sure.

The lesson I keep relearning is that someone else's status page is a courtesy, not a monitor. If your alerting strategy for a third-party dependency is "we'll see if they post an incident," you have outsourced your own observability to a team whose incentives are not aligned with your pager. They will tell you eventually. You need to know now.

So the practical version, the bit I actually act on:

  • Monitor the dependency yourself, from your own perspective, with a synthetic check that exercises the path you care about. Not "is their homepage up," but "can I complete the one operation I rely on." A provider can be perfectly healthy in aggregate and broken for exactly your use case, and only your own probe will catch that.
  • Treat a third-party status page as a signal to correlate, not a source of truth. If your synthetics are red and their page is green, believe your synthetics. Your own checks have less to lose by being honest than the provider does.
  • Decide in advance what you do when a dependency is degraded, because you will not invent a good answer mid-incident. Fail open, fail closed, queue and retry, serve stale: pick one per dependency before it matters.
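
For what it is worth, here is roughly how those three bullets look once written down. Treat it as a minimal sketch rather than anyone's production code: the dependency names, the `DEGRADED_POLICY` table, and the `run_probe` and `assess` helpers are all made up for illustration, and the probe is assumed to be whatever callable exercises the one operation you actually rely on.

```python
from enum import Enum
from typing import Callable


class Policy(Enum):
    FAIL_OPEN = "fail open"
    FAIL_CLOSED = "fail closed"
    QUEUE_AND_RETRY = "queue and retry"
    SERVE_STALE = "serve stale"


# Hypothetical dependencies, with the degraded-mode answer picked in advance.
DEGRADED_POLICY = {
    "payments": Policy.FAIL_CLOSED,
    "geo-lookup": Policy.SERVE_STALE,
    "email": Policy.QUEUE_AND_RETRY,
}


def run_probe(operation: Callable[[], object]) -> bool:
    """Run the one operation we rely on; any exception counts as a failure."""
    try:
        operation()
        return True
    except Exception:
        return False


def assess(dependency: str, probe_ok: bool, status_page_green: bool) -> str:
    """Combine our own probe with their status page; the probe always wins."""
    if probe_ok:
        if status_page_green:
            return "healthy"
        # They report trouble we cannot see yet: worth a look, not a page.
        return "healthy for us, provider reports an incident"
    policy = DEGRADED_POLICY[dependency]
    note = "status page still green" if status_page_green else "status page agrees"
    return f"degraded ({note}): {policy.value}"


def broken_lookup():
    # Stand-in for a real call that timed out.
    raise TimeoutError("no answer within budget")


ok = run_probe(broken_lookup)
print(assess("geo-lookup", ok, status_page_green=True))
# degraded (status page still green): serve stale
```

The status page only ever changes the wording of the result, never the decision. That is the whole point of the second bullet.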

The AI service that went briefly strange is a nice illustration of a subtler point too. It was up. It was responding. It was returning HTTP 200s with garbage inside them. Half of monitoring is built to catch "the service is down," and almost none of it catches "the service is lying," which is the failure mode that hurts most and shows up least. Green is not a synonym for working. It never was. The fortnight just reminded me again, and I went and added a couple of checks that look at what came back rather than merely whether anything did.
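
Those checks look something like the sketch below. It assumes a hypothetical JSON endpoint that returns an `answer` field, and a canary question whose answer I already know; neither is any real service's API, and `response_is_sane` is just a name I picked. The only idea it is meant to show is that a 200 is where the check starts, not where it ends.

```python
import json


def response_is_sane(status_code: int, body: str, expected_answer: str) -> bool:
    """Return True only if the reply is a 200 AND its content makes sense."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # A 200 with an unparseable body is still a failure.
        return False
    if not isinstance(payload, dict):
        return False
    answer = payload.get("answer")  # "answer" is an assumed field name
    if not isinstance(answer, str):
        return False
    # Known-answer test: the canary reply must mention what we expect.
    return expected_answer.lower() in answer.lower()


good = '{"answer": "Paris is the capital of France."}'
garbage = '{"answer": "Purple is the capital of Tuesday."}'

print(response_is_sane(200, good, "paris"))     # True
print(response_is_sane(200, garbage, "paris"))  # False
```

A substring match is about the crudest content check there is, and for a real service you would want something less brittle. Even so, it fails on the second response, which a status-code check would have waved through.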