The pages went off all at once, which is itself a clue, though not the one I read first. When a single service falls over you get one alert. When everything alerts within the same thirty seconds, the thing that broke is not a service. It is something underneath all of them, and there are only a few candidates: the network, the clock, or DNS. It is, as the joke goes, always DNS, and the joke is a joke because it is so reliably true.
My first instinct was wrong in the usual comforting way. I assumed a service was down and started checking the loudest one. It was up. CPU normal, memory normal, it was happily listening on its port. So was the next one. So was the database. Every host was up, every process was running, every health check that did not involve a name was green. Nothing was down. Everything was failing.
The thing they all had in common was that they talked to each other by name, and names had stopped resolving.
I confirmed it the moment I stopped trusting my assumptions and asked the resolver directly:
$ dig +short api.internal.example
$ dig api.internal.example
;; ... status: SERVFAIL ...
$ ping 10.0.4.21
64 bytes from 10.0.4.21: icmp_seq=1 ttl=64 time=0.31 ms
The IP answered instantly. The name returned nothing, then SERVFAIL. So the network underneath was fine, the hosts were fine, and the only broken layer was the one that turned names into addresses. Every application in the estate connected by hostname, so when the resolver went quiet, every connection failed at the very first step, before a single packet of real traffic was sent. It looked exactly like a total outage because functionally it was one, just located one layer lower than anyone's dashboards were watching.
The resolver itself had not crashed, which is why it took longer than it should have. The process was up and answering, it had simply lost its upstream. A configuration change earlier in the day had pointed it at a forwarder that was no longer reachable, and with no working upstream and a cold cache, it had nothing to say. So it said SERVFAIL, politely and consistently, to every query in the building. Restarting it with the correct forwarder brought the names back, and as the names came back the alerts went green in a wave, in roughly the order each service next retried.
Two things I took away, neither of them new, both worth paying for again. First: when everything breaks at once, do not debug a service, debug the layer they share, and DNS is almost always the cheapest one to rule out first with a single dig. Second, and this is the one I actually changed: monitor name resolution as its own thing, separately from the services that depend on it. A check that resolves an internal name every minute and pages on failure would have told me in plain English what I spent twenty minutes deducing from a screen full of healthy hosts that could not find each other.