The pager went off at 08:40 with the worst possible kind of alert: everything, all at once. The API was throwing 500s, the background workers were failing, the admin dashboard wouldn't load, and three separate Slack channels lit up with the same panic. When everything breaks simultaneously, it's almost never everything. It's one thing underneath all of them.
The first useful clue was the error text. Not connection refused, not timeouts, but "name or service not known" buried in the application logs. The services weren't failing to reach the database; they were failing to figure out where the database was. That narrowed an apparently total outage down to a single layer: resolution.
chasing the resolver
A quick dig from one of the affected hosts told the story:
$ dig +short db-primary.internal
;; connection timed out; no servers could be reached
The internal resolver wasn't answering. Worse, when it did answer, it answered wrong: a couple of the lookups returned an old IP from a host that had been decommissioned weeks earlier. So some requests timed out and some were cheerfully connecting to nothing, which is why the failure looked so scattershot from the outside.
The internal zone was served by a pair of resolvers behind a virtual IP. One of the two had been restarted overnight by an automated patch run, and it had come back up with a stale cached view of the zone because the file it loaded on boot hadn't been refreshed since the last decommission. The healthy resolver and the stale one sat behind the same VIP, so roughly half of every service's lookups went to a node confidently telling lies, and the other half timed out waiting on a node that was struggling under the retry storm the first one caused.
stopping the bleeding
The fast fix was to pull the bad node out of the VIP so every query went to the healthy resolver. The dashboard came back within a minute, the workers drained their backlog, and the 500s stopped. Total restore time once we'd actually found it was about ninety seconds. Finding it took the other forty minutes, which is the usual ratio.
Then the bad node got its zone reloaded from the source of truth rather than from whatever it had clung to on boot, and only went back into the VIP once dig against it directly returned the correct, current records.
The real fix was further upstream and less dramatic. The resolvers now load their zones from a single managed source on startup, with a check that refuses to come up healthy if the serial number is older than the published one. A resolver that can't prove it has the current data should take itself out of rotation, not serve confidently wrong answers to half the fleet.
I keep a small mental list of failures that look like a hundred problems but are really one, and DNS is permanently at the top of it. When every service breaks at once and none of them share code, look at what they all share underneath. It's the network, the clock, or the names. This time it was the names. It usually is.