Everything broke at the same instant, which is the most useful clue you can be handed and the easiest one to misread. When a single thing fails, you suspect that thing. When everything fails together, the cause is almost never "everything" and almost always one shared dependency sitting underneath the lot. The web app, the background workers, the metrics pipeline, the thing that emails me when things break: all dead, all at once. The thing that emails me being dead is why I found out late.
My first reaction was network, because simultaneous total failure smells like a cut cable. But the hosts could ping each other by IP perfectly well. The boxes were up, the links were up, packets flowed. What did not work was anything that involved a hostname. curl to an IP: fine. curl to a name: hung, then failed. That is the tell.
$ dig +short api.internal.example
;; connection timed out; no servers could be reached
No resolver answered. Every application that resolved a name before doing its actual job, which is to say every application, sat there blocking on a DNS query that would never return, then failed the request once the lookup timed out. The services were not broken. They were waiting for an answer to "where is the database" that nobody was giving them.
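The distinction that gave it away, reachable by IP but not by name, can be written down as a small check. This is a minimal sketch using only the Python standard library, not the tooling I actually ran; the function names and the three labels are my own invention for illustration.

```python
import socket


def resolves(hostname: str) -> bool:
    """Return True if the system resolver yields at least one address."""
    try:
        return bool(socket.getaddrinfo(hostname, None))
    except socket.gaierror:
        return False


def reachable(ip: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to an IP literal succeeds.

    Passing an IP literal means no DNS lookup is involved.
    """
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(ip_ok: bool, name_ok: bool) -> str:
    """Map the two observations onto a first suspect (labels are illustrative)."""
    if not ip_ok:
        # Hosts cannot even be reached by address: look at the network first.
        return "network"
    if not name_ok:
        # Packets flow, names do not resolve: the pattern from this outage.
        return "dns"
    return "ok"
```

Something like `diagnose(reachable("192.0.2.10", 443), resolves("api.internal.example"))` would have returned `"dns"` that morning, assuming `192.0.2.10` stands in for a host I knew to be up. The address is a placeholder from the documentation range, not a real machine.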
The cause was embarrassingly small. Someone, and by someone I mean me, in a change I had pushed the previous evening and forgotten about, had updated the resolver configuration to point at a new internal DNS server that was meant to go live. The new server was not actually serving yet. The config rolled out on the next configuration run, the old resolver was removed from the list, the new one did not answer, and the machines were left pointing at a void. No fallback, because I had replaced the list rather than appending to it. Classic.
The fix took thirty seconds once I understood it: point the resolvers back at something that answered, let resolution recover, and watch every service quietly come back to life without a single restart, because none of them had actually crashed. They had just been holding their breath.
I want to draw out the two real lessons, because "it's always DNS" is a meme but it is a meme for a reason.
The first is that DNS is a shared dependency hiding in plain sight. Almost nothing connects to a raw IP any more; everything resolves a name first. That makes the resolver a single point of failure underneath services that otherwise have nothing to do with each other, and it makes a DNS outage present as a total, simultaneous, baffling collapse rather than as "DNS is down". You have to learn to read the pattern, because the symptom never says the cause.
The second is the operational sin I actually committed: I replaced a list I should have extended. Cutting over to a new resolver should have meant adding it alongside the old one, confirming it answered, and only then removing the previous entry. Instead I swapped it in one move and trusted that the new server would be ready, and trust is not a deployment strategy.
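The add, confirm, then remove sequence is simple enough to express as a pure function that decides what the resolver list should be on the next configuration run. This is a sketch under my own assumptions: `next_resolver_list` and its parameters are hypothetical names, and `answers` is whatever probe you trust to tell you a resolver is actually serving.

```python
from typing import Callable, List


def next_resolver_list(
    current: List[str],
    new: str,
    retiring: str,
    answers: Callable[[str], bool],
) -> List[str]:
    """Return the resolver list for the next rollout step.

    Step 1: if the new resolver is not listed yet, add it alongside the rest.
    Step 2: only once the new resolver answers, drop the retiring one.
    Until then, nothing is removed.
    """
    if new not in current:
        # Additive first: the old resolver stays in place and stays first.
        return current + [new]
    if not answers(new):
        # New resolver is listed but not serving: hold, remove nothing.
        return list(current)
    # New resolver is confirmed: now it is safe to retire the old entry.
    return [r for r in current if r != retiring]
```

Run against my outage, the first rollout would have produced old plus new, every later run would have held there while the new server stayed silent, and the old entry would only have disappeared once the new one answered.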
So now the resolver list always carries a known-good fallback, cutovers are additive, and I have an alert that checks resolution from outside the cluster specifically so that the thing telling me DNS is broken does not itself depend on DNS. It is always DNS, and the least I can do is be ready for it to be DNS.
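For the curious, a check that does not itself depend on DNS can be as small as a single UDP query sent straight to a resolver's IP address. What follows is a minimal sketch using only the Python standard library, not my actual alerting code: the function names are hypothetical, it handles only plain ASCII names and A records, and it assumes the resolver listens on UDP port 53.

```python
import random
import socket
import struct


def build_query(name: str, query_id: int) -> bytes:
    """Build a minimal DNS query for an A record with recursion desired."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").encode("ascii").split(b".")
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def reply_is_answer(reply: bytes, query_id: int) -> bool:
    """True if the reply matches our ID, is a response, has RCODE 0 and answers."""
    if len(reply) < 12:
        return False
    rid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", reply[:12])
    is_response = bool(flags & 0x8000)
    rcode = flags & 0x000F
    return rid == query_id and is_response and rcode == 0 and ancount > 0


def resolver_answers(resolver_ip: str, name: str, timeout: float = 2.0) -> bool:
    """Ask one resolver, addressed by IP, whether it can resolve `name`."""
    query_id = random.randrange(0x10000)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name, query_id), (resolver_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Timeouts and unreachable hosts both count as "did not answer".
            return False
    return reply_is_answer(reply, query_id)
```

The point of the design is that the resolver is addressed by IP, so the probe keeps working when resolution is down. Whatever delivers the resulting alert needs the same property, or it will go quiet at exactly the wrong moment.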