Everything broke at once, which is usually the good news. When one service fails you go hunting. When everything fails simultaneously, it's almost never everything; it's one thing underneath everything. And the thing underneath everything is, more often than I'd like, DNS.
The symptoms were a lovely spread of red herrings. The app couldn't reach the database. The metrics pipeline couldn't reach its endpoint. Health checks were timing out across unrelated services. Each team's first instinct was that their own thing had fallen over, and each was wrong in the same way.
dig told the truth in about four seconds. Lookups against the internal resolver were timing out, full stop. The IPs were all fine; if you hardcoded one into /etc/hosts the service sprang straight back to life. Nothing was actually down except the part that turns names into numbers, and without that, nothing else can find anything.
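That comparison, name lookup failing while the address behind it still answers, can be sketched in a few lines of Python. This is a rough illustration rather than what we ran on the day: the function names and the "dns" / "service-or-network" / "healthy" labels are made up for the example, and the lookup goes through the system resolver instead of querying the internal resolver directly the way dig can.

```python
import socket


def resolves(name):
    """True if the system resolver returns at least one address for `name`."""
    try:
        return len(socket.getaddrinfo(name, None)) > 0
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def reachable(ip, port, timeout=2.0):
    """True if a TCP connection to ip:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def triage(name, known_ip, port, resolve_fn=resolves, reach_fn=reachable):
    """Classify a failure by comparing the name lookup with a direct connection."""
    name_ok = resolve_fn(name)
    ip_ok = reach_fn(known_ip, port)
    if ip_ok and not name_ok:
        return "dns"  # the host answers by IP, only the name lookup is broken
    if not ip_ok:
        return "service-or-network"
    return "healthy"
```

Called with a service's hostname, an IP you already know for it, and its port, a result of "dns" is the same signal the /etc/hosts trick gave us: the machine is there, the name is not.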
The cause was dull, as it always is: a resolver had been restarted with a config that no longer pointed at the right upstream, and the cached entries aged out one by one until the cliff edge arrived all at once. That's the cruel bit about DNS failures. The TTLs mean the building doesn't collapse; it erodes, and then everyone falls through the floor together a few minutes later.
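The erosion is easy to see with a toy model. Assume each cached name has some TTL left at the moment the upstream breaks, and that nothing can be refreshed after that. The numbers below are invented for illustration, not taken from the incident.

```python
def still_resolvable(remaining_ttls, seconds_since_break):
    """Count cached names whose TTL has not yet expired."""
    return sum(1 for ttl in remaining_ttls if ttl > seconds_since_break)


# Hypothetical seconds of TTL left on five cached names when the upstream broke.
remaining = [30, 60, 120, 240, 300]

for t in (0, 90, 300):
    print(t, still_resolvable(remaining, t))
```

At the moment of the break every name still resolves from cache. The count only ever goes down from there, and once the longest TTL has run out nothing resolves at all.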
Fixed the resolver config, lookups recovered, the red herrings swam off. The lasting change was a check that actually resolves a known internal name on a schedule and screams if it can't, rather than assuming DNS works because it usually does. It usually does. Right up until the afternoon it very much doesn't.
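For the curious, a check along those lines might look something like the sketch below. It is a minimal version under stated assumptions, not the check we actually deployed: the name canary.internal.example is a placeholder, the two-second timeout is arbitrary, and alert stands in for whatever does the screaming in your setup.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as LookupTimeout


def can_resolve(name, timeout=2.0, resolve=socket.getaddrinfo):
    """True if `name` resolves to at least one address within `timeout` seconds."""
    # getaddrinfo takes no timeout of its own, so run it in a worker thread
    # and stop waiting after `timeout` seconds.
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        answers = pool.submit(resolve, name, None).result(timeout=timeout)
        return len(answers) > 0
    except (LookupTimeout, OSError):  # socket.gaierror is a subclass of OSError
        return False
    finally:
        # Do not block on a lookup that is still hanging.
        pool.shutdown(wait=False)


def check(name="canary.internal.example", alert=print, resolve=socket.getaddrinfo):
    """Resolve one known name and call `alert` if the lookup fails or times out."""
    ok = can_resolve(name, resolve=resolve)
    if not ok:
        alert(f"DNS check failed: could not resolve {name}")
    return ok
```

Run check() from cron or whatever scheduler you already have. One caveat: it goes through the system resolver, so a local cache or an /etc/hosts entry for the canary name could keep it green while the real resolver is down. Pick a name that nothing pins.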