It is always DNS. We say it as a joke, right up until the morning it is actually DNS and nobody is laughing.
The symptoms were a beautiful spread of nonsense. The API was timing out talking to the database. The web tier was throwing connection errors to the API. A background worker was failing to reach the object store. Three different teams' worth of dashboards going red at once, and every one of them pointed confidently at a different culprit. The classic shape of an outage where everything is broken, which usually means nothing specific is broken and something underneath everything is.
chasing the wrong things
We started where the noise was loudest, which was the database. Connections were piling up, the pool was exhausted, so obviously the database was the problem. Except it wasn't. The database was fine. It was sat there bored, accepting the connections it could actually see, while the application servers spent several seconds trying to resolve its hostname before every single connection attempt.
That fixed delay is the tell. Healthy DNS answers in milliseconds and you never think about it. The moment a resolution takes the full timeout and then succeeds (or fails) on a retry, you get exactly this pattern: everything slow, everything intermittent, nothing actually down.
the actual cause
One of our internal resolvers had quietly fallen over in the night. We ran two, as you do, and resolv.conf listed both. The trouble is that the standard resolver behaviour is to try the first nameserver, wait for the timeout, then fall through to the second, and to start again from the first on the next lookup. With the first one dead, every lookup ate the full timeout before the healthy resolver answered. Nothing was unreachable. Everything was just paying a tax of several seconds per name, and at our request volume that tax bankrupted every connection pool we owned.
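For illustration, the file in question would have looked something like the fragment below. This is a reconstruction, not a copy: 10.0.0.5 is a made-up address standing in for the dead resolver, 10.0.0.6 is the healthy one, and the timing in the comments assumes the stock five-second timeout.

```
# /etc/resolv.conf (illustrative; 10.0.0.5 is a hypothetical address)
# Tried first on every lookup. Dead, so each query waits out the timeout here.
nameserver 10.0.0.5
# Healthy, but only asked once the first one has timed out (about 5s later).
nameserver 10.0.0.6
```

The order of the nameserver lines is the whole story. Swap them and the same dead resolver costs nothing until the healthy one also has a bad day.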
You can confirm this faster than you can argue about it:
    # fast, because it talks to the healthy resolver directly
    dig @10.0.0.6 db.internal +short

    # slow, because it walks resolv.conf and waits on the dead one first
    time getent hosts db.internal
When the second command takes five seconds and the first returns instantly, you are done diagnosing. Restart the dead resolver, or pull it from resolv.conf, and watch every unrelated dashboard go green at the same instant. That synchronised recovery is the other tell. Genuinely independent systems do not heal in lockstep.
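If you would rather not remember which resolver is which at that hour, the same check can be scripted against whatever resolv.conf lists. This is a rough sketch rather than anything we ran that morning: the function names are made up, and it assumes dig and awk are installed.

```shell
# Print each nameserver address from a resolv.conf-style file.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Ask one resolver directly, giving up after a single one-second try so a
# dead resolver cannot stall the check itself. dig exits non-zero when no
# server replies.
probe_resolver() {
    if dig "@$1" "$2" +short +time=1 +tries=1 >/dev/null 2>&1; then
        echo "$1 ok"
    else
        echo "$1 NO REPLY"
    fi
}

# Probe every listed resolver, in the order the system would try them.
# $1 is the name to look up, $2 an optional path to the resolver config.
check_resolvers() {
    for cr_server in $(list_nameservers "$2"); do
        probe_resolver "$cr_server" "$1"
    done
}
```

Running check_resolvers db.internal prints one line per resolver. A dead one shows up as NO REPLY after about a second, which is a lot quicker than working it out from three teams' dashboards.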
what we changed
We added options timeout:1 attempts:2 to resolv.conf so a dead resolver costs us one second per lookup rather than five, stood up proper health checking on the resolvers themselves, and put DNS resolution time on a graph where a human would actually see it. The fix was ten minutes. The diagnosis was two hours, almost all of it spent looking at the database because that was where the symptoms were screaming loudest.
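Getting resolution time onto a graph does not need much. The sketch below is one way to produce the number, not a description of our actual monitoring: dns_lookup_ms is a made-up metric name, the output format is only an example, and the timing relies on GNU date supporting %N for nanoseconds.

```shell
# Run a command and print how long it took, in whole milliseconds.
# Assumes GNU date, whose %N format prints nanoseconds.
elapsed_ms() {
    em_start=$(date +%s%N)
    "$@" >/dev/null 2>&1
    em_end=$(date +%s%N)
    echo $(( (em_end - em_start) / 1000000 ))
}

# Emit one metric line for a lookup through the normal system resolver path,
# so a dead first nameserver shows up as the full timeout.
# "dns_lookup_ms" is a hypothetical metric name; rename it for your collector.
dns_lookup_metric() {
    echo "dns_lookup_ms{name=\"$1\"} $(elapsed_ms getent hosts "$1")"
}
```

Call dns_lookup_metric db.internal from whatever already scrapes your hosts. The point of going through getent rather than dig is that it measures what the applications experience, dead resolver and all.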
The lesson I keep relearning: when several unrelated things break at once, stop staring at any one of them and look underneath all of them. It is the shared dependency, and depressingly often, it is DNS.