it was never the app, it was always dns

A terminal showing a bug

The pager went off mid-morning. Checkout was slow, then checkout was timing out, then half the dashboards were red. The symptom everyone fixated on was the database, because the application logs were full of connection timeouts to the database, and when you see a hundred database timeouts your brain stops reading after the word "database".

I spent the first twenty minutes there. The database was fine. CPU flat, query latency normal, plenty of headroom in the connection pool. It was sat there bored while the application insisted it couldn't reach it. That contradiction is the tell, and I should have spotted it sooner: if one side says "I can't connect" and the other side says "nobody's connecting", the problem is almost never either of them. It's the bit in between.

Following the actual latency

The application connects to the database by hostname, not by IP. That hostname has to be resolved. So I stopped reading application logs and started timing the resolution itself.

$ time dig db-primary.internal +short
10.4.2.19

real    0m5.012s

Five seconds. To resolve a name that should take single-digit milliseconds. There it is. Every new connection the application opened was sat for five seconds waiting on DNS before it even tried to talk to the database, and with the connection pool churning under load, that five seconds was being paid constantly. The "database timeout" was a DNS timeout wearing a database's coat.

Code on a screen

Why five seconds, exactly

Five seconds is a very specific number, and specific numbers in failures are gifts. The default resolver timeout in glibc is five seconds before it gives up on a nameserver and tries the next one in /etc/resolv.conf. So the machine was asking its first nameserver, getting nothing, waiting the full timeout, and only then falling through to the second.

The first nameserver had quietly died. Not crashed loudly, which would have been kind. It was up enough to accept the UDP packet and then never answer, which is the worst failure mode there is, because every client politely waits the full timeout for a reply that's never coming. A dead server that refuses connections fails fast. A zombie server that swallows queries fails slow, and slow is what takes you down under load.

The fix, and the realer fix

The immediate fix was thirty seconds of work: pull the dead resolver out of resolv.conf on the affected hosts so queries went straight to the one that worked. Latency dropped to normal, the connection storm cleared, the dashboards went green. Incident over, for values of over.

The realer fixes took longer and mattered more.

Run a caching resolver on each host, something like dnsmasq or systemd-resolved, so a momentary upstream wobble is absorbed locally instead of becoming application latency. A name resolved once shouldn't be resolved a thousand times.
Monitor resolution latency directly, as a first-class signal, not just resolver process liveness. The dead box was "up" by every check we had. Nothing was timing how long it took to actually answer a query, which is the only thing that mattered.
Shorten that glibc timeout. options timeout:1 attempts:2 in resolv.conf turns a five-second cliff into a one-second stumble. It won't stop a resolver dying, but it stops one dying resolver holding everything hostage.

The thing I keep relearning, and apparently will go on relearning forever, is that the loud error is rarely the cause. The logs screamed "database" because the database was the last thing in the chain that noticed something was wrong. The actual fault was three layers earlier and entirely silent. "It's always DNS" is a tired joke precisely because it's so often true, and the reason it's so often true is that DNS sits underneath everything and fails quietly. Time the layer below the one that's complaining. It's usually already telling you.