Ramblings of an aging IT geek
debugging

the outage that wasn't the database, it was dns again

A production incident that looked like a database failure and turned out to be a single dead DNS resolver adding a five-second timeout to every uncached lookup.


It's always DNS. People say it as a joke, the way they say it's always lupus, and then one Tuesday it actually is DNS and you remember why the joke exists. This is the story of an outage that wore three different disguises before it admitted what it was, and what I changed so the next one is at least faster to spot.

the symptom

The first alert was latency on the API. Not errors, latency. The p99 on a handful of endpoints had crept from comfortably under 200ms to north of five seconds, and a few requests were timing out outright. The endpoints that were slow all had one thing in common, which I clocked far too late: they each made an outbound call to a third-party service.

The first instinct, naturally, was the database. It usually is. So I went and stared at the database, which was bored, idle, and slightly offended at the accusation. Connection pool healthy, slow query log empty, CPU flat. Nothing there. Twenty minutes gone.


the disguises

The second disguise was the third-party service itself. The slow endpoints all called the same external API, so obviously that vendor was having a bad day. I checked their status page (green), then curled their endpoint from my laptop (instant), then from the affected host. And there it was: the curl from the host sat there for exactly five seconds before doing anything, then completed in milliseconds.

Five seconds. A suspiciously round, suspiciously configured-looking five seconds. That's not a slow API. That's a timeout. Something was waiting for five seconds and then giving up on a first attempt.

$ time curl -s -o /dev/null https://api.thirdparty.example/v1/ping
real    0m5.043s
$ time getent hosts api.thirdparty.example
real    0m5.001s

getent took five seconds. The HTTP request was never the problem. Resolving the name was the problem. The application wasn't slow, the network wasn't slow, the vendor wasn't slow. The box could not turn a hostname into an address without first timing out against a dead resolver and falling back to a second one.
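The same separation of "how long did the name take" from "how long did the request take" can be scripted. What follows is a minimal sketch, not the tooling used during the incident: `time_resolution` is a hypothetical helper name, and the `resolver` parameter exists only so the function can be exercised without a network.

```python
import socket
import time


def time_resolution(hostname, port=443, resolver=socket.getaddrinfo):
    """Return (seconds_elapsed, addresses) for a single name lookup.

    Goes through the system resolver, so it pays the same timeout
    that getent and the application pay.
    """
    start = time.monotonic()
    results = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    elapsed = time.monotonic() - start
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    addresses = sorted({entry[4][0] for entry in results})
    return elapsed, addresses
```

Calling `time_resolution("api.thirdparty.example")` on the affected host would be expected to report roughly the same five seconds that getent did, which pins the delay on resolution before any HTTP traffic is involved.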

the actual cause

We had two nameservers in /etc/resolv.conf. The first one had quietly fallen over earlier that afternoon, during an unrelated maintenance window that nobody had connected to this. The standard resolver behaviour is to try the nameservers in the order listed: query the first, wait out the timeout (five seconds by default), then try the next. So every uncached lookup paid a flat five-second tax before succeeding against the healthy second resolver.
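The arithmetic of that tax is simple enough to model. This is a deliberately simplified sketch: it assumes a healthy server answers instantly and a dead one always costs the full timeout, and it ignores any per-retry timeout scaling a real resolver may apply. `lookup_delay` and its parameters are illustrative names, with defaults mirroring the resolver defaults of a five-second timeout and two attempts.

```python
def lookup_delay(nameservers_up, timeout=5.0, attempts=2):
    """Seconds spent waiting on timeouts before a lookup gets an answer.

    nameservers_up: booleans in resolv.conf order, True = responding.
    Returns None if no server answers within the allowed attempts.
    """
    waited = 0.0
    for _ in range(attempts):
        # Each attempt walks the nameserver list from the top again.
        for is_up in nameservers_up:
            if is_up:
                return waited
            waited += timeout
    return None
```

With the setup in this incident, `lookup_delay([False, True])` gives `5.0`, and the same call with `timeout=1.0` gives `1.0`. Swapping the order of the two nameservers would have hidden the failure entirely, which is part of why it went unnoticed.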

Internal calls were mostly fine because those names were cached or in /etc/hosts. The third-party hostname had a short TTL and wasn't cached locally, so it got re-resolved constantly, and each resolution ate the full timeout. That's why only the endpoints touching that one external service looked broken. The shape of the symptom pointed at the vendor, and the vendor was innocent.

what I changed

The immediate fix was to remove the dead resolver from the config and restart the affected services so they would pick up the change. Latency dropped back to normal within seconds. The five-second tax vanished.

The lasting changes are the ones that matter:

  • A local caching resolver on each host (I used dnsmasq), so a single upstream failure degrades gracefully instead of taxing every lookup. The application now talks to localhost, and localhost decides which upstream is healthy.
  • options timeout:1 attempts:2 in resolv.conf, so even a naive fallback costs one second, not five. Five seconds as a default is a relic.
  • An alert on DNS resolution time itself, measured from the hosts that matter. We monitored the database, the queue, the cache, the disk. We did not monitor whether the machine could resolve a name, which in hindsight is like monitoring the engine but not the fuel line.
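The resolution-time alert can start as a very small probe run from the hosts that matter. The sketch below is one possible shape for it, not the check actually deployed: `check_dns`, the `warn_after` threshold of half a second, and the injectable `resolver` argument are all assumptions made for illustration.

```python
import socket
import time


def check_dns(hostnames, warn_after=0.5, resolver=socket.getaddrinfo):
    """Resolve each hostname once and return the problems found.

    Returns a list of (hostname, seconds, reason) tuples; an empty
    list means every name resolved within warn_after seconds.
    """
    problems = []
    for name in hostnames:
        start = time.monotonic()
        try:
            resolver(name, None)
            reason = None
        except socket.gaierror as exc:
            reason = f"failed: {exc}"
        elapsed = time.monotonic() - start
        # A lookup that succeeds but takes too long is still a problem:
        # that is exactly what a dead first nameserver looks like.
        if reason is None and elapsed > warn_after:
            reason = "slow"
        if reason is not None:
            problems.append((name, elapsed, reason))
    return problems
```

Feeding it the external hostnames the application depends on, and alerting whenever the returned list is non-empty, would have flagged this incident as "slow" well before anyone looked at the database.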

The lesson I keep relearning is that the loudest symptom is rarely the cause. The database was on fire in my head for twenty minutes purely because it's the usual suspect. The real culprit was the most boring, most fundamental service in the stack, the one so reliable that it had no monitoring at all. It's always DNS, and the reason it's always DNS is that it's the thing we stop watching.