it was never the network, it was the resolver

A terminal showing a stuck process

The week between Christmas and New Year is meant to be quiet. Skeleton staff, nobody deploying, a chance to clear the backlog of small things. So of course that is when half the internal services decided to stop talking to each other.

The symptoms were maddening because they didn't agree with each other. Some requests went through instantly. Some hung for exactly five seconds and then succeeded. A few just timed out. The dashboards showed elevated latency but no errors worth shouting about, and the on-call alert that woke me up was the vague kind: "service degraded". Brilliant. Degraded how? In which direction? It declined to say.

My first instinct was the network, because it is always the network until it isn't. I checked the switches, the firewall counters, the load balancer. All boring, all green. Pings between hosts were sub-millisecond and lossless. Whatever was happening, packets were getting where they were sent. That five-second pause kept nagging at me though, because five seconds is not a random number. Five seconds is a default timeout.

Logs scrolling past on a screen

It was DNS. It is always DNS, and I say that having written almost exactly those words about a different outage a year ago, which tells you how well I learn. One of the two internal resolvers had wedged itself: still up, still answering pings, still accepting connections on port 53, but no longer actually returning answers. The clients would try resolver one, wait out the timeout, fall back to resolver two, and succeed. That was the five-second pause. The requests that "failed" were the ones whose total deadline was shorter than the resolver timeout, so they gave up before the fallback could save them.

A healthy port and a healthy service are not the same thing, and a TCP connection that opens tells you nothing about whether the thing behind it is doing its job. I had a check that confirmed the resolver was listening. I did not have a check that asked it to resolve a known name and verify the answer. The difference between those two checks was the difference between a green dashboard and four hours of confusion.

# the check I had
nc -zv resolver1 53

# the check I now have
dig +short +time=2 +tries=1 @resolver1 known-internal.host | grep -q 10.0.

The wedged resolver got restarted and everything snapped back instantly, which is the most unsatisfying kind of fix because it tells you nothing about the cause. I'll dig into why it stopped answering when I'm back properly. For now the monitoring actually queries a name and checks the reply, and the alert that fires says "resolver1 not resolving" instead of "service degraded".

Lesson, again, written down again, in the hope that next year it sticks: monitor the behaviour you depend on, not the port it happens to listen on. Right, back to the mince pies.