it's never dns, until it is

A terminal full of failed connection errors against a dark background

The pager went off at 07:14 with the worst kind of alert: half of everything was failing, the other half was fine, and nothing had been deployed in twelve hours. No release to revert, no config change to point at, no obvious blast radius. Just a scattering of services throwing timeouts and a few that did not, with no pattern I could see from the dashboard.

The first lie the system told me was that it looked like a network problem. The failing calls were timing out, not refusing. Timeouts feel like the network: a packet went somewhere and never came back. So I spent the first ten minutes, which I would like back, staring at our load balancers and convincing myself a link had gone bad somewhere upstream. It had not. The link was fine. The packets were fine. They just had nowhere to go, because the names they were trying to reach no longer resolved.

the first real clue

What broke the spell was running curl against a failing service by IP instead of by name. It worked instantly. By name, it hung for five seconds and then gave up. That five seconds is a tell, because five seconds is a common default resolver timeout (glibc's, for one), and a five-second pause before failure almost always means something is waiting on DNS and not getting an answer.

$ curl -s -o /dev/null -w '%{time_total}\n' http://orders.internal/health
5.012

$ curl -s -o /dev/null -w '%{time_total}\n' http://10.4.12.31/health
0.004
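curl's total time lumps the lookup in with everything else (it can report the lookup separately through its time_namelookup write-out variable). The same split can be made explicit in a few lines. What follows is a minimal sketch, assuming Python and the system resolver; the function name time_lookup and the default port are illustrative.

```python
import socket
import time


def time_lookup(host, port=80):
    """Time name resolution on its own, before any connection is attempted.

    Returns (seconds_spent_resolving, sorted_list_of_addresses).
    The list is empty if the name did not resolve.
    """
    start = time.monotonic()
    try:
        # getaddrinfo goes through the system resolver, so it sees the same
        # resolver configuration as most other programs on the host.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        infos = []
    elapsed = time.monotonic() - start
    addresses = sorted({info[4][0] for info in infos})
    return elapsed, addresses


if __name__ == "__main__":
    for target in ("orders.internal", "10.4.12.31"):
        elapsed, addresses = time_lookup(target)
        print(f"{target}: {elapsed:.3f}s -> {addresses or 'no answer'}")
```

An IP literal never touches a nameserver, so it resolves immediately. If the name takes seconds and the literal does not, the time is going on resolution and not on the network path to the service.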

So it was DNS. Of course it was DNS. The running joke exists because the joke is usually true, and I had wasted ten minutes pretending it might be something more interesting. There is a particular flavour of shame in this, because every engineer who has been doing this for more than a year has the "it's never DNS" mug, metaphorically or otherwise, and yet we all still spend the first ten minutes of every outage assuming it is anything else. The network, the load balancer, a noisy deploy, the moon. Anything but the boring, central, single-point-of-everything name resolution that the entire estate quietly depends on and nobody ever monitors with the seriousness it deserves.

The next question was why some names resolved and some did not. Both kinds of name lived in the same internal zone. Both were served, in theory, by the same resolvers. I dropped to dig and asked directly, and the answers were inconsistent in a way that made my stomach drop.

$ dig +short orders.internal @10.4.0.2
;; connection timed out; no servers could be reached

$ dig +short orders.internal @10.4.0.3
10.4.12.31

A close-up of dig output showing one resolver answering and another timing out

One of our two internal resolvers was answering. The other was not reachable at all. And the reason some services worked and some did not came down to nothing more profound than which resolver each host happened to try first, and whether the failed query fell through to the working one before the client gave up. A coin toss, essentially, dressed up as a partial outage.

This is the bit that makes resolver failures so genuinely nasty to reason about. With two nameservers in resolv.conf, the usual behaviour is to try the first, wait for it to time out, then try the second. If the first one is dead, every single lookup eats the full timeout before falling through to the working server. On a host that happened to list the dead resolver first, that meant five seconds added to every name resolution, which is enough to blow past every downstream timeout we had and present as total failure. On a host that listed the working resolver first, lookups were instant and the service was completely happy. Same outage, same root cause, two hosts behaving as opposites, which is exactly the kind of thing that makes you doubt your own eyes at quarter past seven in the morning.
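The arithmetic is simple enough to model. This is a toy simulation of that sequential fall-through, not a real resolver: it makes a single pass over the list with no retries, and the 4 ms answer time is an illustrative number, not a measurement of our resolvers.

```python
def seconds_until_answer(nameservers, dead, timeout_s=5.0, answer_s=0.004):
    """Walk the nameserver list in order, as a sequential stub resolver does.

    Each dead server costs one full timeout before the next one is tried.
    Returns None if no listed server answers in a single pass.
    """
    waited = 0.0
    for server in nameservers:
        if server in dead:
            waited += timeout_s  # sit out the full timeout, then move on
            continue
        return waited + answer_s
    return None


dead = {"10.4.0.2"}
unlucky = seconds_until_answer(["10.4.0.2", "10.4.0.3"], dead)
lucky = seconds_until_answer(["10.4.0.3", "10.4.0.2"], dead)
print(f"dead resolver listed first:    {unlucky:.3f}s")  # 5.004s
print(f"working resolver listed first: {lucky:.3f}s")    # 0.004s
```

The only difference between the two hosts is the order of two lines in a file, and the penalty scales directly with timeout_s.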

the actual cause

The dead resolver had been rebooted overnight by an unattended upgrade. Nothing wrong with that on its own. The problem was that the resolver service did not come back up cleanly after the reboot, because a config file it depended on had been edited a fortnight earlier by someone (me, it turned out, once I checked the blame) and left with a syntax error that only mattered on a cold start. It had been running happily on its old config the whole time. The reboot was the first thing to make it re-read the file, and the file no longer parsed, so the service refused to start and the host sat there with nothing answering on port 53.

That is the part of this I keep turning over. The change that broke us was two weeks old. It passed every check we had, because every check we had ran against the already-running process, which was serving fine from memory. Nothing tests the cold-start path until something cold-starts. The bug was latent for a fortnight, armed by an upgrade nobody watched, and triggered at 07:14 on a day I had plans.
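The cold-start path can be exercised without a reboot by having a fresh process parse the file on disk. Here is a minimal sketch of that kind of check, assuming the resolver software ships its own config checker that exits non-zero on a parse error. The function name config_parses is illustrative, and which checker to run depends on the resolver in use.

```python
import subprocess
import sys
from typing import Sequence


def config_parses(check_cmd: Sequence[str]) -> bool:
    """Run the resolver's own config checker in a fresh process.

    Returns True only if the checker exits 0. A parse error, or a missing
    checker binary, returns False so the calling job can fail.
    """
    try:
        result = subprocess.run(list(check_cmd), capture_output=True, text=True)
    except FileNotFoundError:
        print(f"checker not found: {check_cmd[0]}", file=sys.stderr)
        return False
    if result.returncode != 0:
        # Surface the checker's own complaint in the job log.
        print(result.stderr or result.stdout, file=sys.stderr)
    return result.returncode == 0
```

For BIND the call might be config_parses(["named-checkconf", "/etc/bind/named.conf"]), and Unbound has unbound-checkconf for the same job. The path there is an assumption. What matters is that the check reads the file, not the memory of a process that started a fortnight ago.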

The immediate fix was dull: correct the syntax, start the service, watch both resolvers answer. Five minutes once I knew. The interesting fix took longer.

We added a healthcheck that actually queries the resolver over port 53 and asserts a real answer, rather than checking that the process exists, because a process can exist and serve nothing. We made the config a tested artefact, so a parse error fails in CI rather than on the next reboot. And we set the client-side resolver timeout down from five seconds to something that fails fast and retries the other server, so a single dead resolver degrades latency by milliseconds instead of taking out half the estate while clients sit on their hands for five seconds at a time.
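A healthcheck of that first kind does not need much. This is a minimal sketch, assuming Python and only the standard library: it hand-builds one A-record query in DNS wire format, sends it over UDP, and passes only if the reply is a NOERROR response carrying at least one answer. The function names, the one-second timeout, and the choice of name to probe are illustrative.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary id, echoed back by the server in its reply


def build_query(name: str) -> bytes:
    """Encode a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: id, flags (recursion desired), QDCOUNT=1, then three zero counts.
    header = struct.pack("!HHHHHH", QUERY_ID, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # Root label terminator, then QTYPE=1 (A) and QCLASS=1 (IN).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def resolver_is_healthy(server: str, name: str, port: int = 53,
                        timeout_s: float = 1.0) -> bool:
    """True only if the server answers the query with at least one record."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        try:
            sock.sendto(build_query(name), (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Timeouts and other socket errors are all subclasses of OSError.
            return False
    if len(reply) < 12:
        return False  # too short to contain a DNS header
    reply_id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", reply[:12])
    is_response = bool(flags & 0x8000)
    rcode = flags & 0x000F  # 0 means NOERROR
    return reply_id == QUERY_ID and is_response and rcode == 0 and ancount > 0


if __name__ == "__main__":
    for server in ("10.4.0.2", "10.4.0.3"):
        state = "ok" if resolver_is_healthy(server, "orders.internal") else "FAILING"
        print(f"{server}: {state}")
```

It deliberately stops short of parsing the answer records. That is enough to tell "the process exists" apart from "the process serves answers", which is the distinction that mattered here.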

None of that is clever. It is just the slow accumulation of "the way this broke, make it impossible to break that way again." The lesson I keep relearning is that the outage is rarely the change. The outage is the moment a change you forgot about meets a condition you never tested, usually a restart, usually at an inconvenient hour. And when half of something is down and the other half is fine, suspect whatever gets picked more or less at random, per host or per request, before you suspect anything subtle. It is nearly always DNS.