The Day It Was, In Fact, DNS

A bug on a terminal

It is always DNS. We say it as a joke, right up until the morning it is genuinely, comprehensively DNS, and then it stops being funny and becomes the only thing in the world. This was one of those mornings.

The symptoms made no sense as a single fault. The web tier was throwing intermittent connection timeouts to the database. A background worker was failing to reach an external API, but only sometimes. Health checks were flapping green to red and back with no pattern I could see. Three different teams were each convinced they had broken their own thing, because each of them had, in isolation, a plausible local story. None of them had touched anything.

The tell, when I finally saw it, was the word "intermittent" attached to everything at once. Unrelated systems do not all degrade in the same shape by coincidence. When the failure mode is shared, the cause is shared, and it lives somewhere underneath all of them. The thing underneath all of them was name resolution.

Lines of code on a screen

The proof was almost insultingly simple. Connecting by IP worked every time. Connecting by hostname worked roughly half the time. A quick loop of dig against our internal resolver showed it: one of the two upstream resolvers in the pool was answering slowly, then timing out, then occasionally returning SERVFAIL. Round-robin meant every other query, give or take, hit the sick one. Half your lookups failing looks, from the application's point of view, exactly like a flaky network, a flaky database, and a flaky API all at once. Because it is all of them, sort of, and none of them.

for i in $(seq 1 20); do
  dig +short +time=2 db.internal @10.0.0.53 || echo "FAIL"
done

Ten clean answers and ten timeouts in twenty tries. There it was.

The root cause was dull, as the good ones are. One resolver instance had got itself into a bad state after a memory pressure event overnight, was not fully dead so health checks still passed, but was answering a fraction of queries with garbage. Half-dead is so much worse than dead. A dead resolver gets pulled from rotation. A half-dead one stays in, poisoning every other query, while every team downstream chases ghosts in their own logs.

We pulled the bad instance, queries went back to a hundred percent, and all twelve "separate" problems healed simultaneously, which is the single most satisfying thing in this job and also the most humbling. Twelve incidents, one cause.

The lesson I keep filing away is to trust the shape of a failure more than its location. When everything breaks a little, at the same time, in the same intermittent way, stop debugging the applications. They are telling you the truth: they are all fine, and something they all depend on is not. Look down the stack, not across it. And yes, check DNS first. We have a name for ignoring that advice. It is "the rest of the morning".