Ramblings of an aging IT geek
debugging

it was never the network, it was always dns

An outage that looked like everything was on fire turned out to be one resolver quietly failing, as it always does.

A terminal showing an error during debugging

It started, as these things do, with everything appearing to break at once. Internal services timing out, deploys hanging, a dashboard turning the colour of a fire alarm. The kind of morning where five people ping you within thirty seconds and each has a different theory.

The instinct is to believe the scary story: a region is down, a dependency has fallen over, something fundamental has shifted under us. The reality, almost always, is smaller and more annoying. Connections weren't being refused; they were timing out, and timeouts where you'd expect refusals are the smell of name resolution gone wrong. A dig against the service name took five seconds and came back empty.
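The refused-versus-timed-out distinction is easy to check by timing the lookup and the connect as separate steps. What follows is a minimal sketch using only the Python standard library, not a record of what was actually run that morning. The function name `diagnose`, its injectable `resolve` and `connect` parameters, and the host `svc.internal.example` are all illustrative assumptions.

```python
import socket
import time


def diagnose(host, port, connect_timeout=2.0,
             resolve=socket.getaddrinfo, connect=socket.create_connection):
    """Return (stage, outcome, seconds) for where a connection attempt ends.

    `resolve` and `connect` default to the real socket functions and are
    only parameters so the logic can be exercised without a network.
    """
    # Step 1: name resolution. getaddrinfo has no timeout argument of its
    # own, so we simply measure how long the system resolver takes.
    start = time.monotonic()
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return ("resolve", "failed", time.monotonic() - start)
    if not infos:
        return ("resolve", "empty", time.monotonic() - start)

    # Step 2: TCP connect to the first address returned.
    address = infos[0][4][:2]  # (ip, port), also valid for IPv6 sockaddrs
    start = time.monotonic()
    try:
        connect(address, timeout=connect_timeout).close()
    except ConnectionRefusedError:
        # Something answered and said no: the name resolved fine.
        return ("connect", "refused", time.monotonic() - start)
    except (socket.timeout, TimeoutError):
        return ("connect", "timeout", time.monotonic() - start)
    except OSError:
        return ("connect", "error", time.monotonic() - start)
    return ("connect", "ok", time.monotonic() - start)


if __name__ == "__main__":
    # Hypothetical service name; substitute one from your own estate.
    print(diagnose("svc.internal.example", 443))
```

A slow or failed "resolve" stage points at DNS before any packet has reached the service; a fast "refused" in the "connect" stage means the name was found and the problem is elsewhere.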

One resolver in the cluster had wedged. Everything pointed at it through a single config value nobody had thought about in two years, so when it stopped answering, the whole estate slowly seized up as caches expired. Not down, just unable to find anything, which from the outside looks identical to down.

We failed traffic over to the secondary resolver, the timeouts cleared within a minute, and the fire alarm went green again. The fix afterwards was the boring, correct one: more than one resolver in the path, sensible health checks, and a much shorter timeout so a dead resolver gets noticed in seconds rather than minutes. It is always DNS. I keep learning this, and the lesson keeps being true.
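For the curious, here is a rough sketch of what "health checks with a short timeout across more than one resolver" can look like, again with only the Python standard library. It hand-builds a single DNS A-record query, sends it over UDP, and gives up after one second. The names `build_query`, `resolver_is_healthy` and `pick_resolver`, the probe name `health.example`, and the resolver addresses are assumptions made for illustration; real setups usually lean on the resolver library or load balancer rather than code like this.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Encode a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Each label is length-prefixed; a zero byte terminates the name.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # QTYPE=A (1), QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def resolver_is_healthy(resolver_ip, name, timeout=1.0, query_id=0x1234):
    """True if the resolver replies within `timeout` seconds with RCODE 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)  # short, so a wedged resolver is caught fast
        try:
            sock.sendto(build_query(name, query_id), (resolver_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable hosts
            return False
    if len(reply) < 12:
        return False
    reply_id, flags = struct.unpack("!HH", reply[:4])
    is_response = bool(flags & 0x8000)
    rcode = flags & 0x000F
    return reply_id == query_id and is_response and rcode == 0


def pick_resolver(resolvers, name, check=resolver_is_healthy):
    """Return the first resolver that passes the check, or None."""
    for ip in resolvers:
        if check(ip, name):
            return ip
    return None


if __name__ == "__main__":
    # Hypothetical addresses (documentation range) and probe name.
    print(pick_resolver(["192.0.2.53", "192.0.2.54"], "health.example"))
```

The fixed query ID keeps the sketch short; a real probe would randomise it. The point is the shape: a probe that fails in about a second, and a list with more than one entry in it.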