Ramblings of an aging IT geek
debugging

it was dns, it is always dns

A morning where nothing worked, every service looked down, and the actual fault was a single resolver quietly timing out.

The morning started with everything broken at once, which is the most misleading symptom there is. The app couldn't reach its database. The deploy pipeline was failing to pull images. Monitoring was screaming about half a dozen unrelated services. When everything breaks together it almost never means everything broke. It means one thing under all of it broke.

That one thing was DNS. A resolver the whole fleet pointed at had started timing out rather than answering, so every lookup sat there for five seconds and then failed. Nothing was actually down. The database was up, the registry was up, the services were up. They just couldn't find each other by name, and a name that takes five seconds to fail to resolve looks, to everything above it, exactly like a dead service.

The tell, once I stopped staring at the application errors and went down a layer, was dig. A query against the misbehaving resolver hung and timed out; the same query against a different one came back instantly. That's the whole diagnosis. Point the hosts at a healthy resolver, watch the cascade of unrelated failures clear in one go, and feel slightly foolish for the twenty minutes spent suspecting six things that were all fine.
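That side-by-side comparison is easy to script. What follows is a minimal sketch, not what was typed that morning: it assumes dig is installed and on the PATH, and the function names, the example.com lookup name and any resolver addresses you feed it are placeholders. It runs one dig attempt per resolver with a five-second limit and reports whether an answer came back and how long it took.

```python
import subprocess
import time


def dig_command(resolver: str, name: str, timeout_s: int = 5) -> list:
    """Build the argv for a single-attempt dig query against one resolver."""
    # +time sets the per-query timeout, +tries=1 disables retries so the
    # elapsed time reflects one attempt, +short trims the output.
    return ["dig", f"@{resolver}", name, f"+time={timeout_s}", "+tries=1", "+short"]


def time_dig(resolver: str, name: str, timeout_s: int = 5):
    """Run dig once and return (answered, elapsed_seconds)."""
    start = time.monotonic()
    try:
        result = subprocess.run(
            dig_command(resolver, name, timeout_s),
            capture_output=True,
            text=True,
            timeout=timeout_s + 2,  # safety net in case dig itself hangs
        )
        # dig exits non-zero when it gets no reply from the server.
        answered = result.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # Either dig outlived the safety net or it is not installed.
        answered = False
    return answered, time.monotonic() - start


def compare_resolvers(resolvers, name="example.com"):
    """Print one line per resolver so a slow or silent one stands out."""
    for resolver in resolvers:
        answered, elapsed = time_dig(resolver, name)
        status = "answered" if answered else "no answer"
        print(f"{resolver}: {status} after {elapsed:.2f}s")
```

Calling compare_resolvers with the suspect resolver and a known-good one should show the pattern described above: one line reporting no answer after roughly the full timeout, the other answering almost immediately.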

The fix afterwards was redundancy I should already have had: a second resolver in the config so a single sick one doesn't take the lot down, and a check that actually queries DNS rather than just pinging the box. The lesson is the old one, learned again. When a lot of things break at the same instant, don't start at the top. Drop a layer, and a layer, until you find the single boring thing holding everything else up. It is usually DNS. It is always DNS.
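A check that queries DNS rather than pinging the box might look something like the sketch below. It is an illustration under assumptions, not the check that was actually deployed: the function names, the query ID, the two-second timeout and the example.com probe name are all invented here, and it speaks IPv4 UDP only. It sends a real A-record query to each resolver and counts the resolver as healthy only if a matching NOERROR response comes back in time.

```python
import socket
import struct

QUERY_ID = 0x4242  # arbitrary ID, used to match a reply to our query


def build_query(name: str) -> bytes:
    """Build a minimal DNS query packet for the A record of `name`."""
    # Header: ID, flags (0x0100 = recursion desired), one question, no records.
    header = struct.pack("!HHHHHH", QUERY_ID, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 1, 1)


def reply_is_healthy(reply: bytes) -> bool:
    """True if `reply` is a NOERROR response carrying our query ID."""
    if len(reply) < 12:  # shorter than a DNS header
        return False
    reply_id, flags = struct.unpack("!HH", reply[:4])
    is_response = bool(flags & 0x8000)  # QR bit set means "response"
    rcode = flags & 0x000F              # 0 means NOERROR
    return reply_id == QUERY_ID and is_response and rcode == 0


def resolver_answers(resolver: str, name: str, timeout: float = 2.0) -> bool:
    """Send one UDP query to `resolver` and validate whatever comes back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (resolver, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Timeouts and unreachable hosts both land here.
            return False
    return reply_is_healthy(reply)


def check_resolvers(resolvers, name="example.com", probe=resolver_answers):
    """Probe every configured resolver and return {resolver: healthy}."""
    return {resolver: probe(resolver, name) for resolver in resolvers}
```

The point of checking every resolver in the list, rather than stopping at the first one that works, is that a second resolver hides a sick first one: lookups still succeed, just slowly, and nobody notices until the healthy one fails too. Note that the probe name has to exist, because this sketch treats any non-NOERROR reply, including NXDOMAIN, as unhealthy.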