Ramblings of an aging IT geek
debugging

it's never dns, until it's the only resolver in the house

A single resolver going down took the whole house offline, and the failure was as much about my topology as about the box that died.

Everything in the house went dark at once on Sunday morning, which is unusual enough to be interesting rather than just annoying. Not slow. Not flaky. Dark. The TV couldn't reach anything, my phone gave up, the work laptop sulked. But the moment I pinged a raw IP address, the packets flew. The internet was fine. The names were gone.

You learn to recognise the shape of a DNS failure the way you recognise a particular cough. Everything that uses a hostname fails, everything that uses an address works, and it all happens simultaneously across completely unrelated devices. There's no single application at fault because the thing they share isn't an application. It's the resolver.
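That "names fail, addresses work" pattern is simple enough to check mechanically. What follows is a minimal sketch, not a polished tool. It swaps the ping for a TCP connection, because ICMP usually needs elevated privileges from a script. The hostname `example.com` and the address `1.1.1.1` in the usage note are placeholder assumptions, as are the function names.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """True if the system resolver returns any address for hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def can_reach(ip: str, port: int = 443, timeout: float = 2.0) -> bool:
    """True if a TCP connection to a literal IP succeeds (no DNS involved)."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def diagnose(name_ok: bool, ip_ok: bool) -> str:
    """Classify the failure from the two observations."""
    if name_ok and ip_ok:
        return "ok"
    if ip_ok:
        return "dns"      # addresses work, names do not: suspect the resolver
    if name_ok:
        return "path"     # names resolve, but the address is unreachable
    return "offline"      # nothing works: look at the link or the router first
```

Calling `diagnose(can_resolve("example.com"), can_reach("1.1.1.1"))` on a machine in the state described above would be expected to come back as `"dns"`. A cached answer can make `can_resolve` look healthier than the resolver really is, so it is worth trying a name the machine has not looked up recently.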

I'd built the network so that every client got exactly one DNS server via DHCP: the little Pi-hole box I'm rather fond of. It does ad-blocking, it does local names, it logs queries, it's genuinely useful. It is also, it turns out, a single point of failure I'd cheerfully constructed with my own hands. The SD card had finally given up the ghost overnight, the service was dead, and because every device in the house had been told "your one and only resolver is at this address", every device in the house had nowhere else to ask.

The immediate fix was embarrassingly quick once I'd stopped panicking. I pointed myself at a public resolver by hand, confirmed names came back, then reflashed the Pi-hole onto a fresh card from the config backup I had (thankfully) been keeping. Twenty minutes, most of it waiting for the card to write.

The real fix was structural. Handing out a single resolver is asking for exactly this. So DHCP now offers two: the Pi-hole as primary, and a second resolver as the fallback. The trade-off is that clients don't reliably treat that list as a strict order, so even when the Pi-hole is up, some queries still leak to the secondary and dodge the filtering, which slightly annoys the purist in me. I'll take a little ad-blocking leakage over the entire house falling over because one SD card wore out.
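The idea behind offering two resolvers can be sketched as "ask the first, and move on if it doesn't answer". The Python below is an idealised illustration of that ordering, not a description of how any particular operating system's resolver behaves. Real clients vary and may rotate between servers or query them in parallel. The addresses and the `ask` callable are hypothetical stand-ins.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical addresses: a Pi-hole on the LAN and a placeholder secondary.
PRIMARY = "192.168.1.2"
SECONDARY = "192.0.2.53"


def resolve_with_fallback(
    name: str,
    resolvers: Sequence[str],
    ask: Callable[[str, str], str],
) -> Optional[Tuple[str, str]]:
    """Try each resolver in order; return (server, answer) from the first
    one that responds, or None if every resolver fails."""
    for server in resolvers:
        try:
            return server, ask(server, name)
        except OSError:
            continue  # this server is down or timed out, try the next one
    return None


def demo_ask(server: str, name: str) -> str:
    """Stand-in for a real lookup: the primary is dead, the secondary answers."""
    if server == PRIMARY:
        raise TimeoutError("no reply from primary")
    return "203.0.113.10"  # documentation-range address, not a real answer


result = resolve_with_fallback("tv.example", [PRIMARY, SECONDARY], demo_ask)
```

With a one-element list, a dead primary leaves `resolve_with_fallback` returning `None`, which is the outage in miniature. With two, the lookup lands on the secondary instead.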

The other change was the boring, valuable one: monitoring. There's now a tiny check that resolves a known name against the Pi-hole every minute and shouts at me if it stops answering. The outage on Sunday lasted long enough for three separate people to find me and ask if the wifi was broken. An alert within the first minute would have let me fix it before anyone noticed. The card was always going to die eventually. The lesson wasn't about the card. It was that I'd built a topology where one cheap component dying meant total darkness, and called it a feature.
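A check like that doesn't need much. The sketch below is one way it could look in plain Python, and not necessarily how the check described above is built. It hand-builds a minimal DNS query and sends it straight to one server over UDP, so the answer can't come from some other resolver or a local cache. All the function names are illustrative assumptions, as is the Pi-hole address in the usage note.

```python
import random
import socket
import struct
import time
from typing import Callable


def build_query(name: str, query_id: int) -> bytes:
    """Encode a minimal DNS query for the A record of name."""
    # Header: id, flags (0x0100 = recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels, terminated by a zero byte.
    labels = name.rstrip(".").encode("ascii").split(b".")
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def reply_ok(reply: bytes, query_id: int) -> bool:
    """True if reply is a response to our query with RCODE 0 (no error)."""
    if len(reply) < 12:
        return False
    reply_id, flags = struct.unpack("!HH", reply[:4])
    return reply_id == query_id and bool(flags & 0x8000) and (flags & 0x000F) == 0


def resolver_answers(server: str, name: str, port: int = 53,
                     timeout: float = 2.0) -> bool:
    """Send one query directly to server and report whether it answered cleanly."""
    query_id = random.getrandbits(16)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name, query_id), (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable hosts
            return False
    return reply_ok(reply, query_id)


def watch(server: str, name: str, interval: float = 60.0,
          alert: Callable[[str], None] = print) -> None:
    """Check the resolver every interval seconds and call alert on failure."""
    while True:
        if not resolver_answers(server, name):
            alert(f"resolver {server} did not answer a query for {name}")
        time.sleep(interval)
```

Something like `watch("192.168.1.2", "example.com")` would run the loop, with the address swapped for wherever the Pi-hole actually lives and `alert` replaced by whatever does the shouting. One design point matters more than the code: the check has to run somewhere other than the Pi-hole itself, or it dies along with the thing it's watching.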

There's a smug aphorism in ops circles that it's always DNS, trotted out whenever anything breaks. The annoying thing is how often it's true, and it's rarely DNS being broken in itself. It's DNS being the chokepoint that everything else quietly depends on, the one service whose absence isn't a degraded experience but a total one. Build it like the load-bearing thing it is, with a fallback and an alarm, and it stops being the punchline.