Ramblings of an aging IT geek
networking

the day my whole house lost dns, and it was me

A homelab DNS outage I caused myself by being clever with a single resolver, and the small redundancy that fixed it for good.

It started, as these things do, with someone in the house saying "the internet's broken." It was not the internet. The internet was fine. The thing that was broken was the small Pi sitting under the stairs running Pi-hole, and it was broken because of me.

Here is the lesson up front, so you can leave early if you like: do not run your entire home network's DNS on a single box, and then take that box down to "just quickly" reflash the SD card. The "just quickly" is load-bearing, and it is a lie.

what actually happened

I had a single Pi-hole instance. It did DNS for the whole house, handed out over DHCP from the router as the one and only resolver. That worked beautifully for about a year, which is exactly the problem. It worked so well that I stopped thinking about it, and a thing you have stopped thinking about is a thing you will eventually break carelessly.

That afternoon I wanted to move Pi-hole to a slightly beefier Pi 4 I had freed up. Sensible. The right way to do that is to stand the new one up, get it serving, point DHCP at both, and only then retire the old one. The way I did it was to pull the SD card out of the running Pi to clone it, because I reasoned it would only be a few minutes.

It was not a few minutes. Cloning a 32 GB card to an image and writing it back onto a new card took the better part of half an hour, and during that half hour every device in the house was pointed at a resolver that no longer existed. Nothing could resolve a name. To the people in the house, every website, every app, the TV, the lot, was simply down. The router was up, the WAN was up, and DHCP was even still handing out the dead resolver's address with great confidence.

why it looked like a total outage

This is the part that makes DNS failures so disorientating. Nothing returns an error that says "DNS is down." You get timeouts, or "server not found," or apps that just spin. Connectivity is fine; the names are gone. I sat there with a working SSH session to the router by IP, pinging 1.1.1.1 successfully, while the rest of the house was convinced the line was dead. Everything that was wrong was downstream of one tiny service I had personally switched off.
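
That split between "connectivity is fine" and "the names are gone" is easy to test for directly. What follows is a minimal Python sketch of the idea, not something from the original setup: it opens a TCP connection to a literal IP address (no lookup involved) and separately asks the system resolver for a name. The function names are made up for illustration, and the choice of 1.1.1.1 on port 53 and example.com as probes is an assumption.

```python
import socket


def can_reach_ip(host: str = "1.1.1.1", port: int = 53, timeout: float = 2.0) -> bool:
    """Connect to a literal IP address, so no name lookup is involved."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def can_resolve(name: str = "example.com") -> bool:
    """Ask whatever resolver the OS is configured with (the one DHCP handed out).

    Note: getaddrinfo has no timeout argument, so against a dead resolver
    this call can block for a while before it fails.
    """
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


def diagnose(ip_ok: bool, dns_ok: bool) -> str:
    """Turn the two probe results into a plain-English verdict."""
    if ip_ok and dns_ok:
        return "all fine"
    if ip_ok:
        return "DNS is down, connectivity is fine"
    if dns_ok:
        return "names resolve but the test IP is unreachable"
    return "no connectivity"
```

Calling diagnose(can_reach_ip(), can_resolve()) on a client during an outage like this one should come back with the second verdict, although a local cache on the client (systemd-resolved, for example) can make a name appear to resolve for a while after the resolver has gone.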

I could have got everyone moving again in thirty seconds by setting a device's DNS manually to a public resolver. I didn't, because I was head-down in the migration and didn't clock how total the impact was until someone came to find me. That is its own lesson: when you take a shared service down, the blast radius is the whole household, not just your terminal.

the fix, which was embarrassingly simple

The actual repair was to finish the migration and move on. The real fix, the one that means this cannot happen to me the same way again, was to stop having a single point of failure for something this fundamental.

I now run two resolvers. The new Pi 4 runs Pi-hole as primary. A second, much smaller Pi Zero runs a plain unbound instance as a bare recursive resolver with no fancy blocking. DHCP hands out both:

# dnsmasq.conf on the router
dhcp-option=6,192.168.1.2,192.168.1.3

The first address is the Pi-hole. The second is the boring fallback. If the Pi-hole is down for maintenance, or because I have pulled its card out again like a fool, clients fail over to the second resolver. They lose ad blocking for a few minutes. They do not lose the ability to load a web page.

A couple of details that are easy to get wrong here. Clients do not treat the two addresses as a strict primary and secondary, and they do not round-robin between them in a clean way; the behaviour when the first is unreachable varies by OS and is generally "try the first, wait, give up, try the second." Some clients will also use the second resolver now and then even when the first is healthy, so a little unblocked traffic can leak through. So the failover is not instant and not pretty, but it is the difference between a brief stutter and a total blackout. If you want failover that is fast and invisible to clients, you need something cleverer like keepalived and a shared VIP, which for a home network is firmly over-engineering.
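
To make the "try the first, wait, give up, try the second" pattern concrete, here is a rough Python sketch of a client doing exactly that. It is an illustration of the general behaviour, not a model of any particular operating system's resolver: the two-second timeout is an assumption, the function names are invented, and the hand-built query only covers a plain A-record lookup.

```python
import socket
import struct
from typing import Callable, Optional, Sequence, Tuple


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet asking for an A record."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Each label is length-prefixed; a zero byte ends the name.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.strip(".").split(".")
    )
    # QTYPE = A (1), QCLASS = IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def ask(server: str, name: str, timeout: float) -> Optional[bytes]:
    """Send one UDP query; return the raw reply, or None on timeout or error.

    A real resolver would also check the reply's id and source address.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            reply, _ = sock.recvfrom(512)
            return reply
        except OSError:  # includes socket timeouts
            return None


def resolve_with_failover(
    name: str,
    servers: Sequence[str],
    timeout: float = 2.0,
    ask_fn: Callable[[str, str, float], Optional[bytes]] = ask,
) -> Optional[Tuple[str, bytes]]:
    """Try each server in order; return (server, reply) from the first to answer."""
    for server in servers:
        reply = ask_fn(server, name, timeout)  # blocks up to `timeout` if dead
        if reply is not None:
            return server, reply
    return None
```

With the first resolver down, resolve_with_failover("example.com", ["192.168.1.2", "192.168.1.3"]) sits through the full timeout before it ever touches the second address, and it pays that cost again on every lookup. That is the stutter: nothing is broken, everything is just slower until the first resolver comes back.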

I also added a tiny uptime check. A cron job on the router queries both resolvers and sends me a Telegram message if either stops answering on port 53. Nothing clever, just dig @192.168.1.2 example.com +short (and the same against 192.168.1.3) and a check that it returned an address.
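
For anyone wanting to build something similar, here is one way such a check could look in Python. This is a sketch under assumptions, not the script actually running on the router: it shells out to the same dig command, treats any line of output that parses as an IP address as success, and posts to the Telegram Bot API's sendMessage method. BOT_TOKEN and CHAT_ID are placeholders, the function names are invented, and the +time and +tries values are arbitrary choices.

```python
import ipaddress
import subprocess
import urllib.parse
import urllib.request
from typing import Callable, List, Optional, Sequence

RESOLVERS = ["192.168.1.2", "192.168.1.3"]  # the two addresses DHCP hands out
BOT_TOKEN = "REPLACE_ME"  # placeholder: your own bot token
CHAT_ID = "REPLACE_ME"    # placeholder: the chat to notify


def first_address(dig_output: str) -> Optional[str]:
    """Return the first line of `dig +short` output that is an IP, else None."""
    for line in dig_output.splitlines():
        try:
            return str(ipaddress.ip_address(line.strip()))
        except ValueError:
            continue  # CNAME targets, dig's timeout message, blank lines
    return None


def resolver_answers(server: str, name: str = "example.com") -> bool:
    """Run dig against one resolver and report whether it returned an address."""
    try:
        result = subprocess.run(
            ["dig", f"@{server}", name, "+short", "+time=2", "+tries=1"],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # dig missing, or it hung
    return first_address(result.stdout) is not None


def send_telegram(text: str) -> None:
    """POST a message via the Telegram Bot API's sendMessage method."""
    url = f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": CHAT_ID, "text": text}).encode()
    urllib.request.urlopen(url, data=data, timeout=10).close()


def check_all(
    resolvers: Sequence[str] = RESOLVERS,
    probe: Callable[[str], bool] = resolver_answers,
    alert: Callable[[str], None] = send_telegram,
) -> List[str]:
    """Probe every resolver, alert once per dead one, return the dead ones."""
    down = [r for r in resolvers if not probe(r)]
    for server in down:
        alert(f"DNS resolver {server} is not answering")
    return down
```

One catch worth thinking about: the alert itself needs to resolve api.telegram.org, so the machine running the check should not depend solely on the two resolvers it is watching. This sketch assumes it has some other upstream to fall back on.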

the actual takeaway

The technical lesson is the obvious one about redundancy, and you already knew it. The honest lesson is about how the longest-running, most reliable bit of your setup is the one most likely to bite you, precisely because you have stopped respecting it. A service that has run flawlessly for a year is not safe. It is just a thing you have learned to take for granted, which is a different thing entirely.

DNS is infrastructure. I was treating it like a hobby project, because that is how it started, and the gap between those two things is exactly where the outage lived. Two resolvers and a thirty-line health check later, I treat it like infrastructure. The internet, I am pleased to report, is no longer broken.