Ramblings of an aging IT geek

The Day I Took Down My Whole Network With One DNS Change

A homelab DNS outage caused entirely by me, the circular dependency that made it un-fixable from inside, and the boring resilience I should have had from the start.

It started, as these things do, with a tidy-up. I had two DNS resolvers running on the homelab, an old setup and a newer one I had been meaning to consolidate onto. On a Sunday afternoon, feeling organised, I decided to retire the old one and point everything at the new resolver. Half an hour later the entire house had no internet, the resolver I needed to fix was unreachable by name, and I was logged into a router by IP address trying to remember how I had got into this mess.

The mess was a circular dependency, and I had built it with my own hands.

What I actually did

The new resolver ran in a container on one of my servers. Sensible enough. I updated the DHCP server to hand out its address as the only DNS resolver on the network, removed the old one, and felt productive. Then I rebooted the host to apply an unrelated update, because why not do two things at once on a quiet afternoon.

Here is the part I had not thought through. The container host pulled its own configuration, including a couple of image references and a registry login, using hostnames. Those hostnames resolved through DNS. Which resolver did it use? The one running in a container on itself. Which was not up yet, because the host was still booting and the container had not started.

So the host could not start the container, because starting the container needed DNS, and DNS was the container. A perfect little ouroboros, and I had pointed every other device on the network at the same dead address.

The cruel part is that none of this was visible while everything was running. The old resolver had been answering quietly in the background, masking the dependency. The container had been pulled long ago and cached, so previous restarts found the image locally and never needed to reach the registry. It was only the combination of retiring the fallback and rebooting the host, on the same afternoon, that exposed a loop which had technically existed for months. The system had been one cold boot away from this the whole time, and I had no idea.

The scramble

The symptoms were total. Phones showed connected wifi with no internet. The TV spun forever. My partner appeared in the doorway with the particular expression that means the wifi has gone again and she suspects, correctly, that I am responsible.

The first instinct, to SSH into the resolver host and check the container, failed immediately, because I tried to SSH by hostname and there was no DNS to resolve it. That is the moment the scale of it lands. You cannot use your tools because your tools rely on the thing that is down.

I got in by IP eventually, which meant digging through the router's lease table to find the host, because I had not memorised it. Once on the box I could see the container had failed to start with an image pull error, the host having tried to reach the registry by name during boot and got nothing back. I pointed the host's own /etc/resolv.conf at a public resolver, restarted the container service, and within a minute name resolution came back for everything. The whole outage was about twenty-five minutes, most of it spent finding an IP address I should have known.
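The lease-table hunt is avoidable with a plain text note kept on the laptop itself. What follows is a minimal sketch, not what I had at the time: the file name critical-ips.txt, its "name address" layout and the host names in it are all assumptions made up for illustration.

```shell
#!/bin/sh
# ip_for NAME [FILE]
# Prints the address recorded for NAME in a plain "name address" notes file
# and exits non-zero if NAME is not listed.
# The default path below is hypothetical; keep the file wherever you can
# read it with the network down.
ip_for() {
    awk -v want="$1" '
        $1 == want { print $2; found = 1; exit }
        END { exit !found }
    ' "${2:-$HOME/critical-ips.txt}"
}
```

With a line such as "resolver-host 192.168.1.10" in the file (an invented entry), ssh "$(ip_for resolver-host)" gets you onto the box without a single DNS query.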

Worth saying: the diagnosis took longer than the fix. The actual repair was two lines in a config file and a service restart. The expensive part was the dawning realisation of why it was broken, sat at a terminal trying to ping things by name and getting nothing, slowly assembling the loop in my head. That is the signature of a circular dependency outage. The fix is trivial once you see the shape of it, and seeing the shape of it is the whole job.

A quick way to confirm what is actually happening, once you are on the box, is to ask the resolver directly rather than relying on whatever the system is configured to use:

# does name resolution work at all, via a known-good upstream?
dig @1.1.1.1 example.com +short
# and via the configured resolver?
dig example.com +short

When the first works and the second times out, you know the upstream is fine and your own resolution path is the problem. That single distinction would have shortened my twenty-five minutes considerably, had I reached for it sooner instead of poking at SSH.
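The same comparison can be wrapped in a small script so there is nothing to remember under pressure. This is a sketch under assumptions: the function names are mine, 1.1.1.1 and example.com are just the defaults from the commands above, and the two-second timeout is an arbitrary choice.

```shell
#!/bin/sh
# Sketch: compare a known-good upstream against the configured resolver.
# UPSTREAM and NAME are illustrative defaults; override them as needed.
UPSTREAM="${UPSTREAM:-1.1.1.1}"
NAME="${NAME:-example.com}"

# resolves SERVER NAME
# Succeeds only if dig exits cleanly and prints at least one answer line.
# An empty SERVER means "use whatever the system is configured with".
resolves() {
    out=$(dig ${1:+"@$1"} "$2" +short +time=2 +tries=1 2>/dev/null) || return 1
    # Lines starting with ";" are dig diagnostics, not answers.
    printf '%s\n' "$out" | grep -v '^;' | grep -q .
}

# diagnose UPSTREAM_STATUS LOCAL_STATUS (0 means that path resolved)
diagnose() {
    if [ "$1" -eq 0 ] && [ "$2" -eq 0 ]; then
        echo "both paths resolve: DNS is probably not the problem"
    elif [ "$1" -eq 0 ]; then
        echo "upstream answers, configured resolver does not: fix the local resolution path"
    elif [ "$2" -eq 0 ]; then
        echo "configured resolver answers, direct upstream query does not: check outbound DNS"
    else
        echo "nothing resolves: check basic connectivity before blaming DNS"
    fi
}

# check_dns: run both lookups and print a one-line verdict.
check_dns() {
    upstream_status=0
    resolves "$UPSTREAM" "$NAME" || upstream_status=1
    local_status=0
    resolves "" "$NAME" || local_status=1
    diagnose "$upstream_status" "$local_status"
}
```

Source the file and run check_dns. During my outage it would have printed the second message, which is the whole diagnosis in one line.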

What I should have had

The fix is embarrassingly standard, which is the worst kind of fix, because it means I knew better.

A network should never depend on a single resolver, least of all one that depends on the network to start. The corrections, in the order I made them afterwards:

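  • The DHCP server now hands out two resolvers, the homelab one and a public fallback. If mine dies, devices fail over to the other after a short timeout and most people never notice. The trade-off is that clients are not obliged to prefer the first entry, so the public resolver will sometimes answer even when mine is healthy.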
  • The container host's own resolv.conf points at a public resolver, not at itself. A host must be able to resolve names well enough to start the very service that provides resolution to everyone else. Anything else is a bootstrap loop waiting to happen.
  • I keep a short note of the critical static IPs somewhere I can reach without the network, because "log in by IP" is no use if you do not know the IP.

For the second of those, the file now reads:

# /etc/resolv.conf on the resolver host
# point at upstream, never at the local container
nameserver 1.1.1.1
nameserver 9.9.9.9
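
A config like this is easy to undo by accident, so it is worth being able to check for the loop mechanically. The sketch below is my own illustration rather than anything I ran that afternoon: the function name is invented, and it treats any loopback entry as suspect, which will also flag the 127.0.0.53 stub on hosts running systemd-resolved, where the real upstreams are configured elsewhere.

```shell
#!/bin/sh
# self_referencing_nameservers FILE [OWN_IP...]
# Prints every nameserver entry in FILE that is a loopback address or
# matches one of the host's own addresses passed as OWN_IP arguments.
# Empty output means the file does not point back at this host.
self_referencing_nameservers() {
    file=$1
    shift
    awk '$1 == "nameserver" { print $2 }' "$file" | while read -r ns; do
        case "$ns" in
            127.*|::1)
                echo "$ns"
                continue
                ;;
        esac
        for own in "$@"; do
            if [ "$ns" = "$own" ]; then
                echo "$ns"
            fi
        done
    done
}
```

On the resolver host I would call it as self_referencing_nameservers /etc/resolv.conf 192.168.1.10, where the address is a stand-in for the host's own IP. Anything it prints is a nameserver that depends on the box it is running on.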

The lesson, again

The honest lesson is not technical, it is about humility on quiet afternoons. The change itself was fine. The mistake was making a critical service depend on itself and removing the only fallback in the same sitting, then rebooting for good measure. Any one of those alone would have been survivable.

Circular dependencies in infrastructure are sneaky precisely because everything works right up until the one moment a cold start exposes them. A running system hides them perfectly. You only meet the loop when something reboots, which is to say at the worst possible time, on a Sunday, with an audience.

I now have a small rule taped, metaphorically, to the inside of my skull: before you remove the old thing, make sure the new thing can survive a reboot on its own. The boring redundancy is boring for a reason. It is the bit that saves you from yourself.