Everything in the house went quiet at about nine on Saturday evening. Not down, exactly. The wifi was up, the switch lights were blinking, the internet was fine if you typed an IP address. But every name failed. No DNS, no anything-that-uses-names, which in 2023 is everything. It took me longer than I would like to admit to work out that I had done it to myself, and the shape of the mistake is one worth confessing because it is so easy to make.
I run Pi-hole as my resolver. Lovely bit of kit, blocks the ad networks, gives me query logs, all good. The problem is where I had put it. I had recently moved it into a container on the same Docker host as a pile of other services, and I had pointed that host's own /etc/resolv.conf at the Pi-hole container. Which is fine until the moment you restart Docker.
Here is the loop. The Docker daemon needs DNS to pull and start containers. The DNS server is a container. So on a daemon restart, Docker tries to resolve the registry, fails because the resolver it depends on is not up yet, and the resolver cannot come up cleanly because the daemon that runs it is mid-restart. A textbook circular dependency, and I had built it by hand without noticing, because in steady state it works perfectly. It only bites at the exact moment you most want things to recover.
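The loop is easier to see when written down as a small dependency graph and checked mechanically. What follows is only a sketch: the node names are labels invented for illustration (nothing Docker or Pi-hole exposes), and it assumes Python 3.9 or newer for the standard-library `graphlib` module.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical labels: each key needs everything in its set to be up first.
broken_setup = {
    "docker-daemon": {"host-dns"},         # the daemon resolves names via the host
    "host-dns": {"pihole-container"},      # host resolv.conf points at Pi-hole
    "pihole-container": {"docker-daemon"}, # Pi-hole is started by the daemon
}

# Same graph, but the host resolves through something outside the box.
fixed_setup = {
    "docker-daemon": {"host-dns"},
    "host-dns": {"upstream-resolver"},
    "pihole-container": {"docker-daemon"},
}


def find_cycle(deps):
    """Return the nodes of one startup cycle, or None if there is none."""
    try:
        # static_order() is lazy, so force it to make the sorter check the graph.
        tuple(TopologicalSorter(deps).static_order())
    except CycleError as err:
        # graphlib puts the offending cycle in the second exception argument.
        return err.args[1]
    return None


if __name__ == "__main__":
    print("broken:", find_cycle(broken_setup))
    print("fixed: ", find_cycle(fixed_setup))
```

In steady state nothing ever walks this graph, which is exactly why the cycle goes unnoticed until a restart forces the ordering question.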
The trigger was an unattended-upgrades run restarting the Docker service. That would normally have happened in the small hours, but I had been poking at it earlier, so the restart landed while I was awake to watch the lights go out.
The fix was embarrassingly simple. The host should never depend on a service it hosts for something as fundamental as name resolution. I gave the host a static, boring upstream resolver of its own (a public one, plus the router) and let only the client devices on the network point at Pi-hole. The host resolves names without needing the container; the container provides names to everyone else. The cycle is broken because the host no longer sits inside it.
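A cheap guard against sliding back into the old arrangement is to check that none of the host's configured nameservers is an address served from the host itself. The sketch below assumes the usual `nameserver <address>` lines in resolv.conf. Every address in it is a made-up example: 192.168.1.2 stands in for the Pi-hole container, 192.168.1.1 for the router and 1.1.1.1 for "a public one". The function names are hypothetical too.

```python
def nameservers(resolv_conf_text):
    """Collect the addresses from 'nameserver' lines, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Comment lines start with '#' or ';', so their first field never matches.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def self_hosted_resolvers(resolv_conf_text, hosted_addresses):
    """Return the nameservers that are served from this very host."""
    return [s for s in nameservers(resolv_conf_text) if s in hosted_addresses]


# Example addresses only, see the note above.
PIHOLE_ADDRESSES = {"192.168.1.2"}

before = """\
# host pointed at the container it runs
nameserver 192.168.1.2
"""

after = """\
# host uses boring resolvers that live elsewhere
nameserver 1.1.1.1
nameserver 192.168.1.1
"""

if __name__ == "__main__":
    print("before:", self_hosted_resolvers(before, PIHOLE_ADDRESSES))
    print("after: ", self_hosted_resolvers(after, PIHOLE_ADDRESSES))
```

On a real machine the text would come from reading /etc/resolv.conf. An empty list is the boring answer you want.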
The general rule I took away, and have written on a sticky note: a thing must not depend on a service that depends on it to start. Bootstrap dependencies have to flow one way. DNS, time, and the container runtime are the three I keep tripping over, because they feel like infrastructure that is "just there" right up until it is the thing that is not there. Half an hour of darkness and a lot of squinting at resolv.conf to relearn something I already knew.