the day my homelab lost its name

The outage started, as these things do, with me being absolutely certain it wasn't DNS. It was DNS. It is always DNS, and the only novel thing this time was that I had personally caused it about forty minutes earlier and then gone to make a coffee.

The symptom was beautifully unhelpful. Home Assistant could no longer reach the printer. Then Grafana stopped scraping half its targets. Then my partner asked, with the particular flatness that means the WiFi had become my problem, why "the internet" was down. The internet was fine. Resolution of anything ending in .lan was not.

what I'd actually done

Earlier that afternoon I had been tidying. That word should carry a warning label. I run a pair of Pi-hole instances as the primary and secondary resolvers, each forwarding upstream to a small unbound doing the actual recursion. The two Pi-holes are meant to be identical twins so that either can die and nobody notices.

I had decided the conditional forwarding config was untidy. I wanted local hostnames from the router's DHCP lease table to resolve, so I'd set up conditional forwarding to point at the router for the 192.168.10.0/24 reverse zone. Fine. Except I'd renumbered that VLAN back in spring, and the router now lived on 192.168.20.1. The forwarder was pointing at an address that no longer answered.

# pihole conditional forwarding, as it sat broken
CONDITIONAL_FORWARDING=true
CONDITIONAL_FORWARDING_IP=192.168.10.1
CONDITIONAL_FORWARDING_DOMAIN=lan
CONDITIONAL_FORWARDING_REVERSE=10.168.192.in-addr.arpa
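
A check this small would have flagged the stale address before anything was restarted. What follows is a minimal sketch, not anything Pi-hole ships: the function names are made up for illustration, and it assumes the renumbered VLAN is 192.168.20.0/24 and that the reverse zone should have moved along with the router.

```python
import ipaddress


def parse_settings(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def reverse_zone(network):
    """Reverse zone name for an IPv4 /8, /16 or /24 network."""
    net = ipaddress.ip_network(network)
    if net.version != 4 or net.prefixlen not in (8, 16, 24):
        raise ValueError("only IPv4 /8, /16 and /24 are handled in this sketch")
    octets = str(net.network_address).split(".")[: net.prefixlen // 8]
    return ".".join(reversed(octets)) + ".in-addr.arpa"


def check_forwarding(settings, current_network):
    """Return a list of problems; an empty list means the settings look sane."""
    problems = []
    net = ipaddress.ip_network(current_network)
    forwarder = ipaddress.ip_address(settings["CONDITIONAL_FORWARDING_IP"])
    if forwarder not in net:
        problems.append(f"forwarder {forwarder} is outside {net}")
    expected = reverse_zone(current_network)
    actual = settings["CONDITIONAL_FORWARDING_REVERSE"]
    if actual != expected:
        problems.append(f"reverse zone is {actual}, expected {expected}")
    return problems


# The settings exactly as they sat broken.
BROKEN = """
CONDITIONAL_FORWARDING=true
CONDITIONAL_FORWARDING_IP=192.168.10.1
CONDITIONAL_FORWARDING_DOMAIN=lan
CONDITIONAL_FORWARDING_REVERSE=10.168.192.in-addr.arpa
"""

# Assumption: the VLAN was renumbered to 192.168.20.0/24.
for problem in check_forwarding(parse_settings(BROKEN), "192.168.20.0/24"):
    print(problem)
```

Run against the broken settings, it complains about both the forwarder address and the reverse zone. It only compares the config with the subnet you say you are on, so it catches stale addresses, not a router that is simply switched off.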

The clever part, and I use clever in the most sarcastic register available, is that I had made the edit on both Pi-holes. The whole point of the redundant pair is that they fail independently. I had carefully and symmetrically broken both of them at once. Belt and braces, both cut with the same scissors.

the forty minutes of denial

What makes a self-inflicted outage special is the denial phase, where you debug everything except the thing you just changed. I checked upstream connectivity. Fine. I restarted unbound. Fine. I ran dig against 8.8.8.8 directly and got answers, which proved nothing except that Google's resolvers were having a better day than I was.

The tell, which I walked straight past twice, was that external names resolved perfectly and only .lan names hung. A name that hangs rather than returning NXDOMAIN is a name being forwarded to something that will never reply. The query went out to 192.168.10.1, into the void, and sat there until the client gave up. Classic black hole. I was watching the timeout, not the failure.
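
The difference between "no such name" and "nobody home" is easy to make visible. Here is a rough sketch that sends one hand-built A query over UDP and reports either the response code or a timeout. The function names are illustrative, and it deliberately skips everything a real resolver library handles (retries, TCP fallback, checking the reply ID).

```python
import socket
import struct

RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record with recursion desired."""
    # Header: ID, flags (RD bit set), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def rcode_of(response):
    """Extract the response code from the low 4 bits of the header flags."""
    if len(response) < 12:
        return "MALFORMED"
    (flags,) = struct.unpack("!H", response[2:4])
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")


def probe(name, server, port=53, timeout=2.0):
    """Return the rcode name, or 'TIMEOUT' if the server never replies."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, port))
            response, _ = sock.recvfrom(512)
        except socket.timeout:
            return "TIMEOUT"
        except OSError:
            return "UNREACHABLE"
    return rcode_of(response)
```

Calling probe("printer.lan", "192.168.10.1") against an address nothing listens on should come back as TIMEOUT after the full wait, whereas a live server that has never heard of the name answers NXDOMAIN almost immediately. The slow failure is the one that points at a dead forwarder.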

The thing that finally landed it was the most boring tool in the box:

dig printer.lan @127.0.0.1
; <<>> waiting for an answer that is not coming <<>>

Five seconds of nothing, then a SERVFAIL. The moment you see a five-second pause before a DNS failure, stop blaming the network and go and read what you changed. I checked the Pi-hole admin log, saw the forwarder address, and felt that specific warm flush of recognising your own handwriting on the murder weapon.

the fix, and the lesson I'll forget

The fix took eleven seconds. Point the forwarder at 192.168.20.1, restart pihole-FTL, watch the .lan names spring back to life. The printer reappeared. Grafana started scraping. The WiFi, which had never been broken, was declared fixed by popular acclaim.

The lesson, the one I write down every time and somehow never internalise, is to stop treating my redundant pair as redundant when I'm the one doing the editing. Two identical resolvers protect you from hardware faults and from each other. They do nothing whatsoever to protect you from your own config mistakes, because you apply your own config mistakes to both of them with great care.

What I've actually changed since: the homelab DNS config now lives in a git repo, and the deploy is a small Ansible play I run against one host, verify, then the other. Not because Ansible is magic, but because the verify-before-the-second-host step is now structural rather than a thing I remember to do when I'm feeling sensible. I am rarely feeling sensible at 4pm on a Wednesday with a coffee going cold.
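
Stripped of Ansible, the shape of that deploy is tiny. The sketch below is not the actual play, only the control flow it is meant to enforce. The host names pihole-a and pihole-b are invented, and apply and verify are stand-ins for "push the config" and "resolve a .lan name through that host".

```python
def rolling_deploy(hosts, apply, verify):
    """Apply a change host by host, stopping at the first failed verification.

    Returns (done, failed): the hosts that were changed and verified, and the
    host whose verification failed (None if every host passed).
    """
    done = []
    for host in hosts:
        apply(host)
        if not verify(host):
            # Stop here so the remaining hosts keep their known-good config.
            return done, host
        done.append(host)
    return done, None


# Hypothetical stand-ins: record which hosts were touched, and simulate
# a verification that fails the way the broken forwarder would have.
applied = []
done, failed = rolling_deploy(
    ["pihole-a", "pihole-b"],
    apply=applied.append,
    verify=lambda host: False,
)
print(done, failed, applied)  # [] pihole-a ['pihole-a']
```

The point is the early return: when the first host fails its check, the second is never touched, so one resolver is still answering while you work out what you did.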

I also added a synthetic check in Uptime Kuma that does nothing but resolve printer.lan every thirty seconds and goes red if the answer takes longer than two seconds. It would have caught this in half a minute instead of forty. The cheapest monitoring is usually the thing you already broke once.
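
The logic of that check fits in a dozen lines. This is a standalone sketch of the idea, not how Uptime Kuma implements its monitors. It uses the system resolver via socket.gethostbyname, and the injectable resolve and clock parameters are only there so it can be exercised without a network.

```python
import socket
import time


def check_resolution(name, max_seconds=2.0, resolve=socket.gethostbyname,
                     clock=time.monotonic):
    """Resolve a name once; report 'up' only if it succeeds within the limit.

    This measures how long the lookup took after the fact. It does not
    interrupt a slow lookup, so a single call can block for as long as
    the system resolver is willing to wait.
    """
    start = clock()
    try:
        resolve(name)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return "down", f"resolution failed: {exc}"
    elapsed = clock() - start
    if elapsed > max_seconds:
        return "down", f"took {elapsed:.2f}s, limit {max_seconds:.2f}s"
    return "up", f"resolved in {elapsed:.2f}s"
```

Something like check_resolution("printer.lan") in a loop every thirty seconds is all it takes. A slow success counts as a failure here on purpose, because the slow answer was the only symptom on offer.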

DNS didn't betray me. It did exactly what I told it to, with admirable patience, forwarding every local query to an address I'd abandoned months ago. The machine was fine. The operator needed a word with himself.