the day i broke my own dns and blamed everyone else first

It started, as these things do, with someone shouting up the stairs that the internet was down. It was not. The internet was fine. Everything between my house and the internet was on fire, and I had set the fire myself the night before, then gone to bed pleased with my work.

Here is the symptom, because the symptom is always more interesting than the headline. The router had a route to the wider world. I could ping 8.8.8.8 from any machine in the house. I could ping 1.1.1.1. But apt update hung, the kids' tablets sulked, and my phone fell back to mobile data the moment it gave up on the home wifi. Classic shape: the network works, names don't resolve. DNS.
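That split, reachable by address but not by name, is easy to check mechanically. A minimal Python sketch, assuming 8.8.8.8 accepts TCP connections on port 53 and that example.com is a fair test name; the function names are made up for illustration:

```python
import socket


def can_reach_ip(ip: str, port: int = 53, timeout: float = 2.0) -> bool:
    """True if a TCP connection to ip:port succeeds. No name lookup is involved."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def can_resolve(name: str) -> bool:
    """True if the system resolver (whatever DHCP handed out) returns an address.

    getaddrinfo has no timeout argument, so on a broken resolver this call
    can block for as long as the operating system chooses to wait.
    """
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


def diagnose(ip_ok: bool, dns_ok: bool) -> str:
    """Turn the two probe results into a one-line verdict."""
    if ip_ok and dns_ok:
        return "network and DNS both fine"
    if ip_ok:
        return "network fine, DNS broken"
    if dns_ok:
        return "DNS answers, but the probe address is unreachable"
    return "no connectivity at all"


# Usage (touches the network):
# print(diagnose(can_reach_ip("8.8.8.8"), can_resolve("example.com")))
```

On the morning in question this would have reported "network fine, DNS broken", which is the whole diagnosis in four words.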

what i had built

A bit of context on the setup, because the architecture is the whole story. I run a small homelab. Pi-hole does ad-blocking and local DNS for the house, and it forwards anything it can't answer to an Unbound instance doing recursive resolution. The clients get a single DNS server handed out by DHCP: the Pi-hole. One address. That detail matters and I will come back to it with the enthusiasm of a man who has learned a lesson.

The night before, I had been tidying. Unbound had been forwarding to a public resolver, and I decided, very reasonably, that I would rather do proper recursion from the root servers and not hand my entire browsing history to anyone in particular. Good instinct. The change is small:

forward-zone:
    name: "."
    # forward-addr: 1.1.1.1   <- removed this

Comment out the forward, restart Unbound, watch a couple of lookups succeed, declare victory, close the laptop. You can already see the shape of the mistake, can't you. I tested that it worked once, not that it kept working.

the actual fault

Two things had quietly gone wrong, and they conspired.

The first: Unbound, doing fresh recursion with a cold cache, was slow on the first hit for any given domain. Not broken, slow. A second or two while it walked down from the roots. Most resolvers and most clients tolerate that fine. Pi-hole, sitting in front, was configured with a forwarding timeout that was, frankly, optimistic. So on a cold cache the chain occasionally timed out before Unbound came back with the answer. Intermittent failures, the worst kind, because they make you doubt your own eyes.

The second was the real sin. When I "tidied" Unbound, I had also, in the same maintenance window, rebooted the Pi-hole host to pick up some updates. And the Pi-hole's own upstream, the line in its config that says where it forwards to, had at some point been pointed at the old Unbound forwarder address that no longer behaved the way it used to. So I had two layers both slightly misconfigured, each masking the other, both pointing at a moving target.

The thing that turned "slow and flaky" into "completely down" was timing. Overnight, the Pi-hole's cache had fully expired. By morning every single lookup was a cold lookup, and every cold lookup raced the timeout. The hit rate fell off a cliff and the house went dark, name-wise.
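The arithmetic behind that cliff can be sketched in a few lines of Python. The latencies and the one-second timeout are made-up illustrative numbers, not measurements from that morning, and the model ignores client retries:

```python
def cold_lookup_fails(upstream_latency_s: float, forward_timeout_s: float) -> bool:
    """A forwarded lookup fails when the upstream answers after the forwarder gave up."""
    return upstream_latency_s > forward_timeout_s


def failure_rate(
    hit_rate: float, cold_latencies_s: list[float], forward_timeout_s: float
) -> float:
    """Fraction of all lookups that fail. Only cache misses can race the timeout."""
    if not cold_latencies_s:
        raise ValueError("need at least one cold-lookup latency sample")
    timed_out = sum(
        cold_lookup_fails(latency, forward_timeout_s) for latency in cold_latencies_s
    )
    p_cold_fail = timed_out / len(cold_latencies_s)
    return (1.0 - hit_rate) * p_cold_fail


# Hypothetical numbers, chosen only to show the shape of the problem.
COLD_LATENCIES_S = [0.4, 0.9, 1.5, 2.2]  # recursion from the roots, cold cache
FORWARD_TIMEOUT_S = 1.0                   # the "optimistic" forwarder timeout

evening = failure_rate(0.9, COLD_LATENCIES_S, FORWARD_TIMEOUT_S)  # warm cache
morning = failure_rate(0.0, COLD_LATENCIES_S, FORWARD_TIMEOUT_S)  # everything expired
print(f"evening: {evening:.0%} of lookups fail, morning: {morning:.0%}")
```

With a warm cache the misses are rare enough that the odd timeout looks like noise. Take the cache away and the same per-miss failure rate applies to every lookup in the house.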

how i found it

I will be honest about the order of operations, because it is instructive and slightly embarrassing.

First I blamed the ISP. I checked their status page. Nothing. I rebooted the router, that great ritual of the helpless. No change.

Then I did the thing I should have done first, which is to ask the resolver directly what it thinks:

dig @192.168.1.2 example.com      # pi-hole, slow / SERVFAIL
dig @192.168.1.3 example.com      # unbound directly, eventually fine
dig @1.1.1.1 example.com          # instant

That third line is the tell. Going straight to a public resolver was instant. Going to Pi-hole was a coin toss. Going straight to Unbound worked but took its time. The fault was in my own two boxes, not anywhere out on the wire. From there it was just reading my own config from the night before with fresh, slightly ashamed eyes.
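The same three-way comparison can be scripted so that it reports a status and a time per resolver. This is a rough Python sketch using only the standard library: it hand-builds a minimal A-record query in the RFC 1035 wire format and sends it over UDP. The function names are illustrative, and it is no substitute for dig:

```python
import socket
import struct
import time

RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Encode a minimal DNS query for an A record."""
    # Header: ID, flags (recursion desired), then one question and no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # Zero byte ends the name, then QTYPE=A (1) and QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def parse_rcode(response: bytes) -> str:
    """Return the response code name from the low four bits of the flags field."""
    if len(response) < 12:
        return "MALFORMED"
    flags = struct.unpack("!H", response[2:4])[0]
    rcode = flags & 0x000F
    return RCODES.get(rcode, f"RCODE{rcode}")


def time_query(server: str, name: str, timeout_s: float = 3.0) -> tuple[str, float]:
    """Send one UDP query to server:53 and return (status, elapsed seconds)."""
    query = build_query(name)
    start = time.monotonic()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        try:
            sock.sendto(query, (server, 53))
            response, _ = sock.recvfrom(512)
            # A reply with a different ID is not an answer to our question.
            matches = response[:2] == query[:2]
            status = parse_rcode(response) if matches else "MISMATCH"
        except socket.timeout:
            status = "TIMEOUT"
        except OSError as exc:
            status = f"ERROR({exc.__class__.__name__})"
    return status, time.monotonic() - start


# Usage (needs a network; the addresses are the ones from the dig lines above):
# for server in ("192.168.1.2", "192.168.1.3", "1.1.1.1"):
#     status, elapsed = time_query(server, "example.com")
#     print(f"{server:<12} {status:<9} {elapsed * 1000:.0f} ms")
```

Running it a few times in a row matters more than running it once, since the second query for the same name will usually be answered from cache.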

The fix was undramatic. I gave Pi-hole a sane forwarding timeout, turned on prefetching in Unbound so that popular names are refreshed shortly before they expire rather than after, which keeps the cache warm, and set a floor on how long records stay cached:

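server:
    prefetch: yes        # refresh popular entries shortly before their TTL runs out
    cache-min-ttl: 60    # keep every record cached for at least 60 seconds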

what i actually changed afterwards

The config fix took ten minutes. The lesson took longer.

The single point of failure was the design, not the bug. One DNS server in DHCP means one thing to break, and I had lovingly built a tower of two things that both had to be exactly right at the same time. So DHCP now hands out two resolvers, and the second one is deliberately not part of my clever recursive setup. It is a boring public resolver. If my homelab falls over at three in the morning, the house quietly fails over to something that just works, and nobody shouts up the stairs.
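The intent of the second resolver can be modelled in a few lines of Python. This is a sketch of the idea, not of how any particular operating system behaves: real clients differ in how they choose among the servers DHCP gives them. The addresses and the fake query function are assumptions for illustration:

```python
from typing import Callable, Optional, Sequence


class LookupFailed(Exception):
    """Raised when no resolver in the list produced an answer."""


def resolve_with_fallback(
    name: str,
    resolvers: Sequence[str],
    query: Callable[[str, str], Optional[str]],
) -> tuple[str, str]:
    """Ask each resolver in order and return (resolver_used, answer)."""
    for resolver in resolvers:
        try:
            answer = query(resolver, name)
        except TimeoutError:
            continue  # this resolver is down or too slow, so move on
        if answer is not None:
            return resolver, answer
    raise LookupFailed(f"no resolver answered for {name}")


# Hypothetical stand-in for a real DNS query: the homelab resolver is "down".
def fake_query(resolver: str, name: str) -> Optional[str]:
    if resolver == "192.168.1.2":
        raise TimeoutError("homelab resolver did not answer")
    return "192.0.2.10"  # documentation-range address, not a real answer


used, answer = resolve_with_fallback(
    "example.com", ["192.168.1.2", "1.1.1.1"], fake_query
)
print(f"answered by {used}: {answer}")
```

With only the first address in the list, the same call raises LookupFailed, which is roughly what the house experienced.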

The wider point, and the only thing worth taking away: test that a change keeps working, not that it worked once. A single successful lookup against a warm cache proves nothing. The failure mode lived in the cold-cache path, which is exactly the path I never exercised because I tested immediately after my own warm-up queries.
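One way to test "keeps working" instead of "worked once" is to repeat the same lookups over a period longer than the cache lifetime and look at every pass, not just the first. A minimal Python sketch, assuming you supply your own `lookup_ok` function (hypothetical here) that returns True when a name resolves:

```python
import time
from typing import Callable, Sequence


def soak_test(
    lookup_ok: Callable[[str], bool],
    names: Sequence[str],
    rounds: int,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> list[float]:
    """Run `rounds` passes over `names`, `interval_s` apart.

    Returns the fraction of failed lookups in each pass. Choose an interval
    longer than the record TTLs so later passes hit a cold cache.
    """
    if not names:
        raise ValueError("need at least one name to look up")
    fractions = []
    for round_no in range(rounds):
        failures = sum(0 if lookup_ok(name) else 1 for name in names)
        fractions.append(failures / len(names))
        if round_no < rounds - 1:
            sleep(interval_s)  # wait for cached entries to expire
    return fractions


def kept_working(failure_fractions: Sequence[float], max_failure: float = 0.0) -> bool:
    """True only if every pass stayed at or below the allowed failure fraction."""
    return all(fraction <= max_failure for fraction in failure_fractions)
```

A result like `[0.0, 0.0, 0.5]` is the failure I shipped: fine at bedtime, broken by breakfast.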

It was my own fault, top to bottom. I find that oddly reassuring. A fault you caused is a fault you understand, and a fault you understand will not surprise you the same way twice. It will surprise me in some entirely new way instead, and I look forward to writing that one up too.