I've run my own recursive resolver for a while now, and it had been boringly reliable, which is the only kind of reliable worth having. Then one morning, nothing resolved. Not some things. Nothing. Every lookup came back SERVFAIL, the house went quiet, and I had nobody to blame but the person who'd insisted on doing DNS himself.
The instinct is to assume the resolver is broken. The harder, more useful assumption is that the resolver is doing exactly what you told it, and what you told it was wrong.
following the SERVFAIL
SERVFAIL with DNSSEC validation on is usually one of two things: the data really is bad, or your validator can't trust it. I ran the query by hand and asked Unbound to tell me why:
dig @192.168.1.2 example.com +dnssec
unbound-host -D -v example.com
The verbose output said the validation was failing on a signature whose inception time was in the future. A signature can't be valid before it's been made. Which meant either the entire DNS had travelled back in time, or my box's clock had.
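The comparison the validator is making can be sketched in a few lines of shell. This is a simplified illustration, not how Unbound implements it: the function name rrsig_window_status is made up for the example, the timestamps are sample values, and a real validator may allow some tolerance around the window rather than comparing strictly.

```shell
# Classify a clock reading against an RRSIG validity window.
# All three arguments use the RRSIG presentation format YYYYMMDDHHMMSS (UTC).
# The format is fixed-width, so plain integer comparison orders it correctly.
# rrsig_window_status is a hypothetical helper, not part of dig or Unbound.
rrsig_window_status() {
    inception=$1
    expiration=$2
    now=$3
    if [ "$now" -lt "$inception" ]; then
        echo "not-yet-valid"   # clock is behind the signature's inception
    elif [ "$now" -gt "$expiration" ]; then
        echo "expired"         # clock is past the signature's expiration
    else
        echo "valid"
    fi
}

# Example use (commented out so the function can be sourced on its own).
# In dig's answer output, field 10 is the inception and field 9 the expiration:
#   dig @192.168.1.2 example.com A +dnssec +noall +answer |
#     awk '$4 == "RRSIG" { print $10, $9 }'
#   rrsig_window_status <inception> <expiration> "$(date -u +%Y%m%d%H%M%S)"
```

A result of not-yet-valid against the local clock, for a zone that validates fine elsewhere, points at the clock rather than the zone.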
it was the clock, of course
The little resolver box had lost NTP at some point and quietly drifted. Not by much, but every DNSSEC signature carries an explicit inception and expiration time, and "not much" was enough to leave the system clock behind the inception time of freshly rotated signatures. The cryptography was working perfectly. It was correctly refusing to trust records that, as far as my wrong clock could tell, hadn't been signed yet.
Fixing the clock fixed everything in seconds:
systemctl restart systemd-timesyncd
timedatectl status
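The sync check is easy to script rather than read off by eye. A minimal sketch, assuming a systemd recent enough to support timedatectl show, which prints properties as KEY=VALUE lines including NTPSynchronized; the helper name ntp_synced is made up for the example.

```shell
# Read `timedatectl show` style KEY=VALUE lines on stdin and succeed only
# if the NTPSynchronized property is "yes".
# ntp_synced is a hypothetical helper name.
ntp_synced() {
    # -x matches the whole line, so "NTP=yes" alone does not count.
    grep -qx 'NTPSynchronized=yes'
}

# Example use (commented out so the function can be sourced on its own):
#   if timedatectl show | ntp_synced; then
#       echo "clock is synchronized"
#   else
#       echo "clock is NOT synchronized"
#   fi
```

Reading from stdin keeps the function testable on a machine without systemd, and it distinguishes "NTP is enabled" from "NTP has actually synchronized", which is the distinction that mattered here.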
The second issue, which I found whilst I was in there, was a trust anchor I'd half-managed by hand. Unbound can maintain the root key itself via RFC 5011, provided you let it own the anchor file and point auto-trust-anchor-file at a path it can write:
server:
    auto-trust-anchor-file: "/var/lib/unbound/root.key"
I'd been treating that file as read-only config. It isn't. It's state, and the resolver needs to update it as the root key rolls.
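Whether the resolver can actually update that state is worth checking directly. A rough sketch, assuming the check is run as the user the resolver runs as (often unbound, but that depends on the install); the helper name anchor_writable is made up for the example. Unbound needs write access to the directory as well as the file, because it writes a temporary file alongside the anchor when updating it.

```shell
# Succeed only if the current user can update an auto-trust-anchor-file:
# the file must exist and be writable, and so must its directory.
# anchor_writable is a hypothetical helper name.
anchor_writable() {
    anchor=$1
    dir=$(dirname "$anchor")
    [ -f "$anchor" ] && [ -w "$anchor" ] && [ -w "$dir" ]
}

# Example use, run as the resolver's user (commented out so the function
# can be sourced on its own):
#   anchor_writable /var/lib/unbound/root.key || echo "anchor is not writable"
```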
the lesson I keep relearning
DNSSEC didn't fail me. It did its job and refused to lie. The failure was upstream of the cryptography, in the dull infrastructure underneath: a clock and a key file. When you run your own resolver you also own its time, its trust anchors, and the privilege of breaking your entire house with a drifted clock. I'd still do it. I just check NTP first now, every single time.