Ramblings of an aging IT geek
networking

the time a stale ttl outlived my actual mistake

A self-inflicted internal DNS outage caused by a fat-fingered record and a long TTL that kept the wrong answer alive for hours.

Network cables coiled at the back of a rack

Half my internal services disappeared on Saturday afternoon and it took me an embarrassingly long time to admit it was me. The web UIs were down, the reverse proxy was throwing errors, but the boxes themselves were up and pingable by IP. Classic DNS shape: the machines are fine, the names are lying.

I'd been renumbering a subnet earlier in the day and updated the A records on my internal resolver to match. Or rather, I updated most of them. One service got pointed at the old address, the one I'd just decommissioned, because I copied the line above and didn't change the last octet. Garden-variety fat finger. I'd have caught it in thirty seconds, except for the second half of the mistake, which is the bit actually worth writing down.

A small server rack with status lights

My internal zone had a default TTL of 86400. A full day. So when I noticed the bad record, fixed it, and reloaded the zone, nothing changed. Every client and every forwarder between me and them was still cheerfully serving the wrong answer it had cached, and would keep doing so for hours. I sat there reloading the zone, confirming the correct answer with dig @localhost, and watching the rest of the house get the old one. The fix was already in. The lie just had a long shelf life.
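To make the mechanism concrete, here is a toy model in Python. It is a sketch under loud assumptions, not how any real resolver is implemented: the `TinyCache` class, the name `wiki.home.lan`, and the 10.0.x.x addresses are all invented for illustration. The only thing it tries to show is that a cache never asks the zone again until the TTL it was handed has run out.

```python
from dataclasses import dataclass


@dataclass
class CachedAnswer:
    address: str
    ttl: int          # seconds the answer may be reused
    cached_at: float  # when the cache stored it, in seconds


class TinyCache:
    """A toy caching resolver: reuses an answer until its TTL runs out."""

    def __init__(self, authoritative):
        self.authoritative = authoritative  # dict: name -> (address, ttl)
        self.entries = {}

    def lookup(self, name, now):
        entry = self.entries.get(name)
        if entry is not None and now - entry.cached_at < entry.ttl:
            return entry.address  # still trusted, so the zone is never asked
        address, ttl = self.authoritative[name]
        self.entries[name] = CachedAnswer(address, ttl, now)
        return address


# Hypothetical record: the typo points at the decommissioned address.
zone = {"wiki.home.lan": ("10.0.1.20", 86400)}
cache = TinyCache(zone)
cache.lookup("wiki.home.lan", now=0)  # the cache learns the bad answer

# The fix lands on the authoritative side, but the cache does not know.
zone["wiki.home.lan"] = ("10.0.2.20", 86400)
one_hour_later = cache.lookup("wiki.home.lan", now=3600)
one_day_later = cache.lookup("wiki.home.lan", now=86400)
print(one_hour_later, one_day_later)
```

In this model the lookup an hour after the fix still returns the old address, and only the lookup a full TTL later goes back to the zone and picks up the corrected one.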

The actual recovery was tedious rather than clever: flush the caches I could reach, restart the forwarder, and on a couple of stubborn clients clear the resolver cache by hand. Over the next few hours the stale answer was flushed or aged out everywhere and the network healed itself, which is the maddening thing about TTL problems. The system is working exactly as designed. You told it to trust an answer for a day, and it did.
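One way to tell which caches are still holding the old answer, and for how much longer, is to read the TTL column out of dig's answer section: a cache reports the time remaining, while the authoritative server reports the full value. The sketch below assumes the usual `name TTL class type rdata` layout that `dig +noall +answer` prints. The function names, the hostname, and the address are hypothetical.

```python
import subprocess


def remaining_ttls(dig_answer: str) -> dict:
    """Map each (name, address) A record in dig answer text to its TTL."""
    result = {}
    for line in dig_answer.splitlines():
        fields = line.split()
        # Answer lines look like: name TTL class type rdata
        if len(fields) == 5 and fields[2] == "IN" and fields[3] == "A":
            result[(fields[0], fields[4])] = int(fields[1])
    return result


def ask(name: str, server: str) -> dict:
    """Run dig against one server and parse its answer (needs dig installed)."""
    out = subprocess.run(
        ["dig", "+noall", "+answer", name, f"@{server}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return remaining_ttls(out)


# Hypothetical answer from a cache that still has the stale record.
sample = "wiki.home.lan.\t\t81234\tIN\tA\t10.0.1.20\n"
parsed = remaining_ttls(sample)
print(parsed)
```

Calling something like `ask` once against the authoritative server and once against each forwarder would show both who is still serving the old address and how many seconds they intend to keep doing it.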

Two changes came out of it. First, my internal zone now runs a much shorter default TTL, 300 seconds, because internal records change far more often than public ones and I'd rather pay for a few more queries than wait a day for a typo to expire. The textbook reason for long TTLs is reducing load on authoritative servers, and on a home network that load is a rounding error. Second, I stopped editing zone files by copy-paste-and-pray. There's a tiny script now that bumps the serial and validates with named-checkzone before it'll reload, so at least the syntax is sound even when my brain isn't.
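A rough sketch of that kind of guard, in Python, might look like the following. It is an illustration and not the actual script. It assumes a date-style `YYYYMMDDNN` serial on a line with a `; serial` comment, and that `named-checkzone` and `rndc` are on the path. The zone contents and function names are made up.

```python
import re
import subprocess
from datetime import date

# Assumption: the serial is 10 digits followed by a "; serial" comment.
SERIAL_RE = re.compile(r"(\d{10})(\s*;\s*serial)", re.IGNORECASE)


def bump_serial(zone_text: str, today: date) -> str:
    """Return zone_text with its YYYYMMDDNN serial advanced."""
    match = SERIAL_RE.search(zone_text)
    if match is None:
        raise ValueError("no 10-digit serial with a '; serial' comment found")
    old = int(match.group(1))
    # Jump to today's first revision, or one past the old serial if that
    # would not be an increase.
    new = max(old + 1, int(today.strftime("%Y%m%d")) * 100)
    return zone_text[:match.start(1)] + f"{new:010d}" + zone_text[match.end(1):]


def check_and_reload(zone_name: str, zone_path: str) -> bool:
    """Reload the zone only if named-checkzone accepts the file."""
    check = subprocess.run(
        ["named-checkzone", zone_name, zone_path],
        capture_output=True, text=True,
    )
    if check.returncode != 0:
        print(check.stdout or check.stderr)
        return False
    subprocess.run(["rndc", "reload", zone_name], check=True)
    return True


# Hypothetical zone fragment, just to exercise the serial bump.
example = (
    "@ IN SOA ns1.home.lan. admin.home.lan. (\n"
    "    2024060101 ; serial\n"
    "    3600 ; refresh\n"
    ")\n"
)
bumped = bump_serial(example, date(2024, 6, 1))
print(bumped)
```

Worth being honest about the limits: a check like this only proves the file parses. A wrong last octet is still a perfectly valid A record, which is why the shorter TTL is the change that actually shrinks the blast radius.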

The typo was a five-second mistake. The TTL turned it into a five-hour one. Worth remembering that DNS doesn't just store your answers, it stores how long it'll keep believing the wrong ones.