Ramblings of an aging IT geek

the outage that wasn't down, it was just lying

A production incident that looked like a total outage but was really one of two anycast resolvers returning SERVFAIL after a half-applied config change, and what it taught me about debugging under pressure.

Everything was down. That was the report, and for the first ten minutes it looked true. The dashboards were red, the app couldn't reach its database, the database couldn't reach the cache, and the cache couldn't reach anything either. When every arrow on the architecture diagram goes red at once, the instinct is to look for the one shared thing underneath them all, and the one shared thing is almost always either the network or DNS. This time it was DNS, and the way it failed is the reason I'm writing it down.

the symptom didn't match the cause

The first misleading thing was that nothing was actually down. Every host was up. Every service was running and healthy if you hit it by IP. You could SSH into any box, you could ping any box, the load balancers were passing health checks against their backend IPs perfectly happily. By every measure that involved an IP address, the system was fine. It was only when a name needed resolving that things fell over, and modern stacks resolve names constantly: service discovery, database connection strings, internal API calls, certificate validation, all of it leans on DNS far more than the architecture diagram suggests.

So we had the classic split: machines healthy, names broken. That narrows it fast, and it should have narrowed it faster than it did. We lost time because the errors didn't say DNS. They said connection timeouts, they said connection refused, they said TLS handshake failures. An application three layers up doesn't know its connection failed because a name didn't resolve. It just knows it couldn't connect, and it reports the symptom at its own layer, not the cause at the bottom.

There's a particular flavour of this with timeouts that's worth dwelling on, because it cost us a couple of minutes of genuine confusion. A failed resolution doesn't always fail fast. Depending on the resolver and the libc settings (on glibc, the timeout and attempts options in resolv.conf), a query that gets no useful answer can sit there retrying until the resolver timeout expires, and only then does the application give up. So what should be an instant "no such name" presents instead as a slow, hanging connection that eventually times out. That made some of the failures look like a slow database rather than a broken name, and we briefly went looking at database load that was, of course, perfectly fine.
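
One way to tell a fast failure from a slow one is to ask the resolver directly and record both the response code and how long it took. What follows is a minimal sketch, not a script from the incident: the function name lookup_status is made up for illustration, and the +time=2 and +tries=1 values are assumptions. Note that dig uses its own timeout rather than the libc one, so this measures the resolver, not what the application experiences.

```shell
# Resolver client to use; override DIG to point at another binary or a stub.
DIG="${DIG:-dig}"

# lookup_status NAME SERVER   (hypothetical helper)
# Prints "<status> <elapsed-seconds>", for example "SERVFAIL 0".
# Prints "NORESPONSE <seconds>" when no response header comes back at all,
# which is the slow, hanging case.
lookup_status() {
    ls_start=$(date +%s)
    # +noall +comments keeps only the header comments, which carry the status.
    ls_status=$("$DIG" +noall +comments +time=2 +tries=1 "$1" "@$2" 2>/dev/null |
        sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1)
    ls_end=$(date +%s)
    echo "${ls_status:-NORESPONSE} $((ls_end - ls_start))"
}

# Usage (needs a reachable resolver):
#   lookup_status internal.svc.example 10.0.0.53
```

A SERVFAIL with near-zero elapsed time points at a resolver that is answering wrongly; NORESPONSE after a couple of seconds points at one that is not answering at all.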

chasing the wrong resolver

Here's the bit that genuinely caught us out. We had two internal resolvers behind an anycast address, and a config push earlier that day had updated the forwarding rules on both. Except it hadn't, quite. The deployment had applied cleanly to one and silently failed on the other, leaving one resolver serving correct answers and the other serving SERVFAIL for an entire internal zone. With both behind the same address, every query had a coin-flip chance of hitting the broken one.

That intermittency is what turned a ten-minute fix into a forty-minute one. We'd dig a name, get the right answer, and conclude DNS was fine. Run it again, get SERVFAIL, and conclude it wasn't. The same command gave different answers seconds apart, which is exactly the kind of thing that makes you start doubting your tools instead of your infrastructure.

$ for i in $(seq 5); do dig +short internal.svc.example @10.0.0.53; done
10.0.4.21

10.0.4.21

Two of those five came back empty. Once we saw that pattern, the rest fell into place. We pulled the broken resolver out of the anycast pool, the coin flip stopped, and the whole estate recovered within a minute or two as the bad cached negatives aged out.
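
Eyeballing five lines works, but counting makes the pattern harder to argue with. Here is a small sketch that repeats the same query and tallies answered versus empty responses. The function name tally_answers is hypothetical, and treating any output line that starts with a semicolon as a dig diagnostic rather than an answer is an assumption about how dig reports errors in short mode.

```shell
# Resolver client to use; override DIG to point at another binary or a stub.
DIG="${DIG:-dig}"

# tally_answers NAME SERVER COUNT   (hypothetical helper)
# Runs COUNT queries and prints "answered=<n> empty=<m>".
tally_answers() {
    ta_answered=0
    ta_empty=0
    ta_i=0
    while [ "$ta_i" -lt "$3" ]; do
        # Drop dig's ";;" diagnostic lines so only real answers count.
        ta_out=$("$DIG" +short "$1" "@$2" 2>/dev/null | grep -v '^;' || true)
        if [ -n "$ta_out" ]; then
            ta_answered=$((ta_answered + 1))
        else
            ta_empty=$((ta_empty + 1))
        fi
        ta_i=$((ta_i + 1))
    done
    echo "answered=$ta_answered empty=$ta_empty"
}

# Usage (needs a reachable resolver):
#   tally_answers internal.svc.example 10.0.0.53 20
```

Anything other than all-answered or all-empty against a single address is a strong hint that more than one backend is sitting behind it.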

what i actually changed afterwards

The fix was easy once we understood it. The lessons were the expensive part, and there were three that stuck.

First, intermittent failures behind a load-balanced or anycast address are misery to debug precisely because your diagnostic command and your application's request might land on different backends. Now, when I'm chasing one of these, I query each backend by its real address rather than the virtual one, so I'm comparing like with like.
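
In practice that means looping over the real addresses instead of the shared one. A minimal sketch, assuming the backends can be queried directly: the function name per_backend and the addresses 10.0.0.11 and 10.0.0.12 are invented for illustration and are not the real resolver addresses from the incident.

```shell
# Resolver client to use; override DIG to point at another binary or a stub.
DIG="${DIG:-dig}"

# per_backend NAME BACKEND...   (hypothetical helper)
# Prints one line per backend: "<backend> <first answer or NO-ANSWER>".
per_backend() {
    pb_name=$1
    shift
    for pb_server in "$@"; do
        pb_out=$("$DIG" +short "$pb_name" "@$pb_server" 2>/dev/null |
            grep -v '^;' | head -n 1 || true)
        echo "$pb_server ${pb_out:-NO-ANSWER}"
    done
}

# Usage (addresses are placeholders):
#   per_backend internal.svc.example 10.0.0.11 10.0.0.12
```

With one line per backend, a healthy resolver and a broken one show up side by side instead of alternating in time.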

Second, our deployment had failed on one host and reported success overall. That's the actual bug, and it's a boring one: a config push that doesn't verify each target individually is a config push that will eventually leave you in a split-brain state and lie about it. We added a post-apply check that resolves a canary name against each resolver and fails the deploy loudly if any of them disagree.
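
A check of that shape can be sketched as a function that returns non-zero if any resolver gives the wrong answer for the canary. This is an illustration rather than the actual check from the deploy pipeline: the name verify_resolvers, the canary name and the addresses in the usage line are all assumptions.

```shell
# Resolver client to use; override DIG to point at another binary or a stub.
DIG="${DIG:-dig}"

# verify_resolvers CANARY EXPECTED RESOLVER...   (hypothetical helper)
# Returns 0 only if every resolver returns EXPECTED for CANARY.
# Each mismatch or empty answer is reported on stderr.
verify_resolvers() {
    vr_name=$1
    vr_expected=$2
    shift 2
    vr_failed=0
    for vr_server in "$@"; do
        vr_got=$("$DIG" +short "$vr_name" "@$vr_server" 2>/dev/null |
            grep -v '^;' | head -n 1 || true)
        if [ "$vr_got" != "$vr_expected" ]; then
            echo "FAIL $vr_server: expected $vr_expected, got '$vr_got'" >&2
            vr_failed=1
        fi
    done
    return "$vr_failed"
}

# Usage as the last step of a deploy (all values are placeholders):
#   verify_resolvers canary.internal.example 10.0.4.21 10.0.0.11 10.0.0.12 || exit 1
```

It checks every target even after the first failure, so a single run lists all the resolvers that are out of step rather than just the first.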

Third, and this is the human bit: when everything is red, resist the urge to treat the symptoms at the top of the stack. A hundred services failing to connect is one story, not a hundred. Find the shared dependency, prove it healthy or broken, and only then move up a layer. We spent too long reading application logs that were all faithfully reporting a problem they didn't have.

There's a fourth thing I've come to believe since, which is about how we built the diagnosis. Once we suspected DNS, the fastest way to prove it would have been a known-good query we could trust: a canary name with a known answer, resolved against each resolver by its real address, run in a tight loop. We assembled that ad hoc during the incident, under pressure, with everyone watching. It would have taken thirty seconds to have it sitting ready. So now it does. There's a tiny script that resolves a handful of canary names against every internal resolver individually and shouts if any of them disagree or fail, and it runs both on a timer and on demand. The point of it isn't to catch the failure automatically, though it does. The point is that when the next weird DNS thing happens, the first diagnostic question, "are all the resolvers agreeing?", has an answer I can produce in one command instead of inventing on the spot.

It's worth saying plainly that none of the individual facts here were obscure. Anycast hides which backend you hit. Errors report at the wrong layer. A failed lookup can hang instead of failing fast. A half-applied deploy leaves you split-brained. Every one of those is in the ops folklore, and I'd have nodded along to all of them in a calm room. The incident wasn't hard because any single fact was hard. It was hard because four ordinary things stacked up at once, under pressure, with the symptoms pointing everywhere except the cause. That's what real outages mostly are: not one clever failure, but several boring ones arriving together and conspiring to look like something else.

The whole incident lasted under an hour and nothing was lost, which by outage standards is a good day. But "everything is down" turned out to mean "one resolver is quietly answering wrong, half the time", and the gap between those two sentences is most of what incident response actually is.