it was dns, it is always dns

A terminal full of failing requests

The symptom was that everything was slow, and then nothing worked. Deploys timed out, a couple of services started throwing 500s, and the monitoring went amber across half the estate at once. Nothing had been deployed in hours, which is always the unsettling kind of incident: no change to point at, and yet here we are.

It was DNS. It is always DNS. Earlier in the day someone had tidied a resolver config and dropped one of the two upstreams, the one that happened to be answering most of the internal queries. The remaining one was up but rate-limiting us, so lookups did not fail cleanly. They hung, retried, eventually succeeded, and made every dependent thing look intermittently broken. Intermittent is so much worse than down, because down at least tells you where to look.

The tell, once I thought to check it, was that dig against the service name took three seconds and then worked. Direct IP was instant. From there it was obvious. We put the second upstream back, lowered the resolver timeout so a dud server fails fast instead of hanging, and added a check that actually times a real lookup rather than just pinging port 53.

The lesson is the boring one. A box answering on 53 is not the same as a box answering your queries in time, and your health checks should know the difference.