The pager went off for the API gateway, then the billing worker, then the metrics pipeline, all within about ninety seconds. Three teams, three Slack threads, three people each independently certain it was someone else's fault. The graphs all looked the same: latency climbing, then a wall of timeouts. When everything breaks at once and the broken things appear to have nothing in common, what they share is usually underneath all of them.
The errors were "connection timed out", never "connection refused". That distinction matters. Refused means something answered and said no. Timed out means nobody answered at all, which points at the path rather than the destination. The destinations were healthy. I could curl them by IP from the affected boxes and get an instant reply.
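The refused versus timed-out distinction shows up directly in the exception a client gets back. Here is a minimal Python sketch, assuming a plain TCP connect is a fair stand-in for what the services were doing. The name classify_connect is made up for illustration.

```python
import socket


def classify_connect(host, port, timeout=3.0):
    """Attempt one TCP connect and say how it ended.

    Hypothetical helper for illustration only.
    Returns "ok", "refused", "timeout", or "error: <detail>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionRefusedError:
        # Something answered and said no (typically a TCP RST).
        return "refused"
    except (socket.timeout, TimeoutError):
        # Nobody answered within the deadline.
        return "timeout"
    except OSError as exc:
        # Everything else, including name resolution failures.
        return f"error: {exc}"
```

Called with an IP address rather than a hostname, this sends no DNS query at all, which is what makes it useful for separating the path from the destination.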
By name, though, the same curl hung. dig against the local resolver took five seconds and came back SERVFAIL. One of our two internal resolvers had wedged after a config reload earlier that afternoon, and resolv.conf listed it first, with a five-second timeout before failing over to the second. So every outbound call across every service was paying a five-second DNS tax and then succeeding, which from the application's point of view looked exactly like the whole world had gone slow.
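One way to see the DNS tax in isolation is to time the lookup by itself, apart from the connect. This is a rough Python sketch, assuming that socket.getaddrinfo goes through the same system resolver path (and so the same resolv.conf) as the affected services. The names time_lookup and resolve are hypothetical.

```python
import socket
import time


def time_lookup(name, resolve=socket.getaddrinfo):
    """Time a single name lookup; return (seconds, outcome).

    Hypothetical helper. `resolve` is injectable so the function can be
    exercised without touching a real resolver.
    """
    start = time.monotonic()
    try:
        resolve(name, None)
        outcome = "resolved"
    except socket.gaierror as exc:
        outcome = f"failed: {exc}"
    return time.monotonic() - start, outcome
```

An IP literal returns almost immediately because no query is sent, so comparing a hostname against its address on the same box gives a rough size for the delay.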
Restarting the wedged resolver fixed it in under a minute, which is the embarrassing part. The real fix came after: drop the resolver timeout to one second, make the two resolvers genuinely independent rather than sharing the same upstream forwarder, and add a synthetic check that resolves a known name every fifteen seconds and pages on SERVFAIL directly. It is always DNS, right up until you actually monitor DNS, at which point it becomes the first thing you rule out instead of the last.
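The synthetic check could look something like the Python sketch below. It is not the check we actually run: it assumes a hand-built UDP query for an A record sent to each resolver in turn, and every function name, plus the page callback, is a placeholder. It also treats a timeout as a failure alongside SERVFAIL, which is my assumption about what should page.

```python
import random
import socket
import struct
import time

SERVFAIL = 2  # RCODE 2 in the DNS header (RFC 1035)


def build_query(name, query_id):
    """Build a minimal DNS query for the A record of `name`."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    # QNAME ends with a zero-length label; QTYPE=1 (A), QCLASS=1 (IN).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def parse_rcode(response, query_id):
    """Return the RCODE of a response, or None if it is not a reply to us."""
    if len(response) < 12:
        return None
    resp_id, flags = struct.unpack("!HH", response[:4])
    if resp_id != query_id or not (flags & 0x8000):  # QR bit marks a response
        return None
    return flags & 0x000F


def check_resolver(server, name, timeout=1.0, port=53):
    """Query one resolver directly and summarize the result as a string."""
    query_id = random.randrange(0x10000)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name, query_id), (server, port))
            response, _ = sock.recvfrom(512)
        except (socket.timeout, TimeoutError):
            return "timeout"
        except OSError as exc:
            return f"error: {exc}"
    rcode = parse_rcode(response, query_id)
    if rcode is None:
        return "malformed"
    if rcode == 0:
        return "ok"
    if rcode == SERVFAIL:
        return "servfail"
    return f"rcode {rcode}"


def watch(servers, name, page, interval=15.0):
    """Check each resolver every `interval` seconds; never returns.

    `page` is a hypothetical callback standing in for the real alerting hook.
    """
    while True:
        for server in servers:
            status = check_resolver(server, name)
            if status != "ok":
                page(f"resolver {server}: {status} for {name}")
        time.sleep(interval)
```

Querying each resolver by address, rather than going through resolv.conf, is deliberate in this sketch: failover to the second resolver would otherwise hide a dead first one, which is exactly the failure that went unnoticed.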