The alerts came in all at once, which is the worst kind of alert because it tells you everything is broken and therefore nothing in particular is. Half a dozen services, all suddenly unable to reach their dependencies, throwing connection errors that looked like the world had ended. My first thought was the network. My first thought is always the network. It's almost never the network.
It was DNS. Of course it was DNS. One internal resolver had fallen over, and far too many things were pointed at it and only it. So every name lookup hung, then timed out, and every service that does anything useful starts by looking up a name. The packets could have flowed perfectly well if anything had known where to send them.
What gave it away was a single dig against a known-good name that just sat there until it timed out, while a ping to a raw IP went through instantly. Connectivity fine, resolution dead. From there it was a short walk to the resolver, which had run out of file descriptors and stopped answering.
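That two-sided check is easy to script. Here is a minimal Python sketch, not the commands used during the incident: it swaps the ping for a TCP connect (ICMP needs raw sockets), and the function names `can_resolve`, `can_connect`, and `diagnose` are made up for illustration. The name you test should be one that isn't answered from a hosts file, or the lookup never reaches the resolver.

```python
import socket
import threading


def can_resolve(name, timeout=3.0):
    """Return True if the system resolver answers for `name` within `timeout` seconds."""
    outcome = []

    def lookup():
        try:
            socket.getaddrinfo(name, None)
            outcome.append(True)
        except OSError:  # socket.gaierror is a subclass of OSError
            outcome.append(False)

    # getaddrinfo takes no timeout argument, so run it in a daemon thread
    # and stop waiting after `timeout` seconds.
    worker = threading.Thread(target=lookup, daemon=True)
    worker.start()
    worker.join(timeout)
    return bool(outcome) and outcome[0]


def can_connect(ip, port, timeout=3.0):
    """Return True if a TCP connection to a raw IP succeeds; no DNS involved."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(resolves, connects):
    """Turn the two check results into a one-line verdict."""
    if connects and not resolves:
        return "connectivity fine, resolution dead"
    if resolves and not connects:
        return "resolution fine, connectivity dead"
    if resolves and connects:
        return "both fine"
    return "both dead"
```

Called as `diagnose(can_resolve("service.internal.example"), can_connect("10.0.0.10", 443))`, with a hypothetical name and address, a hung lookup alongside an instant connect yields the first verdict.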
The real fault was upstream of the resolver dying, though: I'd let a single resolver become load-bearing for the whole estate, with no second one in any client's resolver list. The fix was a fallback resolver in every client's config and an alert on the resolver itself, so next time it's a page about one box rather than a flood about all of them. DNS will always be the thing that takes you down. The least you can do is make it take you down quietly.
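One way to keep a lone resolver from creeping back in is to lint client configs for it. The sketch below assumes the clients use resolv.conf-style files. The addresses are invented, and `nameservers` and `has_fallback` are hypothetical helper names.

```python
def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text, in order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped[0] in "#;":  # blank line or comment
            continue
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def has_fallback(resolv_conf_text):
    """True if at least two distinct resolvers are listed."""
    return len(set(nameservers(resolv_conf_text))) >= 2


# Invented example configs: one with a lone resolver, one with a fallback.
single = "nameserver 10.0.0.53\n"
paired = (
    "nameserver 10.0.0.53\n"
    "nameserver 10.0.1.53\n"
    "options timeout:1 attempts:2\n"
)

print(has_fallback(single))  # False
print(has_fallback(paired))  # True
```

The `options timeout:1` line in the second config shortens how long a client waits on a silent resolver before trying the next one. The value is illustrative, not a recommendation.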