There is a joke in operations, old enough to be on a t-shirt: it's always DNS. The reason it is funny is the reason it is dangerous. We say it so often, half as a reflex, that when it actually is DNS we have stopped believing ourselves. This is the story of a morning where it really was DNS, and how long it took us to trust the joke.
It began by looking like a database problem. Requests across several services started timing out, and the timeouts all traced back to database calls hanging. The obvious read was that the database was struggling, so the first twenty minutes went into the database: connection pool, slow query log, replication lag, locks. The database was fine. CPU was idle, queries that did run were fast, the pool had spare connections. It was simply that hardly any queries were arriving, and the ones that did were preceded by a long pause.
That pause was the clue I walked straight past. Once a connection was established, everything was quick. The cost was all in getting connected, and getting connected meant resolving the database's hostname first.
$ time getent hosts db-primary.internal
db-primary.internal 10.0.2.30
real 0m4.812s
Nearly five seconds to resolve an internal hostname that should take single-digit milliseconds. There it was. Not the database. The lookup of the database.
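The same check is easy to script so that it gets run early rather than twenty minutes in. What follows is a minimal sketch in Python, not anything from our actual tooling. It assumes the standard library's socket.getaddrinfo goes through the host's configured resolver, much as getent does. The names time_lookup and is_slow, and the 100 ms budget, are mine and purely illustrative.

```python
import socket
import time


def time_lookup(hostname, resolve=socket.getaddrinfo, clock=time.perf_counter):
    """Return (seconds_taken, error) for a single lookup of hostname.

    resolve and clock are injectable so the function can be exercised
    without touching the network.
    """
    start = clock()
    error = None
    try:
        resolve(hostname, None)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        error = exc
    return clock() - start, error


def is_slow(seconds, threshold=0.1):
    """True if a lookup exceeded threshold seconds (0.1 s is an assumed budget)."""
    return seconds > threshold
```

Run against db-primary.internal that morning, time_lookup would have reported something close to the 4.8 seconds above, and is_slow would have flagged it long before anyone opened the slow query log.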
Following it down
Every service in the estate connects to its dependencies by name, not by IP, which is correct and which I would do again. But it means every single connection begins with a DNS query, and if DNS is slow, everything is slow, uniformly, in a way that looks like a general malaise rather than a specific fault. That is what made it so hard to localise. There was no single broken component to find, because the broken thing sat underneath all of them.
$ dig db-primary.internal @10.0.0.2
;; Query time: 4806 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)
Pointed directly at our internal resolver, the query took nearly five seconds. Pointed at a public resolver, for a public name, it was instant, which told me the network path off the box was fine and the problem was the internal resolver itself.
On the resolver host, the picture finally made sense. The DNS process was pinned, query latency through the roof, and the query log was a wall of repeated lookups for a handful of the same internal names, over and over, far more than our actual traffic should generate.
Why it fell over
The trigger was a deploy earlier that morning, naturally. A service had been rolled out with a client library whose connection handling had changed, and it was now resolving the database hostname on every single request rather than resolving once and caching. Multiply that by the request rate and you have a service hammering the internal resolver thousands of times a second for the same name. The resolver had no meaningful negative or positive cache tuning for that load because it had never needed any, and it simply fell behind. Once it was saturated, everyone else's lookups queued behind the flood, and the whole estate slowed to the speed of an overwhelmed name server.
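To make the difference concrete, here is a sketch of what "resolve once and cache" can look like on the client side. It is not the library's code, which I am not reproducing here. CachingResolver, the 30-second TTL, and the injectable resolve and clock parameters are all assumptions for illustration, and a real implementation would want to respect the record's own TTL and be safe to share across threads.

```python
import socket
import time


class CachingResolver:
    """Cache hostname lookups for a fixed TTL so repeated requests reuse the answer."""

    def __init__(self, ttl_seconds=30.0, resolve=socket.gethostbyname, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._resolve = resolve
        self._clock = clock
        self._cache = {}  # hostname -> (address, expiry time)
        self.upstream_queries = 0  # lookups that actually went to the resolver

    def lookup(self, hostname):
        now = self._clock()
        cached = self._cache.get(hostname)
        if cached is not None and cached[1] > now:
            return cached[0]  # still fresh: no DNS query is sent
        self.upstream_queries += 1
        address = self._resolve(hostname)
        self._cache[hostname] = (address, now + self._ttl)
        return address
```

With something like this in place, a thousand requests inside the TTL cost one upstream query. Without it, they cost a thousand, and that multiplier is what the resolver was absorbing.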
So the chain was: one deploy, one library behaviour change, one hostname resolved far too often, one saturated resolver, every service in the company suddenly slow, all of it presenting as a database problem because the database happened to be the dependency everyone noticed timing out first.
Stopping the bleeding, then fixing it
The immediate fix was to roll back the offending deploy, which dropped the query rate back to sane levels and the resolver recovered within a minute. Sub-millisecond lookups, timeouts gone, dashboards green. The whole outage, from first alert to recovery, had run a little over an hour, and a depressing fraction of that was spent investigating a database that was never unwell.
The durable fixes came after, in the calm. We put a small caching resolver on each host so that repeated lookups for the same name are answered locally and never touch the central resolver at all, which both speeds things up and means a single overloaded resolver can no longer take the estate down. We added a second resolver and made the clients use both, so the load is shared and there is no longer one box whose bad day is everyone's bad day. And we added monitoring on resolver query latency and query rate, because the most galling part of the whole thing was that the one metric that would have pointed straight at the cause was the one metric we were not watching.
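For the monitoring piece, the check itself does not need to be clever. The sketch below shows the shape of it in Python: keep a rolling window of query samples, then flag when the 95th-percentile latency or the average query rate crosses a threshold. The class name ResolverMonitor and every threshold in it are assumptions for illustration, not our production values, and how the samples get collected from the resolver is left out entirely.

```python
import math
from collections import deque


class ResolverMonitor:
    """Track recent resolver queries and flag high latency or high query rate."""

    def __init__(self, window_seconds=60.0, max_p95_ms=50.0, max_qps=500.0):
        self._window = window_seconds
        self._max_p95_ms = max_p95_ms
        self._max_qps = max_qps
        self._samples = deque()  # (timestamp, latency_ms), oldest first

    def record(self, timestamp, latency_ms):
        self._samples.append((timestamp, latency_ms))
        # Drop samples that have fallen out of the rolling window.
        while self._samples and self._samples[0][0] <= timestamp - self._window:
            self._samples.popleft()

    def alerts(self):
        if not self._samples:
            return []
        latencies = sorted(ms for _, ms in self._samples)
        # Nearest-rank 95th percentile of latencies in the window.
        rank = max(1, math.ceil(0.95 * len(latencies)))
        p95 = latencies[rank - 1]
        # Average queries per second across the whole window.
        qps = len(self._samples) / self._window
        found = []
        if p95 > self._max_p95_ms:
            found.append("latency")
        if qps > self._max_qps:
            found.append("rate")
        return found
```

Either alert on its own would have pointed at the resolver within the first few minutes. Watching rate as well as latency matters because the flood of queries showed up before the resolver fell behind.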
The lesson is not "cache DNS", though you should. It is that a shared dependency sitting underneath everything will, when it fails, disguise itself as a failure of the things on top of it. DNS, time, the network, the certificate store: the truly load-bearing services are invisible precisely because they normally just work, and when they break they break in the costume of whatever sat closest to the user. So when everything is slow at once and you have ruled out the obvious culprit, do yourself a favour and time a name lookup early.
One last thing I keep on a sticky note now, because the morning proved it: when an outage is uniform, when everything is a bit slow rather than one thing being completely broken, look down, not across. A specific failure points at a specific component. A general slowness points at something everyone shares, and the things everyone shares are exactly the ones nobody owns and nobody watches. DNS, NTP, the shared certificate authority, the load balancer in front of all of it. We laugh about DNS because it is true. I no longer laugh quite so quickly, and I check it a great deal sooner.