It's always DNS. You know it's always DNS. I know it's always DNS. And yet on a Tuesday afternoon I spent the best part of two hours convinced it was anything but DNS, because the symptoms were so thoroughly wrong that the obvious answer felt too obvious.
Here is the short version: a database migration moved one of our internal Postgres instances to a new host. The migration itself was clean. The cutover was clean. The new instance came up, replicated, passed its health checks, and accepted connections. And then, gradually, over about fifteen minutes, half the fleet stopped being able to reach it whilst the other half was perfectly happy. Not all-or-nothing. Half. That "half" is what cost me the two hours.
what it looked like
The first alerts were generic. Elevated 5xx on two services, climbing slowly. The error in the application logs was a connection timeout to the database, which immediately points you at the database. So I looked at the database. It was fine. Connection count normal, CPU bored, replication lag in single-digit milliseconds. I could connect to it by hand from my laptop without a moment's hesitation.
So if the database is fine, it must be the network. I went and looked at the network. The network was also fine. No packet loss between the affected hosts and the new instance, latency flat, no firewall changes in the change log. I could nc -vz newhost 5432 from an affected box and it connected instantly. Which made no sense, because the application on that same box could not.
That contradiction, my own nc works but the app times out, is the kind of thing that makes you start doubting reality. I checked connection pool limits. I checked for a thundering herd of retries. I checked whether someone had quietly deployed something. I read the app's database config three times. I even briefly suspected the auth layer, because a few of the timeouts were preceded by a TLS handshake that never completed, and a half-finished handshake smells like a certificate problem.
It was none of that.
the actual cause
The new host had a new IP. Obviously. We'd updated the internal DNS record to point at it, with what we believed was a sensible 30-second TTL. The cutover plan assumed that within thirty seconds, every client would resolve the new address.
Two things were wrong with that assumption.
First, the application was using a connection pool that resolved the hostname exactly once, at startup, and then held onto the resolved IP for the life of the process. It never re-resolved. So any long-lived process was still dialling the old IP, which now belonged to a host that accepted the TCP connection on 5432 (something else had been scheduled onto it) but never completed a Postgres handshake. That explains the half-finished TLS: the connection landed somewhere, just not where we wanted.
Second, the boxes where my manual nc "worked" were the freshly restarted ones, which had re-resolved on boot and picked up the new IP. The boxes where the app failed were the ones that had been running for days. My test was accidentally only ever hitting the healthy half. I'd built myself a perfect blind spot.
So: not the database, not the network, not auth. A resolver cache inside a long-lived process, plus a TTL that the process simply did not honour because it never asked again.
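To make the failure mode concrete, here is a minimal Python sketch of the two behaviours. It is not our pool's code: the class names PinnedEndpoint and RefreshingEndpoint, the max_age parameter, and the injectable resolve and clock arguments are all hypothetical, chosen only to show the difference between resolving once and resolving again after a bounded age.

```python
import socket
import time
from typing import Callable, List, Optional, Tuple


def system_resolve(host: str, port: int) -> List[str]:
    """Ask the OS resolver for the host's current addresses."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


class PinnedEndpoint:
    """Hypothetical: resolves once at construction, never again (the failure mode)."""

    def __init__(self, host: str, port: int,
                 resolve: Callable[[str, int], List[str]] = system_resolve):
        self._port = port
        self._ips = resolve(host, port)  # the only lookup this object ever makes

    def address(self) -> Tuple[str, int]:
        return (self._ips[0], self._port)


class RefreshingEndpoint:
    """Hypothetical: re-resolves once the cached answer is older than max_age seconds."""

    def __init__(self, host: str, port: int, max_age: float = 30.0,
                 resolve: Callable[[str, int], List[str]] = system_resolve,
                 clock: Callable[[], float] = time.monotonic):
        self._host = host
        self._port = port
        self._max_age = max_age
        self._resolve = resolve
        self._clock = clock
        self._ips: List[str] = []
        self._resolved_at: Optional[float] = None

    def address(self) -> Tuple[str, int]:
        now = self._clock()
        expired = (self._resolved_at is None
                   or now - self._resolved_at >= self._max_age)
        if expired:
            # Look the name up again instead of trusting the old answer.
            self._ips = self._resolve(self._host, self._port)
            self._resolved_at = now
        return (self._ips[0], self._port)
```

Note that re-resolving only changes where new connections go. A connection that is already established stays attached to whatever IP it dialled until something closes it, so a pool also needs a maximum connection lifetime (or a restart) for the new address to take over completely.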
how I finally caught it
The thing that broke it open was the dumbest possible check, and I should have done it first. I ran, on an affected host:
getent hosts db.internal
# 10.4.2.11 db.internal <- correct, new IP
sudo ss -tnp | grep 5432
# ...ESTAB ... 10.4.2.7:5432 <- the OLD IP, from inside the app process
The OS resolved the new address. The application was talking to the old one. There it is.
The whole mystery collapsed into one line of ss output. The application's view of the world and the operating system's view of the world had diverged, and nothing in my higher-level tooling could see that gap because every tool I'd used resolved fresh each time.
the fixes
The immediate fix was a rolling restart, which forced every process to re-resolve. Within a couple of minutes the errors drained away. Boring, effective.
The real fixes were three changes, and although they sit in a list here, they're genuinely different in kind:
- We configured the pool to re-resolve periodically rather than caching forever. Most pools support this; ours did too, we'd just never set it.
- We stopped treating DNS as a cutover mechanism for anything that needed to happen quickly. For planned moves we now drain and restart deliberately, rather than hoping caches expire on their own. DNS is great for a lot of things and terrible as a fast switch.
- We added an alert on exactly the ss check above: a process holding an established connection to an IP that no longer matches the current resolution of its configured hostname. That would have paged us in seconds.
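The comparison behind that alert can be sketched in a few lines of Python. This is an illustration rather than our actual alerting rule: the function names parse_established_peers and stale_peers are hypothetical, and the parser assumes the default column layout of `ss -tn` output (state first, peer address in the fifth column), which is worth verifying on your own hosts.

```python
from typing import Iterable, List, Tuple


def parse_established_peers(ss_output: str, port: int) -> List[Tuple[str, int]]:
    """Extract (peer_ip, peer_port) for ESTAB rows whose peer port matches.

    Assumes the default `ss -tn` layout:
    State Recv-Q Send-Q Local-Address:Port Peer-Address:Port [Process]
    """
    peers = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue  # header row, other states, or a malformed line
        ip, _, peer_port = fields[4].rpartition(":")
        if not peer_port.isdigit() or int(peer_port) != port:
            continue
        # IPv6 peers are printed as [addr]:port, so drop the brackets.
        peers.append((ip.strip("[]"), int(peer_port)))
    return peers


def stale_peers(ss_output: str, current_ips: Iterable[str],
                port: int = 5432) -> List[str]:
    """Return peer IPs with established connections that DNS no longer returns."""
    current = set(current_ips)
    found = {ip for ip, _ in parse_established_peers(ss_output, port)}
    return sorted(found - current)
```

In use, ss_output would be the captured text of `ss -tn` on the host and current_ips whatever the OS resolver returns for the configured hostname at that moment. A non-empty result is the condition worth alerting on.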
the lesson, such as it is
The mistake wasn't believing it might be the database, or the network, or auth. Those were all reasonable hypotheses given the evidence in front of me. The mistake was trusting tools that quietly did the right thing (re-resolving) when the broken component was doing the wrong thing (not re-resolving). My instruments were healthier than the patient, and they lied to me by being correct.
It is, as ever, always DNS. But more precisely it's almost never the DNS server. It's something downstream deciding it already knows the answer and never asking again. Caches are wonderful right up until the moment the thing they cached changes, and then they are a small, invisible time bomb sitting inside a process that looks completely healthy from the outside.
I've kept that ss | grep one-liner in a notes file ever since. When the database is fine and the network is fine and nothing makes sense, that's the line I run before I let myself believe anything more interesting.