it wasn't the database, it wasn't the network, it was dns again

The application started throwing connection errors to the database at about half nine in the morning. Not all of them, not all at once, just a rising tide of timeouts that got worse over twenty minutes until roughly a third of requests were failing. The database itself was fine. Its own dashboards were green, its load was normal, it was sitting there perfectly healthy and wondering why fewer people were talking to it. So naturally we spent the first half hour looking at the database, because that's where the errors said the problem was, and the errors were lying.

The shape of the lie

The application logs reported a connection timeout to the database's hostname. That's an honest-looking error and it sends you straight to the wrong place. A connection timeout means "I tried to reach this host and got nowhere", and the obvious reading is that the host is down or the network between us is broken. Both of those were false. The host was up, the network was fine, and yet a third of our app servers genuinely could not reach the database.

A third. That number turned out to be the entire clue, and we walked past it for ages. If the database were down, it'd be all of the app servers, not a third. If the network were partitioned, it'd be a clean split along some topology boundary, not a scattered third. A fraction that doesn't map to any physical boundary is the signature of something per-host and stateful. Something each server carries its own copy of and gets independently wrong.

Something like a cache.

What had actually happened

The database had been failed over to a standby during a maintenance window the night before, which is routine and went fine. The DNS record for the database's hostname was updated to point at the new primary's address. Also routine. The TTL on that record was five minutes, so within five minutes every resolver in the building should have picked up the new address and moved on.

Most did. Some didn't, because of a thing I'd half-known and never properly internalised: the TTL is a request, not a command. A resolver, or a caching layer, or an application runtime that does its own name caching, is free to hold a record longer than its TTL, and several things in our stack did exactly that. The JVM in particular, on the servers running the older config, was caching DNS resolutions for the lifetime of the process, because someone long ago set the security property that does that and nobody had revisited it. Those servers had resolved the database hostname once, at startup, weeks ago, to the old address, and were never going to look again.

So the "third of servers" was precisely the set that had been started before the failover and were pinned to a dead address. They were dutifully connecting to where the database used to be, getting nothing, and timing out. The other two-thirds had either restarted recently or had a sensible cache TTL, and were happily talking to the new primary.
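To make the pinning concrete, here is a toy model of a per-process name cache. It is a sketch only, not how the JVM implements its cache: the class and method names are made up for illustration, time is simulated, and the only real details borrowed from the incident are the hostname and the two addresses.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Toy model of a per-process name cache (illustrative, not the JVM's implementation). */
public class PinnedCacheDemo {

    /** A cached answer plus the simulated time at which it was looked up. */
    static final class Entry {
        final String address;
        final long resolvedAtSeconds;
        Entry(String address, long resolvedAtSeconds) {
            this.address = address;
            this.resolvedAtSeconds = resolvedAtSeconds;
        }
    }

    static final class NameCache {
        private final long ttlSeconds; // negative means "cache forever"
        private final Map<String, Entry> entries = new HashMap<>();

        NameCache(long ttlSeconds) { this.ttlSeconds = ttlSeconds; }

        /** Returns the cached address, asking upstream again only if the entry has expired. */
        String resolve(String host, long nowSeconds, Function<String, String> upstream) {
            Entry cached = entries.get(host);
            boolean expired = cached == null
                    || (ttlSeconds >= 0 && nowSeconds - cached.resolvedAtSeconds >= ttlSeconds);
            if (expired) {
                cached = new Entry(upstream.apply(host), nowSeconds);
                entries.put(host, cached);
            }
            return cached.address;
        }
    }

    /** Simulates: resolve at startup, the record changes, resolve again an hour later. */
    static String addressAfterFailover(long ttlSeconds) {
        String[] dnsAnswer = {"10.4.2.9"};                    // old primary at startup
        NameCache cache = new NameCache(ttlSeconds);
        cache.resolve("db.internal", 0, h -> dnsAnswer[0]);   // the startup lookup
        dnsAnswer[0] = "10.4.2.17";                           // failover: record now points at the new primary
        return cache.resolve("db.internal", 3600, h -> dnsAnswer[0]);
    }

    public static void main(String[] args) {
        System.out.println("cache forever: " + addressAfterFailover(-1)); // prints 10.4.2.9
        System.out.println("30s cache:     " + addressAfterFailover(30)); // prints 10.4.2.17
    }
}
```

The record's own five-minute TTL never appears in the model, and that is the point: only the cache's policy decides when the process asks again.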

$ getent hosts db.internal           # on a healthy box
10.4.2.17   db.internal
$ getent hosts db.internal           # on a sick box, via the app's own resolver
10.4.2.9    db.internal              # the OLD primary, gone since last night

That second line is the whole outage in one command. Same hostname, two answers, depending on who you ask and when they last bothered to ask.
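One caveat when reproducing this: getent asks the operating system's resolver, whereas a JVM keeps its own in-process cache, so the application's view is most reliably checked from inside the running process. A minimal sketch of such a check follows. The ResolverView class is hypothetical, and how you would expose it (a diagnostics endpoint, a JMX bean) is left as an assumption.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic helper: reports what THIS JVM resolves a name to. */
public class ResolverView {

    /** Returns every address the running JVM currently associates with the host. */
    static List<String> addressesSeenByThisJvm(String host) {
        List<String> out = new ArrayList<>();
        try {
            // InetAddress consults the JVM's own cache before asking the OS, so a
            // long-lived process can return an answer the OS resolver no longer gives.
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                out.add(addr.getHostAddress());
            }
        } catch (UnknownHostException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(host + " -> " + addressesSeenByThisJvm(host));
    }
}
```

Running this as a separate, freshly started JVM would perform a fresh lookup and show the new address, which is exactly why it has to be called from within the application process you are suspicious of.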

Fixing it, and then fixing it properly

The immediate fix was a rolling restart of the affected app servers, which threw away their pinned caches and let them resolve fresh. Service recovered within minutes. That's the firefighting, and it's the part that doesn't actually fix anything; it just makes the symptom stop until next time.

The real fix had two parts. First, the JVM DNS cache: we set networkaddress.cache.ttl to a sane value, thirty seconds, so the runtime would actually honour DNS changes rather than treating its first lookup as gospel forever. That alone would have turned this outage into a thirty-second blip nobody noticed.
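For illustration, this is roughly what that setting looks like when applied in code rather than in the JVM's java.security file. The DnsCacheConfig class and its method name are made up for this sketch; networkaddress.cache.ttl and java.security.Security are the real property and API.

```java
import java.security.Security;

/** Hypothetical startup helper that caps the JVM's cache of successful DNS lookups. */
public class DnsCacheConfig {

    /** Sets how long, in seconds, this JVM may cache a successful lookup. */
    static void capPositiveDnsCache(int seconds) {
        // Assumption: call this early in startup, before the first lookup.
        // OpenJDK appears to read the policy once, when its address cache initialises,
        // so setting it later may have no effect on an already-running process.
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(seconds));
    }

    public static void main(String[] args) {
        capPositiveDnsCache(30);
        System.out.println(Security.getProperty("networkaddress.cache.ttl")); // prints 30
    }
}
```

Putting the same line in the java.security file has the advantage of applying to every process on that runtime, including the ones nobody remembers to patch.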

Second, and more philosophically, we stopped pretending DNS TTLs guarantee anything about propagation. For something as important as the database endpoint, hoping that every layer of caching across the estate respects a five-minute TTL is wishful thinking. Where we could, we moved to failover mechanisms that don't depend on every client noticing a name change in a timely way: a virtual IP that moves with the primary, so the address stays put even when the machine behind it changes. DNS is wonderful for a lot of things, but using it as a real-time failover signal asks it to be fast and consistent, and it is reliably neither.

The half-hour we lost looking at the database

I want to dwell on that first half hour, because it's where the real lesson lives, and it's not a technical one. We looked at the database first because the error message told us to, and the error message was technically accurate and completely misleading. "Connection timeout to db.internal" is true: the connection did time out. It just doesn't tell you why, and the why was three layers below the message, in a name resolution that happened weeks earlier on machines nobody was looking at.

The trap is that a precise-sounding error feels like a lead, and a lead feels like progress, so you follow it. We checked the database's connection count, its slow query log, its replication lag, its disk, all green, all healthy, and each green check felt like ruling something out when really it was confirming we were in the wrong place. The thing that finally broke us out of it wasn't a clever idea, it was someone stepping back and asking the dumb question: why a third? Not which third, just why a third at all. The answer to "why a fraction with no physical boundary" is almost always per-host state, and per-host state in a connection problem is almost always a cache. From there it was minutes.

I've tried to make that a habit since. When an outage hits a strange fraction of a fleet, I now ask what each member of the fleet carries its own private copy of, before I ask anything about the shared thing they're all failing to reach. Caches, connection pools, resolved addresses, locally pinned config: the stuff that's identical in source but divergent at runtime. That category answers a surprising number of "but the server's fine" mysteries.
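That habit can be turned into a quick check: collect the same piece of private state from every host and group the fleet by the answer. The sketch below assumes you already have some way of gathering each host's resolved address; the host names are invented and the FleetDivergence class is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Groups a fleet by a value each host holds privately, to expose disagreement. */
public class FleetDivergence {

    /** Maps each reported value to the sorted list of hosts reporting it. */
    static Map<String, List<String>> groupByReportedValue(Map<String, String> valueByHost) {
        Map<String, List<String>> hostsByValue = new TreeMap<>();
        // Iterate hosts in sorted order so each group's host list is stable.
        for (Map.Entry<String, String> e : new TreeMap<>(valueByHost).entrySet()) {
            hostsByValue.computeIfAbsent(e.getValue(), v -> new ArrayList<>()).add(e.getKey());
        }
        return hostsByValue;
    }

    public static void main(String[] args) {
        Map<String, String> resolved = new TreeMap<>();
        resolved.put("app-01", "10.4.2.17"); // hypothetical hosts
        resolved.put("app-02", "10.4.2.9");
        resolved.put("app-03", "10.4.2.17");

        Map<String, List<String>> groups = groupByReportedValue(resolved);
        groups.forEach((value, hosts) -> System.out.println(value + " <- " + hosts));
        if (groups.size() > 1) {
            System.out.println("fleet disagrees: suspect per-host state");
        }
    }
}
```

More than one group means the hosts disagree about something they should share, and the smaller group is usually the one worth logging into first.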

The lesson, again, in yet another costume: when the error says "I can't reach X" and X is demonstrably fine, stop trusting the address. Ask each affected machine what address it's actually using, with that machine's own resolver, the way the application sees it. Some days it looks like a database outage and some days it looks like a network outage, and half the time it's DNS holding an answer well past its welcome.