A service was timing out talking to a database. Not always, not predictably, just often enough that a dashboard went red a few times an hour and woke someone up. The application logs said "connection timed out after 5000ms". The database logs said nothing at all, which is to say the database had no record of the connection ever arriving. Two sources of truth, both confident, completely contradicting each other. That gap between "the client tried" and "the server never heard" is exactly where I reach for tcpdump, because it doesn't have an opinion. It just tells you what crossed the wire.
I'd love to say I went straight to the capture. I didn't. I spent a good hour first doing the things everyone does: checking connection pool sizes, bumping the timeout from 5s to 10s, restarting the app on the theory that something was wedged, squinting at the database's max_connections. All reasonable, all useless, because none of them were the problem and I was guessing. There's a particular kind of pressure when a dashboard is red and people are watching, and that pressure pushes you towards activity rather than diagnosis. You change something, anything, because doing nothing feels worse, even though doing nothing thoughtfully is usually the better move. Eventually I got tired of guessing and did the thing I should have done first.
The reason I resist tcpdump for that first hour, every single time, is that it feels like a heavier instrument than the problem deserves. Surely a connection timeout is an application thing? Surely the logs will have it? They almost never do, because a log line is written by code that has a theory about what's happening, and the code's theory here was simply "I gave up after five seconds". That's true and it's useless. The packets carry no theory at all.
tcpdump -i any -n -w db.pcap host 10.4.1.20 and port 5432
Capture on the app host, filtered to the database IP and the Postgres port, written to a file so I could open it properly later. I left it running until the dashboard went red again, which took about four minutes, and stopped it. Then I read it.
What I expected to see was a clean handshake: SYN out, SYN-ACK back, ACK, then the Postgres startup chatter. What I actually saw, on the failing connections, was this shape repeating:
10.2.0.5.51234 > 10.4.1.20.5432: Flags [S], seq 1234567890
10.2.0.5.51234 > 10.4.1.20.5432: Flags [S], seq 1234567890 (1s later)
10.2.0.5.51234 > 10.4.1.20.5432: Flags [S], seq 1234567890 (2s later)
10.2.0.5.51234 > 10.4.1.20.5432: Flags [S], seq 1234567890 (4s later)
SYN going out, retransmitted with the textbook exponential backoff, and never a single SYN-ACK coming back. The client was doing everything right. The packets were leaving the box. Nothing was answering. So the timeout was real, but it wasn't a database timeout in any meaningful sense: the database never even got a chance to be slow. The connection was dying before the handshake completed, which immediately rules out half the things I'd been poking at. Pool sizes don't matter if you never establish a socket.
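If you'd rather not eyeball a long capture for that shape, it can be picked out mechanically. What follows is a minimal sketch, not the tooling I actually used: it assumes you've dumped the capture to text with something like tcpdump -n -r db.pcap, that the traffic is IPv4, and that a bare [S] is a SYN and [S.] is a SYN-ACK. The function name and the sample lines are made up for illustration.

```python
import re
from collections import defaultdict

# Matches the "src.port > dst.port: Flags [..]" core of a tcpdump -n text line (IPv4 only).
LINE = re.compile(
    r"(\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+): Flags \[([^\]]+)\]"
)

def unanswered_syns(lines):
    """Return {(client, server): syn_count} for flows that sent SYNs but never got a SYN-ACK."""
    syn_counts = defaultdict(int)
    answered = set()
    for line in lines:
        m = LINE.search(line)
        if not m:
            continue  # not a TCP line we understand; skip it
        src = (m.group(1), int(m.group(2)))
        dst = (m.group(3), int(m.group(4)))
        flags = m.group(5)
        if flags == "S":        # bare SYN, client -> server (counts retransmits too)
            syn_counts[(src, dst)] += 1
        elif flags == "S.":     # SYN-ACK, server -> client, so flip the direction
            answered.add((dst, src))
    return {flow: n for flow, n in syn_counts.items() if flow not in answered}

# Hypothetical sample: one flow that never gets an answer, one healthy handshake.
CAPTURE = """\
10.2.0.5.51234 > 10.4.1.20.5432: Flags [S], seq 1234567890
10.2.0.5.51234 > 10.4.1.20.5432: Flags [S], seq 1234567890
10.2.0.5.51300 > 10.4.1.20.5432: Flags [S], seq 777
10.4.1.20.5432 > 10.2.0.5.51300: Flags [S.], seq 42, ack 778
10.2.0.5.51234 > 10.4.1.20.5432: Flags [S], seq 1234567890
10.2.0.5.51234 > 10.4.1.20.5432: Flags [S], seq 1234567890
""".splitlines()

for (client, server), count in unanswered_syns(CAPTURE).items():
    print(f"{client[0]}:{client[1]} -> {server[0]}:{server[1]}: {count} SYNs, no SYN-ACK")
```

A flow that shows up here with several SYNs and no reply is the pattern above: the client is retrying and nothing is answering.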
So the SYNs left the app host and vanished. The question becomes: where? Running the same kind of capture on the database host answered it. There, for the failing attempts, I saw nothing. No SYN arriving at all. The packet left one machine and never reached the other. That is a network problem, not an application problem, and now I had proof rather than a hunch, which changes the conversation entirely. You can argue with "the app feels slow". You cannot argue with a pcap showing a SYN that left and never landed.
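The comparison between the two hosts can be sketched the same way: collect the SYNs each side saw and subtract. This is an illustration under assumptions, not a general tool. It assumes both captures have been dumped to text with tcpdump -n, that nothing in the path rewrites addresses or ports (NAT would break the matching), and that the initial sequence number printed on a SYN is enough to identify it. The names and sample lines are hypothetical.

```python
import re

# A bare SYN in tcpdump -n text output, with its initial sequence number (IPv4 only).
SYN = re.compile(
    r"(\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+): Flags \[S\], seq (\d+)"
)

def syn_keys(lines):
    """Return the set of (src_ip, src_port, dst_ip, dst_port, seq) for every SYN seen."""
    keys = set()
    for line in lines:
        m = SYN.search(line)
        if m:
            keys.add((m.group(1), int(m.group(2)), m.group(3), int(m.group(4)), int(m.group(5))))
    return keys

def lost_in_transit(client_lines, server_lines):
    """SYNs the client host sent that the server host never recorded."""
    return sorted(syn_keys(client_lines) - syn_keys(server_lines))

# Hypothetical text dumps from each side.
APP_HOST = [
    "10.2.0.5.51234 > 10.4.1.20.5432: Flags [S], seq 1234567890",
    "10.2.0.5.51234 > 10.4.1.20.5432: Flags [S], seq 1234567890",
    "10.2.0.5.51240 > 10.4.1.20.5432: Flags [S], seq 555",
]
DB_HOST = [
    "10.2.0.5.51240 > 10.4.1.20.5432: Flags [S], seq 555",
]

for src_ip, src_port, dst_ip, dst_port, seq in lost_in_transit(APP_HOST, DB_HOST):
    print(f"SYN {src_ip}:{src_port} -> {dst_ip}:{dst_port} seq {seq} left but never arrived")
```

Retransmits collapse into one entry because they reuse the same sequence number, so the output is one line per connection attempt that disappeared rather than one per packet.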
The culprit turned out to be a stateful firewall in the path with a connection table that was full. Under load it was silently dropping new flows while happily passing established ones, which is exactly why the failures were intermittent and why existing long-lived connections were fine. New connections got dropped, old ones sailed through. The "5 second timeout" was just the client giving up after its SYN retries went unanswered. The firewall's own counters confirmed it once I knew to look: a steadily climbing drop count on a table that had been sized for a quieter version of this system, years ago, by someone who is not to blame because the traffic has tripled since.
The fix was boring: raise the table size, add monitoring on its utilisation, and put an alert on the drop counter so the next time it fills we find out before the dashboard does. Boring fixes are the good ones. I also added a little headroom on top of what the current traffic needs, because the previous sizing had been correct for its day and simply outgrown, and I'd rather not repeat someone else's perfectly reasonable mistake in three years' time when the traffic has tripled again.
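The alerting logic itself is small. Here is a rough sketch of the two checks, with everything specific invented for illustration: the TableSample fields, the 80% threshold and the check function are all hypothetical, and how you actually read the table size and drop counter depends entirely on your firewall, so that part is left out.

```python
from dataclasses import dataclass

@dataclass
class TableSample:
    entries: int    # current entries in the connection table
    capacity: int   # configured maximum number of entries
    drops: int      # cumulative count of new flows dropped because the table was full

def check(prev, curr, util_threshold=0.8):
    """Compare two consecutive samples and return a list of alert messages (empty if healthy)."""
    alerts = []
    utilisation = curr.entries / curr.capacity
    if utilisation >= util_threshold:
        alerts.append(f"connection table at {utilisation:.0%} of capacity")
    # The drop counter is cumulative, so alert on its increase, not its absolute value.
    # A negative delta (counter reset after a reboot) is deliberately ignored here.
    new_drops = curr.drops - prev.drops
    if new_drops > 0:
        alerts.append(f"{new_drops} new flows dropped since last sample")
    return alerts

# Hypothetical samples: the table is filling up, but nothing has been dropped yet.
previous = TableSample(entries=500, capacity=1000, drops=10)
current = TableSample(entries=850, capacity=1000, drops=10)
print(check(previous, current))
```

The utilisation check is the early warning; the drop-counter check is the one that says connections are already being lost. Having both means the first should fire well before the second.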
The thing I keep coming back to is that none of this was discoverable from the logs. The application log was honest but uninformed: it knew it had failed and nothing about why. The database log was honest and absent: it genuinely had nothing to report. Both were telling the truth and both were useless, because the failure happened in the space between them, in the network, where neither of them can see. tcpdump lives in that space. It doesn't know what Postgres is, it doesn't care about your retry logic, it just shows you the packets, and packets don't lie about whether they arrived.
I have a small rule now, learned the slow way more times than I'd like to admit: if two components disagree about whether they're talking to each other, stop reading their logs and capture the traffic between them. The wire is the only neutral witness. It cost me an hour of guessing to relearn that this week. The capture itself took four minutes and told me more than the hour had.