When nobody believes the network, run tcpdump

There is a particular kind of outage where two teams are each certain the other is at fault, and both have logs that prove it. The application team's logs say "connecting to database, connection established, query sent, no response". The database team's logs say "we never saw a connection from that host". Both are telling the truth as their software understands it. The truth as the wire understands it is the only one that settles the argument, and the wire does not keep logs. It keeps packets, for as long as you are watching. So I started watching.

This is the moment I reach for tcpdump, and it has bailed me out enough times that I no longer feel any embarrassment about dropping to it early. Application logs tell you what the application believes happened. tcpdump tells you what actually crossed the interface. When those two stories disagree, the application is wrong, every time, because the application is at least one abstraction layer above the place the lie is being told.

Capturing on both ends

The single most useful move is to capture on both ends at once. One capture tells you what left or arrived at one machine. Two captures tell you what got lost in between, which is usually the whole question.

On the application host, watching traffic to the database:

tcpdump -ni eth0 -w app-side.pcap host 192.0.2.50 and port 5432

On the database host, watching for traffic from the application:

tcpdump -ni eth0 -w db-side.pcap host 198.51.100.10 and port 5432

A few flags earn their place. -n turns off name resolution; the DNS lookups it would otherwise do both slow the capture and, more annoyingly, generate traffic of their own that clutters the result. -i pins the interface so you are not capturing the wrong one on a multi-homed box. -w writes raw packets to a file so I can pull both captures back to my laptop and read them side by side, rather than squinting at two SSH sessions. And the filter, host ... and port ..., keeps the file small and the signal high on a busy machine.
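As a rough sketch, the same capture can be wrapped in a small shell function so that both ends run an identical command with only the peer address changed. The name capture_peer and its argument order are my own invention for illustration, not anything tcpdump provides. The options go ahead of the filter, which is the order the tcpdump synopsis uses, and the filter is passed as a single quoted argument.

```shell
# Hypothetical helper: capture traffic between this host and one peer on one port.
# Arguments: interface, peer address, port, output file.
capture_peer() {
  iface=$1
  peer=$2
  port=$3
  out=$4
  # -n: no name resolution, -i: pin the interface, -w: write raw packets to a file.
  # The filter expression comes last, as one quoted argument.
  tcpdump -n -i "$iface" -w "$out" "host $peer and port $port"
}

# Example (needs capture privileges; stop it with Ctrl-C):
#   capture_peer eth0 192.0.2.50 5432 app-side.pcap      # on the application host
#   capture_peer eth0 198.51.100.10 5432 db-side.pcap    # on the database host
```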

Reading the handshake

With both files in front of me, the story took about thirty seconds to read. On the application side I could see the outbound SYN, neat and well-formed, leaving for the database on port 5432. On the database side, that SYN simply never appeared. No SYN, no SYN-ACK, nothing. The packet left one machine and did not arrive at the other.

That immediately rules out an enormous amount of speculation. It is not the database refusing connections, because a refusal is an active RST and there was no RST. It is not a slow query or connection-pool exhaustion, because the connection never got as far as being established. It is not DNS, because the application was clearly dialling the right address. The packet was being dropped somewhere on the path between the two hosts, before the handshake could complete, and now I had a much smaller place to look.
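For reading the files back, something like the following narrows each capture to the packets that matter for the handshake. It is a sketch that assumes IPv4 traffic, since the tcp[tcpflags] form of the filter indexes into the TCP header of IPv4 packets. The function names are hypothetical; the filter expressions are standard pcap-filter syntax.

```shell
# Initial SYNs only: SYN set, ACK clear.
show_syns() {
  tcpdump -nr "$1" 'tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn'
}

# SYN-ACKs only: both SYN and ACK set.
show_synacks() {
  tcpdump -nr "$1" 'tcp[tcpflags] & (tcp-syn|tcp-ack) == (tcp-syn|tcp-ack)'
}

# Any packet with RST set, which is what an active refusal would look like.
show_rsts() {
  tcpdump -nr "$1" 'tcp[tcpflags] & tcp-rst != 0'
}

# Example:
#   show_syns app-side.pcap
#   show_syns db-side.pcap
#   show_rsts app-side.pcap
```

An empty result from show_syns on the database-side file, next to a non-empty one on the application side, is the situation described above.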

The half-open handshake is the tell. A TCP connection opens with SYN, then SYN-ACK, then ACK. If you see the SYN leave but no SYN-ACK ever comes back, and the far end confirms it never saw the SYN at all, the loss is outbound and in the network. If the far end did see the SYN and replied with SYN-ACK but your side never received that, the loss is on the return path. Two captures turn "the connection is failing" into "the loss is in this specific direction", which is the difference between an afternoon and a coffee.
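The reasoning above can be sketched as a small decision helper. Everything here is illustrative: count_matching and classify_loss are hypothetical names, the counts come from running tcpdump against the two capture files, and the "server saw the SYN but did not answer" branch is my own addition to cover a case the text does not discuss.

```shell
# Count packets in a capture file that match a filter.
# tcpdump prints one line per packet; its "reading from file" notice goes to stderr.
count_matching() {
  tcpdump -nr "$1" "$2" 2>/dev/null | wc -l
}

# Hypothetical helper: name the direction of loss from four packet counts.
# Arguments: SYNs seen at the client, SYNs seen at the server,
#            SYN-ACKs seen at the server, SYN-ACKs seen at the client.
classify_loss() {
  client_syn=$1
  server_syn=$2
  server_synack=$3
  client_synack=$4
  if [ "$client_syn" -eq 0 ]; then
    echo "client never sent a SYN"
  elif [ "$server_syn" -eq 0 ]; then
    echo "SYN lost on the outbound path"
  elif [ "$server_synack" -eq 0 ]; then
    echo "server saw the SYN but did not answer"
  elif [ "$client_synack" -eq 0 ]; then
    echo "SYN-ACK lost on the return path"
  else
    echo "handshake packets crossed in both directions"
  fi
}

# Example, with both files copied to one machine:
#   SYN='tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn'
#   SYNACK='tcp[tcpflags] & (tcp-syn|tcp-ack) == (tcp-syn|tcp-ack)'
#   classify_loss "$(count_matching app-side.pcap "$SYN")" \
#                 "$(count_matching db-side.pcap "$SYN")" \
#                 "$(count_matching db-side.pcap "$SYNACK")" \
#                 "$(count_matching app-side.pcap "$SYNACK")"
```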

What it actually was

A security group, predictably. Someone had tightened the database's ingress rules earlier that week and the new rule allowed the wrong source range, so the SYN was being dropped at the database host's network boundary before tcpdump on the instance itself could even see it. That last detail mattered: because the drop happened upstream of the interface I was capturing on, even the database-side capture showed nothing, which is itself a clue. Had a firewall on the host itself, such as iptables on Linux, been doing the dropping, the capture would still have shown the SYN arriving, because tcpdump sees inbound packets before the host's packet filter acts on them. A packet that vanishes before reaching the interface of the machine it was addressed to has been killed by something in the path, and on a cloud network that something is nearly always a firewall or security-group rule rather than the host.

The fix took a minute. The diagnosis took ten, most of which was setting up the captures. Without them I would have spent the afternoon in a meeting where two teams read their respective logs at each other, each correct and each useless.

I keep coming back to the same principle. Logs are an interpretation. tcpdump is an observation. When an interpretation and an observation disagree, trust the observation, and when two teams disagree about the network, the fastest path to peace is not a better argument. It is a packet capture that nobody can talk their way around.