The symptom was a service "timing out" talking to a downstream API, intermittently, in a way that survived every higher-level explanation I threw at it. The application logs said connection timeout. The metrics dashboard agreed and drew a sad little spike. The other team swore their API was healthy and pointed at their own green dashboard. Everyone was looking at their own instruments, all of which were technically true and none of which told me where the packets were actually going. So I did what I always end up doing: I stopped reading dashboards and went to the wire.
tcpdump -nn -i any host 10.20.4.17 and port 443
A few minutes of that during a failing window and the story was on the screen in plain text. We were sending SYN packets and getting nothing back. Not a RST, which would mean something actively refusing us. Not a slow response, which would mean the API was busy. Just silence: SYN, retransmit, retransmit, retransmit, then the application giving up and logging its honest-but-useless "timeout".
Silence after a SYN is a specific accusation. The application wasn't slow and the API wasn't refusing us; our packets were leaving and not coming back, which points at the path, not the endpoints. That reframed the whole problem. I'd been treating it as "the API is flaky" and it was actually "some of our connections never reach the API at all."
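The "SYN, retransmit, retransmit, silence" pattern is easy to pick out of saved capture text mechanically. What follows is a minimal sketch, not the tooling I used at the time. It assumes tcpdump's default one-line text output with `-nn` and IPv4 addresses. The function name `unanswered_syns`, the client address 10.0.0.5 and the sample lines are all made up for illustration.

```python
import re
from collections import defaultdict

# Matches tcpdump's one-line TCP summary (as printed with -nn), e.g.
#   "... IP 10.0.0.5.40000 > 10.20.4.17.443: Flags [S], seq ..."
LINE_RE = re.compile(
    r"IP (?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+): Flags \[(?P<flags>[^\]]*)\]"
)

def unanswered_syns(lines, server_ip, server_port):
    """Return {(client_ip, client_port): syn_count} for flows where SYNs
    went to the server and no packet of any kind came back from it."""
    syn_counts = defaultdict(int)
    answered = set()
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not a TCP summary line we recognise
        src, sport = m["src"], int(m["sport"])
        dst, dport = m["dst"], int(m["dport"])
        if (dst, dport) == (server_ip, server_port) and m["flags"] == "S":
            syn_counts[(src, sport)] += 1   # initial SYN or a retransmit
        elif (src, sport) == (server_ip, server_port):
            answered.add((dst, dport))      # SYN-ACK, RST, anything at all
    return {k: n for k, n in syn_counts.items() if k not in answered}

# Illustrative lines only, not output from the real incident.
sample = [
    "12:00:00.000000 IP 10.0.0.5.40000 > 10.20.4.17.443: Flags [S], seq 1, win 64240, length 0",
    "12:00:01.010000 IP 10.0.0.5.40000 > 10.20.4.17.443: Flags [S], seq 1, win 64240, length 0",
    "12:00:03.030000 IP 10.0.0.5.40000 > 10.20.4.17.443: Flags [S], seq 1, win 64240, length 0",
    "12:00:00.500000 IP 10.0.0.5.40001 > 10.20.4.17.443: Flags [S], seq 9, win 64240, length 0",
    "12:00:00.501000 IP 10.20.4.17.443 > 10.0.0.5.40001: Flags [S.], seq 7, ack 10, win 65160, length 0",
]
print(unanswered_syns(sample, "10.20.4.17", 443))
# → {('10.0.0.5', 40000): 3}
```

One caveat: capturing on `-i any` can show the same packet on more than one interface, so treat the counts as a hint rather than an exact retransmit tally.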
The next step was to find out which of our connections were affected. We sit behind a NAT with a pool of source addresses, and once I had the SYNs in front of me I could see the failing ones were clustered. Cross-referencing the source ports against the conntrack table told the rest of it: the NAT box was exhausting its available source ports to that single destination under load, and new connections were being silently dropped because there was no free tuple to map them onto. Not a firewall rule, not the API, not the application. A capacity limit on a box nobody had thought about, behaving exactly as configured, with no log line anywhere to announce it.
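The arithmetic behind that kind of limit is worth having in your head. As a rough sketch: if each mapping to one destination address and port needs its own (source address, source port) pair, and a mapping is held for some number of seconds before the port can be reused, then the sustainable rate of new connections to that destination is roughly the number of pairs divided by the hold time. The function name and every number below are hypothetical, not the values from our NAT, and real devices differ in how they allocate and reclaim ports.

```python
def max_new_connections_per_sec(source_ips, ports_per_ip, hold_seconds):
    """Rough ceiling on new connections per second to ONE destination ip:port,
    assuming each mapping occupies one (source ip, source port) pair for
    hold_seconds before it can be reused. A simplification, not a model of
    any particular NAT implementation."""
    if hold_seconds <= 0:
        raise ValueError("hold_seconds must be positive")
    return source_ips * ports_per_ip / hold_seconds

# Hypothetical "before": one source address, 28232 usable ports,
# mappings held for 120 seconds.
before = max_new_connections_per_sec(source_ips=1, ports_per_ip=28232, hold_seconds=120)

# Hypothetical "after": a wider port range and a shorter hold time.
after = max_new_connections_per_sec(source_ips=1, ports_per_ip=64512, hold_seconds=30)

print(f"{before:.0f}/s -> {after:.0f}/s")
# → 235/s -> 2150/s
```

The point of the exercise is that both knobs multiply: widening the range and shortening the hold time each raise the ceiling, and a burst above the ceiling shows up on the wire as nothing more than unanswered SYNs.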
The fix was a config change on the NAT: a larger port range and shorter timeouts on idle mappings. The spikes went away. But the fix isn't the point of writing this down. The point is the reflex.
Every layer in that stack reported what it locally observed, and every report was accurate and misleading at the same time. The application correctly saw a timeout. The API was correctly healthy. Both dashboards were honest. The only thing that doesn't editorialise is the packet capture, because it shows you what genuinely traversed the interface rather than what some component concluded about it. When the explanations from different layers contradict each other, the wire is the referee, and tcpdump is how you ask it. It has bailed me out so many times now that "what does the capture say" has become the first question I ask rather than the last, and I'm always faster for it.