tcpdump saved me again

The application logs said the request had been sent. The downstream logs said nothing had arrived. Both teams were certain, both were polite about it, and the request was still missing. This is the exact shape of problem where I stop reading dashboards and start reading packets.

The symptom was a payment callback that worked perfectly in staging and failed roughly one time in twenty in production. One in twenty is the worst ratio there is. Too frequent to dismiss as a fluke, too rare to reproduce on demand. The retry logic papered over most of it, but a small number of callbacks were being delivered twice downstream, which for a payment system is not a rounding error you want to explain to anyone.

start where the truth is

Logs lie. Not maliciously; they just describe what the application thinks happened, which is a record of its intentions rather than of what actually went down the wire. The kernel does not have intentions. So I got onto the box that makes the outbound call and started small.

tcpdump -i any -n -w callback.pcap 'host 10.4.18.22 and port 443'

I deliberately wrote to a file rather than watching it live. When you are chasing a one-in-twenty event you cannot sit there reading scrollback hoping to catch it. Capture to disk, let it run, and pull the file apart afterwards in Wireshark where you have filters and colour and the luxury of time. The -n stops tcpdump resolving addresses to names, which matters whenever it prints packets: reverse DNS lookups slow the output and put extra DNS traffic of their own on the wire you are trying to observe.

A word on -i any. It is convenient and I reach for it constantly, but on Linux it records a "cooked" pseudo-header in place of the real link-layer header, and it will happily show you the same packet twice if it traverses multiple interfaces. For this job I wanted the truth on one specific path, so once I had a feel for the traffic I pinned the capture to the actual egress interface and re-ran it.

the pattern in the bytes

After about forty minutes I had a capture with three of the failed callbacks in it. I opened it up and filtered on one of the destination tuples. The good requests looked exactly as you would expect: SYN, SYN-ACK, ACK, the TLS handshake, the request, a clean response, a polite FIN dance at the end.

The bad ones diverged in a very specific place. The handshake completed, the request went out, and then, before any response, an RST came back from the server side. Not a timeout. Not a connection refused. A connection that was established, accepted our data, and was then torn down with a reset.

That changes the entire diagnosis. A timeout points you at the network or an overloaded backend. An RST mid-conversation points you at something on the receiving end actively deciding to hang up. The packet capture also gave me the timing: the RST arrived around 30 seconds after the request, every time, give or take a few milliseconds. Thirty seconds is not a number that appears in nature. It is a number someone typed into a config file.
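I did the timing by eye in Wireshark, but the same measurement can be scripted. What follows is a minimal sketch, not what I actually ran: it assumes a classic pcap file (not pcapng) with microsecond timestamps, captured on an Ethernet interface, carrying IPv4. A capture taken with -i any would fail the link-type check. The names read_tcp_packets and reset_delays are made up for illustration. It reports, per client flow, the gap between the last client payload and a reset from the server.

```python
import struct
from collections import namedtuple

TcpPacket = namedtuple("TcpPacket", "ts src sport dst dport flags payload_len")

TCP_RST = 0x04  # RST bit in the TCP flags byte


def read_tcp_packets(pcap_bytes):
    """Yield TcpPacket records from classic pcap bytes (Ethernet + IPv4 only)."""
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written on a little-endian machine
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written on a big-endian machine
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")
    linktype = struct.unpack(endian + "I", pcap_bytes[20:24])[0]
    if linktype != 1:
        raise ValueError("this sketch only handles Ethernet captures")

    offset = 24  # size of the pcap global header
    while offset + 16 <= len(pcap_bytes):
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", pcap_bytes[offset:offset + 16])
        frame = pcap_bytes[offset + 16:offset + 16 + incl_len]
        offset += 16 + incl_len
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue  # too short, or EtherType is not IPv4
        ip = frame[14:]
        if ip[9] != 6:
            continue  # IP protocol is not TCP
        ip_header_len = (ip[0] & 0x0F) * 4
        ip_total_len = struct.unpack("!H", ip[2:4])[0]
        tcp = ip[ip_header_len:]
        if len(tcp) < 20:
            continue  # truncated TCP header
        sport, dport = struct.unpack("!HH", tcp[:4])
        tcp_header_len = (tcp[12] >> 4) * 4
        yield TcpPacket(
            ts=ts_sec + ts_usec / 1e6,
            src=".".join(str(b) for b in ip[12:16]), sport=sport,
            dst=".".join(str(b) for b in ip[16:20]), dport=dport,
            flags=tcp[13],
            payload_len=ip_total_len - ip_header_len - tcp_header_len,
        )


def reset_delays(packets, server_ip):
    """Return [((client_ip, client_port), seconds)] for each server RST that
    follows client payload on the same flow."""
    last_client_data = {}
    delays = []
    for p in packets:
        if p.dst == server_ip and p.payload_len > 0:
            last_client_data[(p.src, p.sport)] = p.ts
        elif p.src == server_ip and p.flags & TCP_RST:
            key = (p.dst, p.dport)
            if key in last_client_data:
                delays.append((key, p.ts - last_client_data.pop(key)))
    return delays
```

Run against the capture above, it would be called as reset_delays(read_tcp_packets(open("callback.pcap", "rb").read()), "10.4.18.22"). A cluster of delays around one round number is the thing to look for.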

the thirty-second tell

Thirty seconds is the default idle timeout on a great many load balancers and proxies. The downstream team sat behind one. Their actual application was slow to respond on certain callback types, the ones that did a synchronous fraud check, and on the slow path it occasionally crept past the proxy's idle timeout. The proxy, seeing no bytes for thirty seconds, reset the connection. Our client saw the RST, treated the call as failed, and retried. Sometimes the original slow request had in fact completed on the far side just after the reset, so the retry produced the duplicate.
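The sequence is easier to reason about as a toy model. This is a sketch of the failure mode as I understood it, not anything that ran in production. Attempt and deliver are invented names, and the thirty-second default only mirrors the number in the capture.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    backend_latency_s: float        # how long the backend takes to answer
    backend_completes: bool = True  # does the backend finish the work even
                                    # after the proxy has hung up?


def deliver(attempts, proxy_idle_timeout_s=30.0):
    """Replay client attempts in order through an idle-timeout proxy.

    Returns (what the client saw per attempt, how many times the backend
    actually processed the callback).
    """
    client_events = []
    times_processed = 0
    for attempt in attempts:
        if attempt.backend_latency_s < proxy_idle_timeout_s:
            times_processed += 1
            client_events.append("response")
            break  # the client got an answer, so it stops retrying
        # No bytes within the idle window: the proxy resets the connection.
        client_events.append("RST")
        if attempt.backend_completes:
            times_processed += 1  # the work lands anyway, unseen by the client
    return client_events, times_processed
```

A slow first attempt followed by a fast retry, deliver([Attempt(31.5), Attempt(2.0)]), gives the client one RST and one response while the backend processes the callback twice. That is the duplicate.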

None of that was visible from either side's application logs, because from the application's point of view nothing interesting happened. Our side logged "request failed, retrying", which is true and useless. Their side logged the request twice and assumed the client was buggy, which is also true and useless. The proxy in the middle was the only one who knew, and it does not write to anyone's Kibana.

The fix was unglamorous, which is how you know it is the right one. They raised the idle timeout to comfortably exceed the worst-case fraud-check latency, we added an idempotency key to the callback so a duplicate delivery became a no-op, and we set our client timeout to be shorter than the proxy's so that we gave up first and controlled the retry rather than having the connection yanked out from under us. Belt and braces, but the belt alone would have done.
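For the two changes on the application side, here is a minimal sketch of the idea rather than the code either team shipped. CallbackReceiver and client_timeout are hypothetical names, the five-second margin is an arbitrary assumption, and the in-memory dict stands in for what would need to be a durable store with an atomic insert-if-absent. It also ignores the case where the duplicate arrives while the original is still being processed.

```python
class CallbackReceiver:
    """Hypothetical downstream handler that applies each idempotency key once."""

    def __init__(self):
        self._results = {}     # idempotency key -> result of the first delivery
        self.side_effects = 0  # counts how often real work was performed

    def handle(self, idempotency_key, payload):
        if idempotency_key in self._results:
            # Duplicate delivery: return the stored result, do no new work.
            return self._results[idempotency_key]
        self.side_effects += 1
        result = {"status": "applied", "amount": payload["amount"]}
        self._results[idempotency_key] = result
        return result


def client_timeout(proxy_idle_timeout_s, margin_s=5.0):
    """Pick a client timeout that expires before the proxy's idle timeout,
    so the client gives up first and owns the retry decision."""
    if margin_s <= 0 or margin_s >= proxy_idle_timeout_s:
        raise ValueError("margin must be positive and smaller than the proxy timeout")
    return proxy_idle_timeout_s - margin_s
```

With the key in place, a retried callback returns the stored result instead of applying the payment a second time, which is what makes the retry safe regardless of who hangs up first.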

why I keep coming back to it

I have access to distributed tracing, structured logging, metrics with more cardinality than I can afford, and a genuinely good APM tool. I use all of them and I am grateful for them. But every one of those is a story the software tells about itself, and when two pieces of software disagree about reality, you need an observer that does not care about either of their feelings.

tcpdump is that observer. It has been on every Unix box I have touched for my entire career, it needs nothing installed, it does not care what language your service is written in, and it reports what the network interface saw rather than what anyone hoped it saw. A capture is also wonderfully unarguable in a meeting. "The log says we sent it" invites debate. "Here is the RST, with a timestamp, arriving thirty seconds after our request, from your IP" tends to end it.

So yes, tcpdump saved me again. It usually does. If there is a lesson buried in here beyond "learn your packet tools", it is this: when two systems each insist the other is at fault, the bug is almost always in the gap between them, in some piece of infrastructure neither team thinks of as theirs. Go and capture the gap.