When in doubt, look at the wire

A terminal mid-capture, packets scrolling past a bug icon

Once every few hundred requests, a call from one of our services to another would time out. Not slow, not an error code, just nothing back until the client gave up. Both sides logged cheerfully that everything was fine. The caller logged "request sent". The callee, when you could correlate it at all, logged "request handled, response written". Two services, both certain they had done their job, and a user staring at a spinner. When the application layer is unanimous that nothing is wrong and something obviously is, it is time to stop believing the application layer and go look at the wire.

The logs are an opinion; the packets are the truth

This is the thing I keep relearning. Logs tell you what the application thinks happened. They are written by the same code whose behaviour you are doubting, so they share its blind spots. If the code believes it sent a response, it logs that it sent a response, whether or not a single byte made it onto the network. tcpdump does not have opinions. It tells you what was actually on the wire, in what order, with what flags, and when. For a problem that lives in the gap between two services, that is the only honest witness.

So I got a capture on the callee, filtered down to the conversation I cared about, and waited for it to misbehave.

tcpdump -i any -nn -w /tmp/svc.pcap \
  'host 10.0.4.21 and port 8080'

I captured to a file rather than to the terminal so that I could open it in Wireshark later and follow streams properly. -i any listens on every interface, and -nn stops tcpdump wasting time on DNS and port-name lookups. Then I left it running until the timeout reproduced, which on a busy service took only a few minutes.

A close-up of a captured stream showing the RST mid-conversation

What the capture actually showed

A healthy request looked exactly as you would hope. SYN, SYN-ACK, ACK, the request bytes, the response bytes, a clean FIN on both sides. Textbook.

The failing one started the same way. Connection established, request sent, and then, before the response came back, an RST. A reset. Something had torn the connection down mid-flight, and crucially the reset did not come from either application. The callee's own logs showed it writing the response after the time the reset appeared on the wire, which is the tell: the application was happily writing into a socket that had already been killed underneath it. That is why both sides logged success. From the callee's point of view it wrote the response. From the caller's point of view it never arrived. Both were telling the truth about their half of a connection that no longer existed.
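The difference between the two streams can be sketched as a small classifier. This is a rough Python illustration, not something from the investigation itself: it assumes the packets of one connection have already been decoded into (direction, flags, payload length) tuples, and the function name, the tuple layout, and the label strings are all made up for the example.

```python
from typing import List, Tuple

# One observed segment: (direction, flags, payload_len).
# direction is "c2s" (caller to callee) or "s2c" (callee to caller).
# flags uses tcpdump-style letters: S=SYN, F=FIN, R=RST, P=PSH, "."=ACK.
Segment = Tuple[str, str, int]


def classify_stream(segments: List[Segment]) -> str:
    """Label one connection from its segments, given in capture order."""
    request_seen = False
    response_seen = False
    fins = set()
    for direction, flags, payload_len in segments:
        if "R" in flags:
            # A reset after the request but before any response bytes
            # is the failure described above.
            if request_seen and not response_seen:
                return "reset before response"
            return "reset"
        if payload_len > 0:
            if direction == "c2s":
                request_seen = True
            else:
                response_seen = True
        if "F" in flags:
            fins.add(direction)
    if response_seen and fins == {"c2s", "s2c"}:
        return "clean close"
    return "incomplete"
```

Run over every stream in a capture, a check like this would separate the textbook conversations from the ones that were cut off mid-flight, which is the same sorting I did by eye in Wireshark.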

So who sent the reset? Not the two services. That narrowed it to the things in between that an application never sees and rarely thinks about: a load balancer, a firewall, conntrack, anything keeping per-connection state.

The middle of the network has opinions too

The pattern, once I had it in front of me, was an idle-timeout reset. The failing requests were the ones where the callee took a little longer to respond, usually because the request happened to need a slower path. When the response took longer than a certain window, a stateful device in the middle decided the connection had gone idle, dropped it from its table, and reset it. Fast responses beat the timer and were fine. Slow ones tripped it. That is exactly the "works under light load, fails when busy or slow" signature that makes these bugs so good at hiding, and exactly the kind of thing no amount of staring at application code will ever reveal, because the application is not the one sending the reset.

The fix had two parts, and I would not have known which knobs to turn without the capture. First, the immediate relief: the stateful device's idle timeout for that path was shorter than our slowest legitimate response, so it was raised to sit comfortably above it. Second, the proper hardening: we enabled TCP keepalives on the connection so that even an otherwise-idle connection sends the occasional packet, resets the middlebox's idle timer, and stops it deciding the connection is dead.

net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 4

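These sysctls only set the timers. They apply to sockets that have SO_KEEPALIVE switched on, so the application or its client library still has to enable keepalive on its connections. With these values a silent connection sends a probe after sixty seconds of idleness, which is nothing to the network and plenty to convince a middlebox the connection is alive, provided sixty seconds is shorter than the middlebox's idle timeout.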
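On Linux, keepalive probes are only sent on sockets that opt in through SO_KEEPALIVE, and the timers can also be overridden per socket. Here is a minimal Python sketch of that per-socket side, mirroring the three values above. The helper name enable_keepalive is hypothetical, the TCP_KEEP* options are platform-dependent (the sketch skips any that the running platform does not expose), and none of this is the actual code from the services in this story.

```python
import socket


def enable_keepalive(sock, idle=60, interval=15, probes=4):
    """Switch TCP keepalive on for one socket and set its timers where possible."""
    # Without SO_KEEPALIVE the kernel never sends probes on this socket,
    # whatever the system-wide sysctls say.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket overrides of the system-wide defaults. These constants
    # are not defined on every platform, so look each one up defensively.
    overrides = (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before giving up
    )
    for name, value in overrides:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

Setting the timers per socket keeps the change scoped to the connections that cross the middlebox, rather than altering the default for every keepalive-enabled socket on the host.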

Reading a capture without drowning in it

The thing that puts people off tcpdump is the wall of output, and it is a fair worry: capture everything on a busy interface and you will get megabytes of noise in seconds, most of it irrelevant. The trick is to filter hard at capture time, before the packets ever hit disk, using a BPF expression that is as specific as you can make it. The host and port primitives get you most of the way. Append "and tcp[tcpflags] & tcp-rst != 0" to the expression when you only care about resets, and the capture shrinks to just the moments things went wrong.
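For anyone scripting their captures, the filter can be assembled programmatically. The following is a small Python sketch under stated assumptions: the function names and the /tmp/rst.pcap path in the usage note are invented for illustration, while the tcpdump flags and the host and port are the ones used earlier in this post.

```python
from typing import List


def capture_filter(host: str, port: int, resets_only: bool = False) -> str:
    """Build a BPF expression for one conversation, optionally resets only."""
    expr = f"host {host} and port {port}"
    if resets_only:
        # Matches any TCP segment with the RST bit set.
        expr += " and tcp[tcpflags] & tcp-rst != 0"
    return expr


def tcpdump_argv(out_path: str, host: str, port: int,
                 resets_only: bool = False) -> List[str]:
    """Return an argv list for tcpdump, with the filter as a single argument."""
    return [
        "tcpdump", "-i", "any", "-nn", "-w", out_path,
        capture_filter(host, port, resets_only),
    ]
```

Passing the result of tcpdump_argv("/tmp/rst.pcap", "10.0.4.21", 8080, resets_only=True) to subprocess.run as a list avoids shell quoting problems with the brackets and ampersand in the expression. Running it still needs tcpdump installed and enough privilege to capture.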

For following an actual conversation, though, I almost always capture to a .pcap and open it in Wireshark afterwards rather than squinting at the terminal. "Follow TCP Stream" reassembles the two directions into a readable transcript, and the flag and timing columns make a mid-flight reset jump out immediately. tcpdump is the right tool to capture on a server where you cannot run a GUI; Wireshark is the right tool to read what you captured. They are a pair, and using them together is far less painful than trying to parse hex on a serial console at midnight.

One small habit that has saved me repeatedly: capture on both ends when you can. If one end records a reset arriving from its peer and the peer's own capture shows no such packet leaving, the reset was injected somewhere in between, which is exactly how I knew a middlebox was involved here rather than either application. A single-ended capture would have shown me the reset but left me guessing where it came from. Two captures, lined up by timestamp, drew an arrow straight at the thing in the middle.
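The two-capture comparison can be written down as a short check. This Python sketch assumes several things that are not from the original investigation: packets have already been decoded into simple records, both hosts' clocks are roughly in sync, there is no NAT rewriting addresses between the hosts, and the names Packet and find_injected_resets are invented for the example.

```python
from typing import Dict, List, NamedTuple


class Packet(NamedTuple):
    ts: float    # capture timestamp in seconds
    src: str     # source IP as recorded in the capture
    sport: int
    dst: str     # destination IP as recorded in the capture
    dport: int
    flags: str   # tcpdump-style flag letters, e.g. "S", "P.", "R"


def find_injected_resets(captures: Dict[str, List[Packet]],
                         tolerance: float = 1.0) -> List[Packet]:
    """Return resets that arrived at a host but were never sent by their claimed source.

    captures maps each host's IP to the packets captured on that host.
    """
    injected = []
    for host, packets in captures.items():
        for pkt in packets:
            # Only consider resets arriving at the capturing host.
            if "R" not in pkt.flags or pkt.dst != host:
                continue
            sender_capture = captures.get(pkt.src)
            if sender_capture is None:
                continue  # no capture from the claimed sender, cannot judge
            really_sent = any(
                "R" in p.flags
                and (p.src, p.sport, p.dst, p.dport)
                == (pkt.src, pkt.sport, pkt.dst, pkt.dport)
                and abs(p.ts - pkt.ts) <= tolerance
                for p in sender_capture
            )
            if not really_sent:
                injected.append(pkt)
    return injected
```

Any packet this returns is a reset that one end received and the other end never put on the wire, which is the signature of something in the middle forging it.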

Why I keep coming back to it

There is nothing clever about tcpdump. It is old, it is everywhere, and it does one thing: it shows you what is really happening on the network, unmediated by anyone's idea of what should be happening. Every time I have a problem that lives between two components, where each one swears it is innocent, the answer is the same. Stop reading the logs, which are just both programs' opinions of themselves, and capture the packets, which are the only neutral account of events you will get.

It has saved me from chasing imaginary application bugs more times than I can count. This was just the most recent. When two services disagree about reality, the wire is the referee.