Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

when the logs lie, tcpdump tells the truth

An intermittent timeout that every log blamed on something else, until a packet capture showed who was actually staying quiet.

A bug on a terminal

Intermittent timeout between two services. The client logged "connection timed out" and pointed at the server. The server logs showed nothing at all, no request, no error, which the client took as proof the server was broken and the server took as proof the request never arrived. Both true. Neither useful. The application logs had run out of things to say, so I stopped reading them and started watching the wire.

tcpdump does not have opinions. It shows you what actually crossed the interface, which is exactly what you need when two log files are each blaming the other.

tcpdump -ni eth0 host 10.0.4.12 and port 8080 -w /tmp/cap.pcap

I left that running on the server and waited for the intermittent failure to recur, which, being intermittent, it obligingly did about ten minutes later.

Packets and code

what the capture actually showed

The client's SYN arrived. The server sent a SYN-ACK. Then the client went quiet, no final ACK, and a few seconds later started retransmitting SYNs from scratch. So the connection was half-opening and dying in the handshake, on the client's side, despite the client being the one shouting about a broken server.

That ruled out a whole list of suspects in one go. The server was reachable. It was listening. It answered. The failure was the client never completing the handshake, which is not an application bug at all, it is the network stack or something below it eating the return packet.

Reading the SYN-ACK retransmit pattern told the rest of the story:

10:14:02.118  IP client > server: Flags [S], seq 100
10:14:02.118  IP server > client: Flags [S.], seq 900, ack 101
10:14:03.119  IP server > client: Flags [S.], seq 900, ack 101   # retransmit
10:14:05.121  IP server > client: Flags [S.], seq 900, ack 101   # again

The server was re-sending its SYN-ACK because the ACK never came back. The return path was dropping packets. In this case a flapping path through an overloaded box that was fine for small traffic and quietly shedding under bursts, which is why it only ever failed intermittently and never showed up in a healthy-looking ping.

why I keep reaching for it

The lesson is the same one every time, which is presumably why I keep relearning it. Application logs tell you what the application thinks happened. They are written by the same code that is failing, from inside the failure, so they inherit its blind spots. The client genuinely believed the server timed out, because from where it sat, that is what a missing ACK looks like.

tcpdump sits below all of that. It does not know what your code intended, it just reports the packets, and packets do not lie about whether they arrived. When two layers are pointing fingers and you cannot tell who is right, drop underneath both of them and look at the wire. It has rescued me more times than I would like to admit, and it did it again today.