Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

when in doubt, put it on the wire and watch

A service that swore it was sending requests and a backend that swore it never received them, resolved by the oldest trick there is: capturing the actual packets with tcpdump.

A terminal full of tcpdump output scrolling past

There is a particular flavour of bug where two systems each insist, with complete sincerity, that the other one is lying. Our service logged that it was POSTing to an internal API. The internal API logged nothing, no request, no error, no 4xx, just silence. Both teams produced logs. Both sets of logs were correct. That is the worst kind of standoff, because everyone is telling the truth and the bug is in the gap between the two truths.

When application logs disagree, I stop trusting application logs. They tell you what the code thinks happened, which is precisely the thing in question. So I did what I always end up doing eventually, and went one layer down to watch the actual packets.

tcpdump -ni eth0 'host 10.0.4.12 and port 8080' -A

Nothing. Not a single packet to the backend on that port. Our service genuinely believed it was sending, the log line fired and everything, but no SYN ever left the box for 10.0.4.12:8080. So the request was not being dropped on the network and it was not being rejected by the backend. It was never going where we thought.

tcpdump showing connections to the wrong address

Widening the filter to just the host, without the port, told the whole story in one line. There were packets to 10.0.4.12, but to port 80, not 8080. The client was connecting to the wrong port and getting a connection refused, which the HTTP library swallowed into a retry that also failed, and the log line that said "sending request" fired before any of that actually happened. So our log was true in the narrowest possible sense and useless in every practical one.

The root cause was dull, as they always are once you can see them. A config value for the backend port had been set in two places, an environment variable and a config file, and a recent deploy had started reading the file, which still said 80 from some ancient default. The env var that everyone assumed was authoritative was being ignored. No amount of staring at the application would have shown me that. The packets showed me in seconds.

I reach for tcpdump more than almost any other tool when something networky goes wrong, and it has never once lied to me, because it cannot. It is not reporting an intention or a log statement or what a library meant to do. It is showing the bytes that did or did not leave the machine. When your two systems are each certain the other is at fault, the wire is the referee, and the wire does not have a side. Capture first, theorise second. It has saved me so many times that I no longer feel clever when it works, just slightly embarrassed that I spent twenty minutes reading logs before I ran it.