Ramblings of an aging IT geek
debugging

when in doubt, watch the wire

A service that claimed to be working but clearly wasn't, solved in ten minutes by tcpdump showing what was actually leaving the box.

There's a particular flavour of bug that survives every layer of logging you've ever added. The application says it sent the request. The downstream says it never arrived. Both are absolutely certain, both have logs to prove it, and neither is lying. The truth lives somewhere between them, on the wire, where your logs can't see.

This week's instance: a service that called out to an internal API, got nothing back, and timed out. The app logs said "POST to api.internal, awaiting response". The API's access log had no record of the request at all. Classic standoff. So I stopped reading logs and started reading packets.

tcpdump -i any -n -s0 -w /tmp/cap.pcap 'host api.internal and port 443'

Then I triggered the call and looked at what actually left the box.

The SYN went out. A SYN-ACK came back. The TCP handshake completed. Then the very first TLS record, the ClientHello, went out, and the answer was a TCP RST from the other end almost immediately. Not a timeout, a reset. The connection was being torn down the instant we said hello in TLS.

That narrowed it enormously. A reset right after the ClientHello usually means the two ends can't agree on how to talk. In this case the downstream had been tightened to a TLS 1.2 minimum a few days earlier, and the service in question was running an older library that, under a specific config, was still advertising TLS 1.0 as its highest version. The far end took one look, decided it didn't like the company, and hung up. None of this appeared in either application's logs because, as far as they were concerned, the connection had simply "failed", and failure is a single uninformative word.
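The version a client advertises sits at a fixed offset in that first TLS record, so it can be read straight out of the captured bytes. Here is a minimal Python sketch, assuming you have copied the start of the ClientHello record out of the capture as hex. The function name and the sample bytes are made up for illustration, and TLS 1.3 clients additionally list versions in a supported_versions extension that this does not parse.

```python
import struct

# Wire values of the two-byte version field for the older protocol versions.
TLS_VERSIONS = {
    0x0300: "SSL 3.0",
    0x0301: "TLS 1.0",
    0x0302: "TLS 1.1",
    0x0303: "TLS 1.2",
}


def client_hello_version(record: bytes) -> str:
    """Return the version in the client_version field of a ClientHello.

    Layout: 5-byte record header (type, version, length), then a 4-byte
    handshake header (message type, 3-byte length), then client_version.
    """
    if len(record) < 11:
        raise ValueError("too short to hold a ClientHello header")
    content_type, _record_version, _record_len = struct.unpack("!BHH", record[:5])
    if content_type != 22:  # 22 = handshake record
        raise ValueError("not a TLS handshake record")
    if record[5] != 1:  # 1 = ClientHello
        raise ValueError("handshake message is not a ClientHello")
    (client_version,) = struct.unpack("!H", record[9:11])
    return TLS_VERSIONS.get(client_version, f"unknown (0x{client_version:04x})")


# Hypothetical first 11 bytes of a ClientHello record, not taken from a real capture.
sample = bytes.fromhex("16030100310100002d0301")
print(client_hello_version(sample))
```

For the made-up sample above, the client_version bytes are 03 01, which is the wire value for TLS 1.0.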

The fix was a one-line config change to set the minimum TLS version on the client side too, which forced a sensible ClientHello. I confirmed it with another capture: the TLS handshake completed, application data flowed, and the API's access log finally lit up. Ten minutes, most of which was me remembering the -s0 and the right BPF filter.
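What that one line looks like depends entirely on the client library. As a sketch of the same idea, assuming a Python client built on the standard-library ssl module (an illustration only, not the library or config from this incident), the floor can be pinned on the context used for outbound connections. The helper name is hypothetical.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context that refuses anything below TLS 1.2."""
    ctx = ssl.create_default_context()
    # The one-line change: set the floor on the client so the ClientHello
    # never advertises a version the server has already ruled out.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


ctx = make_client_context()
print(ctx.minimum_version.name)
```

Pinning the minimum on the client as well as the server means a mismatch shows up as an explicit local configuration choice rather than as a reset on the wire.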

I keep coming back to the same lesson. Logs tell you what each component believed was happening. tcpdump tells you what was actually happening. When those two diverge, and in any interesting outage they will, the packets are the only honest witness in the room. It's not glamorous and the output looks like line noise until your eye tunes in, but I have lost count of the times it has turned a three-hour mystery into a ten-minute capture. Learn the filters. They pay rent.