Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

when nothing made sense, the wire did

A "the API is broken" incident that turned out to be a load balancer silently dropping the request body, found only by capturing the actual packets with tcpdump on both sides.

A terminal scrolling packet captures next to a flashing error

Every few months a problem comes along where every layer of logging agrees with every other layer of logging, they all say the same plausible thing, and they are all wrong. This week's flavour: a service was rejecting requests with a validation error that said a required field was missing. The client team swore blind they were sending it. Our logs showed it arriving empty. Stalemate.

The honest position when two sets of logs disagree is that neither of them is the source of truth. They are both interpretations. The bytes on the wire are the truth, and the only way to read them is to stop trusting everyone's narrative and capture the actual traffic. So out came tcpdump, as it does roughly twice a year, usually after I have wasted an hour believing the logs first.

I ran it on the service host, filtered to the client's source and our port, and asked it to dump the payloads:

tcpdump -i any -A -s0 'tcp port 8080 and host 10.4.2.17'

The request arrived. The headers were fine. The body was empty. So far that matched our logs and damned the client. Except I have been burned enough times to not stop at the first confirmation, so I ran the same capture on the client side, before their packets reached our load balancer.

A side-by-side of two captures, one full and one truncated

The client was sending the body. A full, correct, JSON body, exactly as they claimed. Between their host and ours, it vanished. That narrows the suspect list to whatever sits in the middle, which in our case was the load balancer, and a recent change to it that nobody had connected to this because it "only touched headers".

It had not only touched headers. A rule had been added that rewrote the request to strip a header the upstream did not like, and the way it had been written re-buffered the request without forwarding the body for requests above a certain size. Small payloads sailed through. Large ones arrived headers-only, body silently discarded, and our service did exactly what it should: rejected a request with a missing required field. The validation error was correct. It was just pointing at a symptom three hops upstream.

tcpdump did not fix anything. It never does. What it did was end the argument, because once you can show both captures side by side there is nothing left to debate. The body left the client. It did not arrive at the service. The thing in between ate it. From there the fix was somebody else's config change to revert, and a note to never again trust two logs that agree until I have watched the packets myself.

I keep meaning to learn the wider tooling properly, the Wireshark dissectors and the rest. In practice the same three flags rescue me each time: -A to see the payload, -s0 so it is not truncated, and a tight filter so I am not drowning. That is usually enough to find the moment the story stops matching the bytes.