Ramblings of an aging IT geek
debugging

the connection that hung at exactly the wrong size

A tunnel where small requests worked and large ones hung forever turned out to be an MTU and path-MTU-discovery problem, as it always is.


The bug looked impossible. Over a new VPN tunnel, curl against an internal API returned tiny responses instantly but hung forever on anything large. ping was fine. SSH connected, then froze the moment a directory listing scrolled. Small things worked, big things did not, and nothing logged an error anywhere.

You learn to smell this one after a while. When the failure correlates with payload size rather than host or port, it is the MTU. It is always the MTU. Somewhere in the path a link could not carry full-sized packets. The ICMP "fragmentation needed" messages that should have told the sender to back off were being silently dropped by a firewall, so path MTU discovery failed without a trace. Large packets went into the tunnel and never came out.
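To confirm the ICMP half of that diagnosis, watch for "fragmentation needed" messages (ICMP type 3, code 4) on the sending host while reproducing the hang. A sketch, assuming tcpdump is installed and using eth0 as a placeholder for the outward-facing interface:

```shell
# Print only ICMP "destination unreachable: fragmentation needed" messages.
# -n skips name lookups; eth0 is a hypothetical interface name.
tcpdump -ni eth0 'icmp[icmptype] == icmp-unreach and icmp[icmpcode] == 4'
```

If large transfers stall and this capture stays silent, something upstream is eating the messages PMTUD depends on. Capturing on both sides of a suspect firewall narrows down where. For IPv6 the equivalent signal is ICMPv6 Packet Too Big.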

The proof is a one-liner. Ping with the don't-fragment bit set and a payload sized so the whole IP packet comes to exactly the standard 1500-byte Ethernet MTU:

ping -M do -s 1472 10.0.0.1   # 1472 + 8 (ICMP) + 20 (IP) = 1500 bytes: fails
ping -M do -s 1400 10.0.0.1   # 1400 + 28 = 1428 bytes: succeeds

The big one failed, the smaller one got through, and that was the whole story. The tunnel's effective MTU was lower than 1500 because of the encapsulation overhead, and PMTUD was broken because the ICMP that should have reported it was filtered. I clamped the TCP MSS on the tunnel interface so connections would negotiate a segment size that actually fit, and everything that had hung came back to life at once.
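On a Linux box that routes traffic into the tunnel, one common way to clamp the MSS is an iptables TCPMSS rule in the mangle table. This is a sketch under assumptions: tun0 is a hypothetical interface name, and the rule relies on the tunnel interface's own MTU being set correctly, because --clamp-mss-to-pmtu derives the MSS from the route MTU of the outgoing interface.

```shell
# Rewrite the MSS option on TCP SYNs forwarded out through the tunnel.
# tun0 is a placeholder; FORWARD assumes this host routes traffic for others.
iptables -t mangle -A FORWARD -o tun0 -p tcp --tcp-flags SYN,RST SYN \
    -j TCPMSS --clamp-mss-to-pmtu
```

If the interface MTU can't be trusted, --set-mss with an explicit value does the same job with a fixed number; the path MTU minus 40 bytes of IPv4 and TCP headers is the usual ceiling.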

Twenty years of networking and this exact fault has cost me, conservatively, a week of my life in aggregate. Write the ping test on a sticky note. It will pay for itself.
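If the sticky note goes missing, the same test fits in a few lines of shell that bisect the largest payload that crosses the path with DF set, then turn it into an MSS. A sketch, assuming Linux iputils ping (for -M do and -W) and IPv4 with no IP or TCP options; the function names probe, max_payload and mss_for_payload are made up for illustration.

```shell
#!/bin/sh
# Sketch: find the largest DF ping payload that gets through, then derive an MSS.
# Assumes Linux iputils ping and IPv4 without options.

# probe HOST SIZE: succeed if one DF ping with SIZE payload bytes gets a reply.
probe() {
    ping -M do -c 1 -W 1 -s "$2" "$1" >/dev/null 2>&1
}

# max_payload HOST LOW HIGH: largest SIZE in [LOW, HIGH] for which probe
# succeeds, assuming LOW works and every size above the cutoff fails.
max_payload() {
    host=$1 lo=$2 hi=$3
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))   # round up so the range always shrinks
        if probe "$host" "$mid"; then
            lo=$mid                    # mid fits: answer is mid or larger
        else
            hi=$((mid - 1))            # mid too big: answer is below mid
        fi
    done
    echo "$lo"
}

# mss_for_payload SIZE: TCP MSS matching the measured path.
mss_for_payload() {
    mtu=$(($1 + 28))       # payload + 8-byte ICMP + 20-byte IP header
    echo $((mtu - 40))     # minus 20-byte IP + 20-byte TCP header
}
```

For example, max_payload 10.0.0.1 0 1472 prints the largest working payload, and mss_for_payload on that result gives a value for --set-mss. Bisection needs at most eleven pings over that range instead of one per size, though a single lost reply will be misread as "too big", so rerun it on a lossy link.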