Ramblings of an aging IT geek
debugging

it was the mtu, it's always the mtu

A backup job that hung at roughly the same point every night turned out to be a path MTU problem hiding behind a working ping.

The symptom was maddening because everything looked fine. SSH worked. ping worked. Small scp transfers worked. But the nightly backup over the same link would crawl, then hang, always around the same point. Not a fixed byte count, but always once the connection got busy and the packets got big.

Big packets. That's the tell, and I should have spotted it sooner. Small things go through, large things don't, and a plain ping is small. So I asked it to stop being small:

ping -M do -s 1472 backup.internal

-M do sets the don't-fragment bit, and -s 1472 plus 28 bytes of headers (20 for IPv4, 8 for ICMP) gives you a 1500-byte IP packet. It failed instantly with "message too long". Drop it to 1400 and it sails through. There was a tunnel in the path eating into the MTU, and nothing was telling the endpoints to back off, because the ICMP "fragmentation needed" messages were being firewalled into the void. Classic path MTU discovery black hole.
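If you want the actual number rather than "somewhere between 1400 and 1500", the same ping can be driven by a binary search. What follows is a rough sketch, not what I ran that night: it assumes Linux iputils ping, IPv4 with no IP options, and the function names (payload_for_mtu, probe, find_path_mtu) are made up for illustration.

```shell
#!/bin/sh
# Sketch: find the largest IPv4 packet that crosses a path unfragmented.
# Assumes Linux iputils ping (-M do, -s, -c, -W); all names are hypothetical.

# ICMP echo payload that produces an IP packet of the given size:
# subtract the 20-byte IPv4 header and the 8-byte ICMP header.
payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send one don't-fragment ping of the given packet size.
# Succeeds only if a reply comes back within 2 seconds.
probe() {   # usage: probe HOST PACKET_SIZE
    ping -M do -c 1 -W 2 -s "$(payload_for_mtu "$2")" "$1" >/dev/null 2>&1
}

# Binary search between a size expected to work (LO) and an upper bound (HI).
find_path_mtu() {   # usage: find_path_mtu HOST [LO] [HI]
    host=$1 lo=${2:-576} hi=${3:-1500}
    if ! probe "$host" "$lo"; then
        echo "even $lo bytes failed, probably not an MTU problem" >&2
        return 1
    fi
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))   # round up so the loop always shrinks
        if probe "$host" "$mid"; then
            lo=$mid                    # mid fits, look higher
        else
            hi=$(( mid - 1 ))          # mid is too big, look lower
        fi
    done
    echo "$lo"
}
```

Running find_path_mtu backup.internal prints the largest packet size that got a reply. Each failed probe waits out its timeout, so expect it to take several seconds on a black-holed path.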

The fix was a one-liner clamping the MSS on the gateway, and the backups have run clean every night since. I have lost more hours of my life to MTU than to almost any other single cause, and every time it presents as something else entirely: a slow database, a flaky VPN, a website that loads everything but the big images. So now it is the first thing I check, not the last. If a connection works for small payloads and dies on large ones, stop guessing. It was the MTU. It's always the MTU.
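For the record, the clamp looks something like the sketch below. It assumes a Linux gateway running iptables and plain IPv4 TCP, so the MSS is the path MTU minus 40 bytes of headers; your gateway may well use something else. The helper names are made up, and the script only prints the rule so you can read it before running anything as root.

```shell
#!/bin/sh
# Sketch: derive an MSS from a measured path MTU and print a clamp rule.
# Assumes a Linux gateway with iptables; function names are hypothetical.

# Largest TCP segment that fits the path for IPv4 without options:
# subtract the 20-byte IPv4 header and the 20-byte TCP header.
mss_for_mtu() {
    echo $(( $1 - 40 ))
}

# Print (do not apply) a rule that rewrites the MSS on forwarded SYN packets.
clamp_rule() {   # usage: clamp_rule PATH_MTU
    echo "iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss $(mss_for_mtu "$1")"
}
```

With a path MTU of 1400, clamp_rule 1400 prints a rule ending in --set-mss 1360. The TCPMSS target also has --clamp-mss-to-pmtu, which lets the kernel pick the value instead of hard-coding one.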