Ramblings of an aging IT geek

it was the mtu, it's always the mtu

A connection that handshook fine and then hung mid-transfer turned out to be a classic MTU mismatch over a tunnel, with broken path MTU discovery hiding it.


The symptom was the worst kind: it worked. Mostly. You could SSH into the box, you could run ls, you could read a short file. Then you tried to scp anything of size, or cat a real log, and the session froze stone dead halfway through. Ping was perfect. The TCP handshake completed instantly. Small requests flew. Large ones vanished into the void.

I want to write this down because I lost the best part of an afternoon to it, and because the answer was the answer it is always going to be. It was the MTU. It is always the MTU.

the shape of the problem

The thing that should have tipped me off sooner was the asymmetry between "connects fine" and "dies under load". When the handshake works but bulk transfer hangs, you are almost never looking at a firewall rule or a routing problem, because those tend to fail cleanly and immediately. You are looking at something that only bites once the packets get big.

Small packets fit. The SYN, the SYN-ACK, the first few hundred bytes of an interactive session: all tiny, all under any plausible MTU, all delivered. Then TCP ramps up, starts sending full-size segments to fill the window, and those full-size segments are too big to cross some link in the path. They get dropped. Silently. The connection stalls, retransmits the same too-big packet, gets it dropped again, and sits there until something times out. From the user's chair it just hangs.

proving it

You do not have to guess at this. Ping with the don't-fragment bit set and a payload size, and walk it down until packets start getting through:

ping -M do -s 1472 10.0.0.5

1472 bytes of payload plus 28 bytes of headers (20 IP, 8 ICMP) is 1500, the standard Ethernet MTU. A tidy failure would at least have come straight back with an error, something like:

ping: local error: Message too long, mtu=1500

No such luck. This was worse: the packet went out and nothing came back at all. No error, no reply, just the frozen version. Drop the size and try again:

ping -M do -s 1422 10.0.0.5

That one worked. 1422 plus 28 is 1450, so somewhere in the path the usable MTU was around 1450, not 1500. The missing fifty-odd bytes are the tell. That is roughly the overhead of a tunnel header, and this traffic was crossing a GRE tunnel between two sites. GRE eats about 24 bytes, IPsec more, and nobody had told the interfaces.
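Walking the size down by hand gets old quickly, so here is a rough sketch of the same idea as a binary search. The function names probe and find_path_mtu are mine, not from any tool, and it assumes Linux iputils ping (the -M do, -c and -W flags). Treat it as a starting point rather than a finished utility.

```shell
# probe SIZE HOST: succeed if a single don't-fragment ping carrying SIZE
# payload bytes gets a reply from HOST within one second.
probe() {
  ping -M do -c 1 -W 1 -s "$1" "$2" >/dev/null 2>&1
}

# find_path_mtu HOST [LOW] [HIGH]: binary-search payload sizes between LOW
# and HIGH, then print the largest working packet size (payload + 28 bytes
# of IP and ICMP headers). Returns 1 if even the LOW payload gets no reply.
find_path_mtu() {
  local host=$1 lo=${2:-0} hi=${3:-1472} mid
  probe "$lo" "$host" || return 1
  while [ "$lo" -lt "$hi" ]; do
    # Round up so the loop always makes progress.
    mid=$(( (lo + hi + 1) / 2 ))
    if probe "$mid" "$host"; then
      lo=$mid            # mid fits, so the answer is at least mid
    else
      hi=$(( mid - 1 ))  # mid is too big, so the answer is below mid
    fi
  done
  echo $(( lo + 28 ))
}
```

Calling find_path_mtu 10.0.0.5 prints the largest packet size that got a reply. It sends one ping per step, so a packet lost for unrelated reasons will make the result come out too small; run it a couple of times before believing the number.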


The reason this is invisible day to day is path MTU discovery. In a healthy network the router that cannot forward the oversized packet sends back an ICMP "fragmentation needed" message, the sending host shrinks its segment size, and everything carries on a little smaller and none the wiser. The whole mechanism is built to handle exactly this. It works beautifully, right up until somebody, somewhere, decides ICMP is scary and blocks it at a firewall.

And of course somebody had. The ICMP "fragmentation needed" packets were being dropped by an overzealous rule a couple of hops away, so the sender never got the memo. It kept cheerfully firing 1500-byte packets into a 1450-byte tunnel, and the tunnel kept quietly binning them. This is the famous PMTUD black hole, and it is responsible for more "the website loads but the download hangs" tickets than I care to count.
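If you want to see the black hole rather than infer it, one option is to watch for the missing ICMP message on the sending host while you reproduce the hang. This is a sketch, assuming tcpdump is installed and you have root; the variable and function names are made up for illustration, and the interface you pass in is whichever one faces the tunnel on your box.

```shell
# pcap filter matching ICMP destination unreachable (type 3) with code 4,
# which is "fragmentation needed and DF set".
FRAG_NEEDED_FILTER='icmp[icmptype] = icmp-unreach and icmp[icmpcode] = 4'

# watch_frag_needed IFACE: print matching packets seen on IFACE until
# interrupted. -n skips DNS lookups so output appears immediately.
watch_frag_needed() {
  tcpdump -n -i "$1" "$FRAG_NEEDED_FILTER"
}
```

Run watch_frag_needed against the outward-facing interface in one terminal and repeat the oversized don't-fragment ping in another. If the ping gets no reply and the capture stays silent, the "fragmentation needed" messages are probably being filtered somewhere upstream.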

the fixes, least to most polite

The blunt fix is to lower the MTU on the tunnel interface so it stops trying to send packets that cannot fit:

ip link set dev gre0 mtu 1400

That works and it is honest. Everything sent over that interface is now small enough to cross the path without fragmentation, ICMP or no ICMP.
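For picking the number, the arithmetic is just the underlay MTU minus whatever the encapsulation adds. A small sketch, with a hypothetical helper name and two overhead figures: the 24 bytes for basic GRE mentioned above, and 50 bytes for VXLAN over IPv4.

```shell
# tunnel_mtu UNDERLAY_MTU OVERHEAD: largest inner packet that still fits
# once OVERHEAD bytes of encapsulation are wrapped around it.
tunnel_mtu() {
  echo $(( $1 - $2 ))
}

GRE_OVERHEAD=24    # 20-byte outer IPv4 header + 4-byte basic GRE header
VXLAN_OVERHEAD=50  # 14 inner Ethernet + 8 VXLAN + 8 UDP + 20 outer IPv4

tunnel_mtu 1500 "$GRE_OVERHEAD"    # prints 1476
tunnel_mtu 1500 "$VXLAN_OVERHEAD"  # prints 1450
```

The 1400 used above sits comfortably below both, which leaves headroom for extra layers such as IPsec, whose overhead varies with the cipher and mode in use.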

The cleverer fix, and the one that survives someone forgetting the interface MTU later, is to clamp the TCP maximum segment size at the router so every TCP session negotiates a sane segment size up front:

iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --clamp-mss-to-pmtu

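MSS clamping rewrites the MSS option in the SYN as it passes through, so both ends agree to keep their segments small enough to fit before they ever send a full-size packet. It only helps TCP, but TCP is what was hurting, and it does not depend on ICMP getting through. One caveat: --clamp-mss-to-pmtu works from the path MTU the router itself knows about, so the router's own view of the tunnel has to be right for the clamp to land on the right number.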
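The segment size that fits is the MTU minus the IP and TCP headers. A quick sketch of that sum, assuming no IP or TCP options; the helper name is mine.

```shell
# mss_for_mtu MTU [IPVER]: TCP payload bytes that fit in one packet of size
# MTU, assuming option-free headers: 20 IP + 20 TCP for IPv4, 40 IP + 20 TCP
# for IPv6. IPVER defaults to 4.
mss_for_mtu() {
  case "${2:-4}" in
    4) echo $(( $1 - 40 )) ;;
    6) echo $(( $1 - 60 )) ;;
    *) echo "unknown IP version: $2" >&2; return 1 ;;
  esac
}

mss_for_mtu 1500    # prints 1460, the usual Ethernet MSS
mss_for_mtu 1450    # prints 1410, what fits the 1450-byte path
mss_for_mtu 1450 6  # prints 1390
```

If you would rather pin the value than let the router work it out, the TCPMSS target also accepts --set-mss with an explicit number in place of --clamp-mss-to-pmtu; 1410 would be the figure for the path measured here.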

The properly polite fix is to go and find whoever is dropping ICMP type 3 code 4 and have a quiet word, because they have broken a load-bearing part of the internet to feel safe. But that is a longer conversation and usually involves a change request.

the lesson, again

Every few months I convince myself I have internalised this and will spot it instantly next time. I never do. The shape is always slightly different: a VPN, a GRE tunnel, a cloud overlay network, jumbo frames on one switch and not the next, a VXLAN somebody stood up last week. The symptom is identical every time. Handshake fine, small stuff fine, big transfers hang. Ping with don't-fragment, walk the size down, watch where it falls over, count the missing bytes.

It is always the MTU. Write it on a sticky note. I have.