Ramblings of an aging IT geek
debugging

it was the mtu, it's always the mtu

A day lost to a tunnel that worked for small requests and hung for large ones, and the path MTU discovery black hole that was quietly eating the packets.


The symptom was the kind that makes you doubt your own senses. A service behind a new VPN tunnel worked perfectly for some requests and hung stone dead for others, with no error, no reset, just a connection that established, sent a bit, and then sat there until something gave up. curl against a small endpoint: instant. curl against an endpoint that returned a few kilobytes of JSON: hang. Same host, same port, same tunnel. The only difference was size.

If you've done this for a while, the back of your neck is already prickling, because you know how this ends. It was the MTU. It's always the MTU. But knowing the genre doesn't spare you the act, so here is how the day actually went.

the wrong suspects

I started where everyone starts, which is to say I blamed the application. Maybe a timeout, maybe a buffer, maybe some streaming write that stalled on a particular response shape. I read the handler. I added logging on both ends. The server cheerfully logged that it had written the full response and closed the connection. The client logged that it had received the headers and then nothing. So the bytes left one box and did not arrive at the other, and the application on both ends believed it had done its job. That is the moment you stop looking up the stack and start looking down it.

packets don't lie

tcpdump is the tool that ends these arguments. On the sending side I could see the large response going out, segment after segment. On the receiving side, the first few segments arrived, an ACK went back, and then silence. The sender retransmitted the next segment. Silence. Retransmitted again. Silence. The connection wasn't reset; it was starving, retransmitting the same segment into a hole and never getting an answer.


The tell was the size. The segments that got through were small. The one that vanished, every time, was full-sized: a 1500-byte IP packet, the most a standard Ethernet link will carry. That fits fine on a normal Ethernet path. It does not fit through a tunnel, because the tunnel wraps every packet in its own headers and the usable payload shrinks by exactly that overhead. The effective MTU inside the tunnel is lower, often 1400 or so, sometimes less.
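The arithmetic is simple enough to sketch. This is a minimal illustration, not a measurement: the function name is made up, and the overhead is an input because it depends on the tunnel type, which varies. The 24-byte figure below is plain GRE over IPv4 (20 bytes of outer IP header plus 4 of GRE), used only as a familiar example.

```shell
# Effective MTU inside a tunnel = outer path MTU minus the bytes of
# encapsulation the tunnel adds to every packet.
# inner_mtu is a hypothetical helper; both arguments are in bytes.
inner_mtu() {
    outer_mtu=$1
    overhead=$2
    echo $((outer_mtu - overhead))
}

# Plain GRE over IPv4 on a 1500-byte path: 20 (outer IP) + 4 (GRE) = 24
inner_mtu 1500 24    # prints 1476

# A tunnel adding 100 bytes (illustrative figure) lands on 1400
inner_mtu 1500 100   # prints 1400
```

Encrypted tunnels add more than GRE does, which is how you end up at 1400 or below.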

What is supposed to happen is path MTU discovery. The host sends a big packet with the don't-fragment bit set. A router that can't forward it drops it and sends back an ICMP "fragmentation needed" message that says, in effect, "too big, max is 1400", and the sender backs off to a smaller size. The whole mechanism is elegant and it depends entirely on that ICMP message getting home.
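One way to check whether that message ever arrives is to capture on the sender and filter for it. A minimal sketch, assuming Linux with tcpdump installed and IPv4 traffic; the variable and function names are illustrative, and the interface is an argument because it depends on your host.

```shell
# pcap filter for ICMP destination unreachable (type 3) with code 4,
# which is "fragmentation needed and don't-fragment set".
FRAG_NEEDED_FILTER='icmp[icmptype] == icmp-unreach and icmp[icmpcode] == 4'

# Hypothetical wrapper: nothing is captured until you call it.
# $1 is the interface to listen on. Needs root.
watch_frag_needed() {
    tcpdump -ni "$1" "$FRAG_NEEDED_FILTER"
}
```

Run something like `watch_frag_needed eth0` (the interface name is an assumption) while reproducing the hang. If the large packets go out and this capture stays empty, the message is being lost somewhere on the way back.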

the black hole

It wasn't getting home. Somewhere on the path, a firewall was dropping ICMP, because someone years ago decided ICMP was "ping" and ping was "a security risk" and blocked the lot. So the sender fired off its 1500-byte packet, the tunnel ingress refused it for being too large, sent back the ICMP "too big" that would have fixed everything, and that ICMP died at a firewall. The sender never learned. It kept sending full-sized packets into a path that would not carry them, retransmitting forever, while small packets sailed through and made everything look almost healthy. This is a path MTU discovery black hole, and it is one of the most reliably miserable failure modes in networking precisely because it half-works.

You can prove it in one line. Ping with the don't-fragment bit set and a payload that pushes you over the tunnel's limit:

ping -M do -s 1472 10.0.0.5
# 1472 bytes of payload + 28 bytes of ICMP and IP headers = a 1500-byte packet
# in a black hole this gets no reply and no error, just a timeout

Step the size down and find the cliff. Below it, replies. Above it, nothing. The largest payload that still gets a reply, plus 28 bytes of IP and ICMP headers, is your real path MTU. Mine came out at 1400, not the 1500 the interface was advertising.
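Stepping down by hand gets old quickly, so the search can be scripted. This is a sketch under assumptions: the function names are made up, the ping flags are the Linux iputils ones used above, and it takes on trust that the low payload gets through and the high one does not.

```shell
# One don't-fragment ping with a 1-second timeout (Linux iputils flags).
# Succeeds only if a reply comes back. $1 = payload size, $2 = host.
probe() {
    ping -M do -c 1 -W 1 -s "$1" "$2" >/dev/null 2>&1
}

# Binary search for the largest ICMP payload that gets a reply.
# $1 = host, $2 = payload assumed to work, $3 = payload assumed to fail.
# Prints the path MTU: largest working payload + 28 (20 IPv4 + 8 ICMP).
find_path_mtu() {
    host=$1
    lo=$2
    hi=$3
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if probe "$mid" "$host"; then
            lo=$mid      # mid got a reply, the cliff is above it
        else
            hi=$mid      # mid vanished, the cliff is at or below it
        fi
    done
    echo $((lo + 28))
}
```

Called as `find_path_mtu 10.0.0.5 1000 1472`, it needs about nine probes to narrow the range. A single lost packet on a flaky link will read as "too big", so treat a surprising answer as a reason to run it again rather than as gospel.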


fixing it without begging the firewall team

The clean fix is to stop dropping ICMP "fragmentation needed", because that message is not an attack, it is the network trying to do its job. But the firewall belonged to another team and "please stop blocking ICMP type 3 code 4" is a conversation measured in weeks. So I reached for the pragmatic fix that lives on my own kit: MSS clamping.

Each end of a TCP connection announces its maximum segment size in its SYN at handshake. If you clamp that value down to fit the tunnel as the SYN passes through, neither end ever tries to send a segment too big to pass, and path MTU discovery never needs to fire for TCP at all. On the box that forwards traffic into the tunnel:

iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --clamp-mss-to-pmtu

--clamp-mss-to-pmtu sets the MSS from the MTU of the outgoing route, minus 40 bytes of IPv4 and TCP headers, so I didn't even have to hardcode the magic number. It can only do that if the tunnel interface on that box carries the tunnel's real MTU. The hang disappeared instantly. Large responses flowed. The capture showed full-sized-for-the-tunnel segments, all acknowledged, no retransmit storm.
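For the record, the number being clamped to is just the MTU minus the IP and TCP headers. A minimal sketch with a made-up function name, assuming headers without options:

```shell
# MSS that fits a given MTU: subtract the IP and TCP headers (no options).
# IPv4: 20 + 20 = 40 bytes. IPv6: 40 + 20 = 60 bytes.
# $1 = MTU in bytes, $2 = IP version (4 or 6, defaults to 4).
mss_for_mtu() {
    mtu=$1
    ip_version=${2:-4}
    if [ "$ip_version" = 6 ]; then
        echo $((mtu - 60))
    else
        echo $((mtu - 40))
    fi
}

mss_for_mtu 1400     # prints 1360
mss_for_mtu 1400 6   # prints 1340
```

If the automatic clamp picks the wrong value, the TCPMSS target also accepts a fixed one: swap --clamp-mss-to-pmtu for --set-mss 1360 on a 1400-byte IPv4 path.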

the lesson, again

I will have forgotten this by the next time, and so will you, which is why we keep writing it down. When a connection works for small payloads and hangs for large ones, it is the MTU until proven otherwise. Reach for tcpdump early, because the application logs will lie to you with a clear conscience: both ends genuinely believe they did everything right. The truth is in the packets, where a full-sized frame is going out and quietly not arriving, while a firewall somewhere swallows the one message that would have explained why. It was the MTU. It's always the MTU. Put the kettle on and clamp the MSS.