Ramblings of an aging IT geek
networking

the network was fine until the packets got big

A homelab connection that pinged perfectly and loaded small pages but hung on large transfers, traced to an MTU mismatch eating the packets that mattered.

The symptom was maddening because it was so selective. Ping worked. SSH connected. Small web pages loaded instantly. But pull anything substantial, a git clone, a file copy, a page heavy with images, and the connection would hang stone dead partway through and just sit there until it timed out. A link that works for little things and dies on big things is, nine times out of ten, MTU.

The maximum transmission unit is the largest packet a link will carry. Standard Ethernet is 1500 bytes. The trouble starts when something in the path, a VPN tunnel, a PPPoE link, an overlay network, a switch someone set to jumbo frames and forgot, wants a different number, and the two ends disagree. Small packets sail through under the limit. The first genuinely large packet hits the bottleneck and is too big to forward. Modern TCP stacks set the "don't fragment" bit, so the router isn't allowed to split it either, and is supposed to drop it and send an ICMP "fragmentation needed" message back to the sender so it can shrink. Except half the firewalls on earth block that ICMP, helpfully, on the grounds that ICMP is "dangerous". So the packet vanishes, no message comes back, and the sender keeps cheerfully retransmitting the same too-big packet until the connection times out. This is a path MTU black hole, and it is why your transfer hangs rather than fails honestly.
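To make the "wants a different number" part concrete, here is a small arithmetic sketch of how encapsulation eats into a 1500-byte link. The helper name `inner_mtu` is made up for illustration, and the overhead figures are the commonly quoted ones for PPPoE and for VXLAN over IPv4; a real VPN or overlay may add more or less, so treat them as assumptions to check against your own setup.

```shell
#!/bin/sh
# inner_mtu OUTER_MTU OVERHEAD_BYTES
# (hypothetical helper) prints the largest packet that still fits
# once the encapsulation headers are added on top.
inner_mtu() {
    echo $(( $1 - $2 ))
}

# Assumed overheads, in bytes:
PPPOE_OVERHEAD=8    # 6-byte PPPoE header + 2-byte PPP protocol field
VXLAN_OVERHEAD=50   # over IPv4: 20 IP + 8 UDP + 8 VXLAN + 14 inner Ethernet

echo "PPPoE over 1500: $(inner_mtu 1500 "$PPPOE_OVERHEAD")"
echo "VXLAN over 1500: $(inner_mtu 1500 "$VXLAN_OVERHEAD")"
```

Any packet larger than the printed figure has to be fragmented or dropped at the point where the encapsulation starts, which is exactly where the black hole opens up.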

The way to catch it is to send a packet of a known size and forbid fragmentation, then watch where it stops getting through. On Linux:

$ ping -M do -s 1472 192.168.1.1   # 1472 + 28 bytes header = 1500, works
$ ping -M do -s 1473 192.168.1.1   # one byte over, and...
ping: local error: message too long, mtu=1500

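-M do sets "don't fragment", -s is the payload size, and you add 28 bytes for the ICMP and IP headers (8 and 20) to get the real packet size. The error above is the local interface refusing a packet bigger than its own MTU. A bottleneck further along the path shows up differently: either a "Frag needed" reply from the router in the way, or, if that ICMP is being blocked, plain silence. Walk the size up until packets stop getting through, and the last size that worked, plus 28, is your real path MTU. In my case the tunnel in the middle wanted 1492, not 1500, so the largest payload that got through was 1464, and every packet from 1493 to 1500 bytes was quietly disappearing into the void.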
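Walking the size up by hand gets tedious, so here is a minimal sketch that binary-searches the payload size instead. It assumes the Linux iputils ping (the same `-M do` and `-s` flags as above, plus `-c 1` and `-W 1` for a single probe with a one-second wait), and it assumes the path behaves monotonically: everything up to some size passes and everything above it fails. The function names `probe` and `find_path_mtu` are invented for this example.

```shell
#!/bin/sh
# Sketch: binary-search the largest unfragmentable ping payload that gets a reply.

HEADERS=28   # 20-byte IPv4 header + 8-byte ICMP header

# probe HOST PAYLOAD
# Succeeds if a single ping with "don't fragment" set gets an answer.
# This is the only function that touches the network.
probe() {
    ping -M do -c 1 -W 1 -s "$2" "$1" >/dev/null 2>&1
}

# find_path_mtu HOST [LOW_PAYLOAD] [HIGH_PAYLOAD]
# Prints the path MTU implied by the largest payload that got through.
# Returns 1 if even the smallest probe fails (host down, echo filtered).
find_path_mtu() {
    host=$1
    low=${2:-0}
    high=${3:-1472}
    probe "$host" "$low" || return 1
    # Invariant: a payload of $low is known to pass; nothing above $high does.
    while [ "$low" -lt "$high" ]; do
        mid=$(( (low + high + 1) / 2 ))
        if probe "$host" "$mid"; then
            low=$mid
        else
            high=$(( mid - 1 ))
        fi
    done
    echo $(( low + HEADERS ))
}
```

Running `find_path_mtu 192.168.1.1` takes around a dozen probes to cover the default 0 to 1472 range. A single lost reply can skew the answer downward, so it is worth repeating the run before trusting the number.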

The fix depends on where the mismatch lives. If it's a link you control, set the interface MTU to match the real ceiling:

$ sudo ip link set dev eth0 mtu 1492

If it's a tunnel, the usual move is MSS clamping, where the router rewrites the TCP maximum segment size in the handshake on the way through so neither end ever tries to send a segment too big for the path. On a Linux router that's a single iptables rule that clamps the MSS to fit the path MTU (the MTU minus 40 bytes of IP and TCP headers), and it's the thing that makes PPPoE links behave instead of mysteriously stalling.
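For reference, a sketch of what that clamping rule can look like. It assumes plain iptables (not nftables), IPv4, and a router that forwards the traffic through the mangle table's FORWARD chain; the helper names `mss_for_mtu` and `print_clamp_rules` are hypothetical. The script only prints the rules so they can be read before anything is applied with sudo.

```shell
#!/bin/sh
# Sketch only: prints candidate iptables rules instead of applying them.

# mss_for_mtu MTU
# TCP MSS that fits the given MTU over IPv4 with no options:
# MTU - 20 (IPv4 header) - 20 (TCP header).
mss_for_mtu() {
    echo $(( $1 - 40 ))
}

# print_clamp_rules MTU
# Emits two alternative rules; pick one, not both.
print_clamp_rules() {
    # Option 1: let the kernel work the MSS out from the outgoing route's MTU.
    echo "iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu"
    # Option 2: pin an explicit value computed from the MTU you measured.
    echo "iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss $(mss_for_mtu "$1")"
}

print_clamp_rules 1492
```

Both rules match only SYN packets, because the MSS is negotiated once in the TCP handshake. Clamping does nothing for UDP or other non-TCP traffic, which still depends on the interface MTU being right.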

The reason I call MTU a silent killer is that nothing in the symptoms points at it. There's no error in the logs, ping is green, the dashboard is happy, and the only evidence is "big things hang". You can lose an evening to it easily, because every instinct says check DNS, check the firewall rules you can see, check the application, when the actual fault is a number two layers down that nobody set on purpose. Now it's the first thing I check when a link works for small things and dies on large ones. One ping -M do and you either rule it out in ten seconds or you've found your culprit.