The bug report was the worst kind: "it works, but sometimes it doesn't." SSH was fine. ping was fine. Browsing small pages was fine. But pulling a large file over the link, or loading anything image-heavy, would stall partway and then time out. No errors in the logs, no dropped interface, no packet loss on a plain ping. Everything healthy, and yet clearly not.
This is the signature of an MTU problem, and once you have been bitten by it you start to recognise the shape. It is such a silent killer because the things you reach for first, ping and a quick SSH login, are exactly the things that are too small to trigger it.
why small things work and big things don't
The maximum transmission unit is the largest payload an interface will put on the wire in a single frame. On plain Ethernet that is 1500 bytes. The moment you wrap traffic in a tunnel (a VPN, PPPoE, GRE, a VXLAN overlay), you spend some of those bytes on the encapsulation header. So the usable MTU inside the tunnel drops, often to 1492, 1460, 1400, or lower.
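The arithmetic is simple enough to sketch. The overhead figures below are the textbook minimums for each encapsulation over IPv4, not measurements from any particular link, and inner_mtu is just a helper name made up for the illustration. Keys, options, or encryption all add more.

```shell
# inner_mtu OUTER_MTU OVERHEAD: bytes left for the inner packet.
# (Hypothetical helper, only here to show the subtraction.)
inner_mtu() { echo $(( $1 - $2 )); }

# PPPoE: 6-byte PPPoE header + 2-byte PPP protocol field
echo "PPPoE $(inner_mtu 1500 8)"     # 1492

# GRE: 20-byte outer IPv4 header + 4-byte basic GRE header
echo "GRE   $(inner_mtu 1500 24)"    # 1476

# VXLAN: outer IPv4 20 + UDP 8 + VXLAN 8 + inner Ethernet 14
echo "VXLAN $(inner_mtu 1500 50)"    # 1450
```

Stack two of these, or add IPsec on top, and you are quickly down in the 1400 range.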
TCP is supposed to cope with this. During the handshake each side advertises an MSS (maximum segment size), and in practice the smaller value wins. But each host derives its MSS from its own interface MTU, so a tunnel somewhere in the middle of the path is invisible to the negotiation. That gap is meant to be covered by Path MTU Discovery, and this is where it falls apart. The host sends a full-size packet with the Don't Fragment bit set, a router in the middle finds it too big for the next hop, and it is meant to reply with an ICMP "fragmentation needed" message telling the sender to back off. If that ICMP message is dropped, and a depressing number of firewalls drop it by default, the sender never learns. It keeps retransmitting a packet that physically cannot fit until the connection times out. This is a PMTUD black hole.
Small packets, a ping or the early bytes of an interactive SSH session, fit under any sane MTU, so they sail through. Bulk transfer is where the full-size segments appear, and that is where it dies.
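To put numbers on it, here is a rough sketch of two hosts on ordinary 1500-byte Ethernet talking across a path that narrows to 1400 somewhere in the middle. The function names are invented for the example, and the 40 bytes assume plain IPv4 and TCP headers with no options.

```shell
# mss_for_mtu MTU: largest TCP payload in an IPv4 packet of MTU bytes
# (20-byte IPv4 header + 20-byte TCP header, no options assumed).
mss_for_mtu() { echo $(( $1 - 40 )); }

# min A B: print the smaller of two integers.
min() { if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi; }

# segment_fits MSS PATH_MTU: succeeds if a full-size segment fits the path.
segment_fits() { [ $(( $1 + 40 )) -le "$2" ]; }

# Both ends sit on 1500-byte Ethernet, so both advertise 1460.
client_mss=$(mss_for_mtu 1500)
server_mss=$(mss_for_mtu 1500)
agreed=$(min "$client_mss" "$server_mss")

# The tunnel in the middle only carries 1400 bytes (assumed for the example).
if segment_fits "$agreed" 1400; then
    echo "full segments fit"
else
    echo "full segments of $agreed bytes need $(( agreed + 40 )), path allows 1400"
fi
```

Neither host did anything wrong. They agreed on 1460 because that is all either of them could see.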
finding it
The single most useful command here is ping with the Don't Fragment bit set and a payload size you control. On Linux:
# 1472 bytes payload + 8 bytes ICMP header + 20 bytes IPv4 header = 1500-byte packet
ping -M do -s 1472 10.0.0.1
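If that comes back, your path supports a 1500-byte MTU. If it returns "Message too long" or "frag needed and DF set", or simply gets no reply at all while smaller pings succeed (which is what a black hole looks like), drop the size and try again. Walk it down until packets start getting through: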
ping -M do -s 1464 10.0.0.1 # 1492 on the wire, typical for PPPoE
ping -M do -s 1372 10.0.0.1 # 1400, common for a chunky overlay
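The largest size that succeeds, plus 28, is your real path MTU. The other tool worth knowing is tracepath, which walks the route and reports the path MTU as it goes, without you having to bisect by hand. It relies on the same ICMP replies, though, so across a black hole it can stall where the ping test still gives you an answer.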
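If you would rather not walk it down by hand, the search is easy to script. This is a minimal sketch, assuming Linux iputils ping and IPv4. The names probe and find_path_mtu are mine, the 500 and 1472 bounds are arbitrary defaults, and a single lost packet will fool it, so treat the answer as a strong hint and confirm it with one more manual ping.

```shell
# probe SIZE HOST: succeeds if one DF-set ping with SIZE payload bytes is answered.
probe() {
    ping -M do -c 1 -W 1 -s "$1" "$2" >/dev/null 2>&1
}

# find_path_mtu HOST [LOW] [HIGH]: binary search over ICMP payload sizes.
# Prints the estimated IPv4 path MTU (largest answered payload + 28).
# The body runs in a subshell so its variables do not leak out.
find_path_mtu() (
    host=$1
    lo=${2:-500}
    hi=${3:-1472}
    # If even the smallest probe fails, this is not an MTU problem.
    if ! probe "$lo" "$host"; then
        echo "no reply even at payload $lo" >&2
        exit 1
    fi
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))   # round up so the loop always makes progress
        if probe "$mid" "$host"; then
            lo=$mid
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo $(( lo + 28 ))
)
```

Run it as find_path_mtu 10.0.0.1 and it prints a single number, for example 1400 on a link where 1372 is the largest payload that gets through.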
tcpdump confirms the diagnosis from the other side. Watch a stalled transfer and you will see the same large segment going out again and again with no acknowledgement coming back, whilst the small packets around it are fine.
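It is also worth watching for the ICMP message itself. The filter below matches ICMP type 3 (destination unreachable) with code 4 (fragmentation needed). The interface name is a placeholder, and watch_frag_needed is just a wrapper name I picked.

```shell
# Placeholder interface; substitute the one facing the tunnel.
IFACE=${IFACE:-eth0}

# icmp[0] is the ICMP type, icmp[1] the code: type 3, code 4 = fragmentation needed.
FRAG_NEEDED='icmp[0] == 3 and icmp[1] == 4'

# Print matching packets without name resolution. Needs root or CAP_NET_RAW.
watch_frag_needed() {
    tcpdump -n -i "$IFACE" "$FRAG_NEEDED"
}
```

Run watch_frag_needed on the sending host while a transfer stalls. If the big segment keeps going out and this stays silent, the message is being dropped somewhere between you and the narrow hop.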
fixing it
There are two honest fixes and one workaround.
The first is to set the correct MTU on the interface so PMTUD never has to engage in the first place:
ip link set dev tun0 mtu 1400
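A change made with ip link does not survive a reboot or an interface being torn down and recreated, so it is worth checking that the value is still what you think it is. A small sketch, reading the MTU back from sysfs. The function names and the optional sysfs-root argument are my own additions, the latter only there to make the check easy to try out.

```shell
# link_mtu IFACE [SYSFS_ROOT]: print the MTU the kernel currently has for IFACE.
link_mtu() { cat "${2:-/sys/class/net}/$1/mtu"; }

# check_mtu IFACE EXPECTED [SYSFS_ROOT]: succeed only if the MTU matches.
check_mtu() {
    actual=$(link_mtu "$1" "${3:-}") || return 1
    if [ "$actual" -eq "$2" ]; then
        echo "$1: mtu $actual ok"
    else
        echo "$1: mtu $actual, expected $2" >&2
        return 1
    fi
}
```

Something like check_mtu tun0 1400 in a post-up hook or a monitoring check will tell you when the setting has quietly reverted.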
The second, and the one I lean on for tunnels, is MSS clamping. You tell the router to rewrite the TCP MSS option in passing SYN packets down to something that fits, so neither end ever even tries to send an oversized segment:
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
    -j TCPMSS --clamp-mss-to-pmtu
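--clamp-mss-to-pmtu derives the value from the outgoing interface MTU, which is exactly what you want on a PPPoE or VPN edge. If you need a fixed number instead, swap it for --set-mss 1360, which is a 1400-byte MTU minus 40 bytes of IPv4 and TCP headers. Bear in mind that clamping only rewrites TCP handshakes, so it does nothing for UDP or other protocols.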
The workaround, and it only helps where the offending firewall is one you control, is to stop dropping the ICMP "fragmentation needed" messages so PMTUD can actually work. Those messages should never have been blocked in the first place. ICMP type 3 code 4 is not an attack, it is the network trying to help you.
the lesson
If a link passes ping and SSH but chokes on bulk transfer, suspect MTU before anything else. It cost me an embarrassing amount of time the first time I met it, staring at firewall rules and routing tables that were all perfectly correct. The fault was the encapsulation overhead of a GRE tunnel I had completely forgotten about. Now ping -M do -s is the second thing I type, right after ping, and clamping the MSS on every tunnel I build is just muscle memory. Set it and TCP traffic, at least, should not cost you an afternoon like this again.