A day lost to packets that almost made it

The symptom was the kind that makes you doubt your own competence. SSH into the new site worked. curl against the internal API returned its little JSON greeting instantly. ping was clean, sub-20ms, zero loss over a thousand packets. And yet apt update hung halfway through, a git clone of anything non-trivial froze at a random percentage, and one particular POST to the API never came back. Small things flew. Large things died. Everything that died, died silently.

I want to write this one down properly because I lost the better part of a day to it, and because every time I hit a problem with this exact shape I forget the diagnosis and rediscover it from scratch. So this is partly a runbook for future me, who will absolutely need it again.

The setup

We'd just brought up a new branch office on a site-to-site VPN. WireGuard over a commodity broadband line, terminating on a small Linux box at each end, routing a /24 behind it. Nothing exotic. The link came up, the handshake completed, the routes propagated, and the first round of smoke tests all passed. I marked it done and moved on, which is of course when it started biting people.

The reports trickled in over the next hour. "The wiki is slow." "I can't pull the repo." "The file share keeps stalling." None of it was a clean outage, which is the worst kind. A clean outage you can see. This was a link that worked just enough to look healthy on every quick test and fall over on every real workload.

Ping lies to you

The instinct is to reach for ping, and ping will betray you here, because a default ping is tiny: 56 bytes of payload, 84 once the ICMP and IP headers are on. It sails through. The problem only shows up once a packet gets large enough to matter, and the default tests never send one.

So the first useful move is to make ping send a big packet and forbid fragmentation:

ping -M do -s 1472 10.20.0.1

The -M do sets the don't-fragment bit. The -s 1472 asks for 1472 bytes of payload, which plus 8 bytes of ICMP header plus 20 bytes of IP header is exactly 1500, the standard Ethernet MTU. On a healthy path that succeeds. On ours:

ping: local error: message too long, mtu=1420

There it is. The WireGuard interface has an MTU of 1420, because the encapsulation eats into the 1500 you started with, and that is entirely correct and expected. The bug isn't that the tunnel MTU is smaller. The bug is what happens to packets that don't know it.
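The header arithmetic behind those two numbers is worth having in one place. This is a minimal sketch, not anything from the original setup: the function names are made up for illustration, and the WireGuard overhead figures (outer IP header, 8 bytes of UDP, 32 bytes of WireGuard header and authentication tag) are the standard per-packet costs rather than something measured on this link.

```shell
# Header-budget arithmetic for the big-ping test (IPv4 inside the tunnel).
# Function names are illustrative, not part of any tool.

# Largest ICMP echo payload that fits in one packet of the given MTU:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
  local mtu=$1
  echo $(( mtu - 20 - 8 ))
}

# Inner (tunnel) MTU left after WireGuard encapsulation.
# Overhead per packet: outer IP header (20 for IPv4, 40 for IPv6)
# + 8 bytes UDP + 32 bytes WireGuard (16-byte header, 16-byte auth tag).
# The second argument is the outer IP family; it defaults to 6, the worst case.
wg_inner_mtu() {
  local outer_mtu=$1 outer_family=${2:-6}
  local ip_hdr=40
  [ "$outer_family" = 4 ] && ip_hdr=20
  echo $(( outer_mtu - ip_hdr - 8 - 32 ))
}

icmp_payload_for_mtu 1500   # prints 1472, the payload used in the test above
icmp_payload_for_mtu 1420   # prints 1392, the largest payload that fits wg0
wg_inner_mtu 1500 6         # prints 1420, which matches the MTU ping reported
wg_inner_mtu 1500 4         # prints 1440, the budget if the outer path is IPv4 only
```

So 1420 is the budget that still fits when the outer packet is IPv6, and on this tunnel the matching don't-fragment test is ping -M do -s 1392 rather than 1472.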

Figure: nested packet headers shrinking the usable payload.

Path MTU Discovery, and why it wasn't working

The mechanism that's meant to handle this is Path MTU Discovery. When a router has a packet too big for the next hop and the don't-fragment bit is set, it's supposed to drop the packet and send back an ICMP "fragmentation needed" message telling the sender the correct MTU. The sender shrinks its packets and life goes on. It's an elegant little protocol and it works beautifully right up until something eats the ICMP.

And something always eats the ICMP. Overzealous firewalls block ICMP wholesale because "ICMP is a security risk" is a sentence people repeat without finishing it. A NAT box doesn't forward the unreachable back to the right host. A cloud security group only allows echo. Whatever the cause, the "packet too big" message never arrives, the sender keeps cheerfully transmitting 1500-byte frames, every one of them gets dropped at the tunnel, and the connection hangs. Small flows fit and work. The first full-size segment of a bulk transfer vanishes into the void. This is called a PMTU black hole, and it is the single most common cause of "the network is up but big things don't work" I've ever met.

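You can confirm it by watching the oversized packets arrive. On the tunnel endpoint, capture on the LAN-facing interface (eth0 here is a stand-in for whatever yours is called), because a packet too big for wg0 is dropped before it ever reaches wg0:

tcpdump -ni eth0 'ip[6] & 0x40 != 0 and ip[2:2] > 1420'

Sure enough, full-size packets with DF set, arriving, going nowhere.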
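To put a number on where packets stop getting through, the big-ping test from earlier can be bisected. What follows is a rough sketch under a few assumptions: Linux iputils ping (for -M do and -W), IPv4, and a path where "fits" is monotonic in size. The function names and the default search bounds of 1200 and 1472 are my own choices for illustration. A single lost packet will skew the answer, so treat the result as a hint and rerun it.

```shell
# Bisect the largest don't-fragment ping that crosses a path (IPv4).
# Assumes Linux iputils ping; function names are illustrative.

# Succeeds if one DF ping with the given payload size gets a reply.
# Kept separate so it can be redefined to try the search offline.
probe_fits() {
  local host=$1 payload=$2
  ping -M do -c 1 -W 1 -s "$payload" "$host" >/dev/null 2>&1
}

# Print the largest total IP packet size (payload + 28) that got through,
# searching payload sizes from lo to hi. Prints 0 and fails if even lo is dropped.
find_path_mtu() {
  local host=$1 lo=${2:-1200} hi=${3:-1472} mid
  probe_fits "$host" "$lo" || { echo 0; return 1; }
  while [ "$lo" -lt "$hi" ]; do
    mid=$(( (lo + hi + 1) / 2 ))   # round up so the loop always makes progress
    if probe_fits "$host" "$mid"; then
      lo=$mid                      # mid fits: the answer is mid or larger
    else
      hi=$(( mid - 1 ))            # mid is too big: the answer is smaller
    fi
  done
  echo $(( lo + 28 ))              # add the IPv4 (20) and ICMP (8) headers back
}

# Example, using the peer address from this post:
#   find_path_mtu 10.20.0.1
```

A failed probe here means either a local "message too long" error or a silent timeout. The script does not distinguish the two, which is fine for finding the size but hides which hop is doing the dropping.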

The fix

The right fix, the load-bearing one, is to make TCP negotiate a sane segment size in the first place, so it never tries to send something too big. That's what MSS clamping does. You intercept the SYN packets and rewrite the maximum segment size option down to fit the tunnel:

iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --clamp-mss-to-pmtu

--clamp-mss-to-pmtu sets the MSS based on the outgoing interface MTU, so on the wg0 path it lands at 1380 (1420 minus 20 bytes of IPv4 header and 20 bytes of TCP header). Once that rule was in place on both endpoints, the SYN handshake advertised a segment size that actually fit, TCP stopped trying to send oversized packets, and every stalled transfer started flowing. apt update completed. The repo cloned. The POST came back. No more black hole, because nothing tried to push a packet that needed the ICMP that nobody would deliver.
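For the record, the MSS arithmetic is simple enough to sketch. The helper names below are hypothetical. The second function only prints a rule and does not install it. That rule uses TCPMSS's --set-mss option, a fixed-value alternative to --clamp-mss-to-pmtu, which is not what was deployed here. I include it as an assumption about what you might reach for if the automatic clamp picks the wrong value on some path.

```shell
# MSS arithmetic behind the clamp. Helper names are illustrative.

# MSS that fits a given MTU: subtract the IP header (20 for IPv4, 40 for IPv6)
# and the 20-byte base TCP header.
mss_for_mtu() {
  local mtu=$1 family=${2:-4}
  local ip_hdr=20
  [ "$family" = 6 ] && ip_hdr=40
  echo $(( mtu - ip_hdr - 20 ))
}

# Print (not run) a fixed-value clamp rule for an IPv4 tunnel of the given MTU.
print_fixed_clamp_rule() {
  local mtu=$1
  echo "iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss $(mss_for_mtu "$mtu" 4)"
}

mss_for_mtu 1420 4            # prints 1380, the value the clamp lands on here
mss_for_mtu 1420 6            # prints 1360, what IPv6 traffic through the tunnel would need
print_fixed_clamp_rule 1420   # prints the rule ending in --set-mss 1380
```

The automatic form is still the better default, since it follows the interface MTU if that ever changes. A hard-coded 1380 is one more number to forget about.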

I also gave the paths that rely on don't-fragment a backstop by not blocking ICMP type 3 code 4 at our own firewalls, on principle, but the MSS clamp is what actually carried the day. PMTUD is a nice-to-have. MSS clamping is the thing that doesn't depend on the rest of the internet behaving.

The lesson, again

The frustrating part is that I know this. I've fixed this exact bug on GRE tunnels, on IPsec, on PPPoE links where the 8-byte overhead drops you to 1492 and ruins someone's Tuesday. Every encapsulation that shrinks the usable MTU sets the same trap, and every time the symptoms are identical: small works, large hangs, ping is clean, and you spend an hour suspecting the application before you remember to suspect the packet size.

So the heuristic, written down where I'll find it: if a connection completes its handshake but stalls on bulk transfer, and small requests succeed while large ones hang, stop debugging the application. It is the MTU. It is almost always the MTU. Send a big don't-fragment ping, find the real path MTU, and clamp the MSS to fit. Then go and have the tea you should have had three hours ago.