The symptom was the kind that makes you doubt your own eyes. A service behind a new site-to-site VPN answered some requests instantly and hung forever on others. curl to the health endpoint: fine. curl to anything that returned a real payload: nothing, then a timeout. SSH connected, let me log in, then froze the moment I ran ls on a directory with a lot of entries. Small things worked. Big things didn't.
When the line between "works" and "hangs" is the size of the response, stop guessing and start counting bytes. This is almost never the application. It's the path.
I ran the usual confirmation. Ping with a normal payload was happy. Ping with a large payload and the don't-fragment bit set was not.
$ ping -M do -s 1473 10.20.0.10
PING 10.20.0.10 (10.20.0.10) 1473(1501) bytes of data.
ping: local error: message too long, mtu=1500
^C
That on its own only tells you the local interface MTU. The interesting question is what the path will actually carry. So I walked it down until packets started getting through.
$ ping -M do -s 1422 10.20.0.10 # 1450-byte packet: drops
$ ping -M do -s 1372 10.20.0.10 # 1400-byte packet: replies
So the effective path MTU was somewhere between 1400 and 1450. The VPN added its own encapsulation overhead on top of the standard 1500, and the result was that any full-size TCP segment got silently binned somewhere in the middle. A small HTTP response fits in one undersized segment and sails through. A large one needs full-size segments, which on a 1500-byte interface means 1500-byte packets. Those exceed the path MTU, and they vanish.
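Walking the size down by hand gets tedious after two or three probes. Here is a minimal sketch of the same walk-down as a binary search, assuming Linux iputils ping (where -M do sets don't-fragment) and assuming the lower bound you pass is known to get through. The function names are mine, purely for illustration. A single lost reply reads as "too big", so treat the result as a hint rather than a measurement.

```shell
#!/bin/sh
# Sketch: find the largest ICMP payload that survives the path with DF set.
# Assumes Linux iputils ping; all function names here are illustrative.

# probe HOST SIZE: true if one don't-fragment ping with SIZE payload bytes is answered.
probe() {
    ping -M do -c 1 -W 1 -s "$2" "$1" >/dev/null 2>&1
}

# largest_payload HOST LOW HIGH: print the largest payload in LOW..HIGH that probe accepts.
# Assumes LOW gets through and every size above the path limit fails.
# The body runs in a subshell so its variables stay private.
largest_payload() (
    host=$1
    lo=$2
    hi=$3
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))   # round up so the loop always makes progress
        if probe "$host" "$mid"; then
            lo=$mid
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo "$lo"
)

# An ICMP echo payload sits behind a 20-byte IPv4 header and an 8-byte ICMP header.
payload_to_mtu() { echo $(( $1 + 28 )); }

# A TCP segment sits behind 20 bytes of IPv4 and 20 bytes of TCP header (no options).
mtu_to_mss() { echo $(( $1 - 40 )); }
```

Something like `payload_to_mtu "$(largest_payload 10.20.0.10 1000 1472)"` would then give an estimate of the path MTU, and `mtu_to_mss` turns that into the largest TCP segment that fits. A 1400-byte path MTU leaves room for a 1360-byte MSS, against the 1460 a sender assumes from a 1500-byte interface.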
Normally this fixes itself. Path MTU Discovery exists precisely for this: a router that can't forward a too-big don't-fragment packet is supposed to send back an ICMP "fragmentation needed" message, the sender shrinks its segments, everyone gets on with their lives. The whole mechanism rides on those ICMP messages getting back to the source. And there, of course, was the culprit. A firewall in the path had been configured by someone who reads "ICMP" as "ping" and "ping" as "the thing attackers use", so it dropped the lot. No ICMP, no discovery, no clue. Just a black hole that ate large packets and left small ones alone.
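One way to confirm a black hole like this, rather than infer it, is to watch the sending host for the ICMP that should be arriving. A sketch, assuming tcpdump is installed and run with enough privilege to capture. The function name is mine, and the interface is whatever faces the tunnel on your machine.

```shell
#!/bin/sh
# Sketch: watch for ICMP "fragmentation needed" (type 3, code 4) on an interface.
# Assumes tcpdump is available; the names below are illustrative.

# pcap filter: destination unreachable (type 3) with code 4, fragmentation needed.
FRAG_NEEDED_FILTER='icmp[icmptype] = icmp-unreach and icmp[icmpcode] = 4'

# watch_frag_needed IFACE: print matching packets until interrupted.
watch_frag_needed() {
    tcpdump -n -i "$1" "$FRAG_NEEDED_FILTER"
}
```

Run `watch_frag_needed eth0` in one terminal (eth0 being a guess at the interface name) and repeat a failing transfer in another. On a healthy path the unreachable messages should show up when an oversized packet goes out. Silence while the big transfers hang is the signature of something eating them on the way back.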
Two ways out. The correct one is to let ICMP type 3, code 4 (destination unreachable, fragmentation needed) through the firewall, because Path MTU Discovery is load-bearing infrastructure and breaking it breaks more than this one tunnel. The pragmatic one, while waiting for the firewall change to be approved by people who own a change board, is to clamp the TCP maximum segment size at the tunnel so the endpoints never try to send anything too big in the first place.
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
    -j TCPMSS --clamp-mss-to-pmtu
That made everything work immediately, which is the dangerous bit, because a workaround that works is a workaround you forget to remove. I left the clamp in, got the ICMP rule fixed properly the following week, and wrote "it was the MTU" on a sticky note that has been on my monitor ever since. It usually is.