The symptom was maddeningly specific. Small requests worked. Large ones hung. A health check returning a few bytes of JSON was instant and reliable. A request that returned a real payload, a few kilobytes, would establish the connection, send the request, and then sit there until it timed out. Every time, but only over one particular path.
When a connection completes the handshake fine but stalls partway through moving data, and the size of the data is what decides whether it works, there is one answer that should come to mind before any other. It's the MTU. It's always the MTU.
ruling things out
I didn't jump straight there, because that would deny me the pleasure of a couple of hours chasing the wrong thing first. I checked the application. The application was fine; it sent the response, the logs proved it. I checked for packet loss with a flood ping and saw none, which in hindsight proved nothing, since a flood ping sends small packets by default. I checked the firewall, twice, because firewalls are where hope goes to die, and the rules were permissive.
What gave it away was a small ping versus a large one:
$ ping -c 3 -M do -s 1400 10.20.0.5
PING 10.20.0.5 (10.20.0.5) 1400(1428) bytes of data.
1408 bytes from 10.20.0.5: icmp_seq=1 ttl=63 time=0.9 ms
$ ping -c 3 -M do -s 1472 10.20.0.5
PING 10.20.0.5 (10.20.0.5) 1472(1500) bytes of data.
ping: local error: message too long, mtu=1500
The first ping, with -M do setting the don't-fragment bit, went through at 1400 bytes of payload. A larger one didn't get out at all. So somewhere on this path the usable MTU was below the standard 1500, and any full-sized packet sent with DF set was being dropped on the floor.
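To pin down the exact number rather than guess at it, the same DF ping can be bisected. What follows is a minimal sketch, not what I ran at the time: probe and find_path_mtu are names made up here, the 576-byte floor is an assumption that the path carries at least that much, and the flags assume the Linux iputils ping.

```shell
# Sketch: bisect the largest packet that crosses a path with DF set.
# probe and find_path_mtu are illustrative names, not standard tools.

# Succeeds if one ICMP echo whose IP packet is $2 bytes long reaches host $1.
# ping -s takes the ICMP payload size, so subtract 20 (IPv4) + 8 (ICMP) bytes.
probe() {
    ping -c 1 -W 1 -M do -s "$(( $2 - 28 ))" "$1" >/dev/null 2>&1
}

# Prints the largest packet size in [lo, hi] that probe accepts.
# Assumes lo itself gets through; 576 is an assumed floor, 1500 the ceiling.
find_path_mtu() (
    host=$1 lo=${2:-576} hi=${3:-1500}
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))   # round up so the loop always advances
        if probe "$host" "$mid"; then
            lo=$mid                    # mid fits, look higher
        else
            hi=$(( mid - 1 ))          # mid is too big, look lower
        fi
    done
    echo "$lo"
)
```

Calling find_path_mtu 10.20.0.5 prints the largest size that got a reply. Each step sends a single ping, so one unlucky drop will make it under-report; run it twice before believing the answer.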
The reason it was below 1500 was the part I'd half-forgotten. This path went over a tunnel, and a tunnel wraps your packets in an outer header. Those extra bytes have to come from somewhere, and they come from the payload budget. So the real MTU inside the tunnel was something like 1420, not 1500, and nobody had told the endpoints.
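For the record, here is the arithmetic as a minimal sketch. The 80 bytes of overhead is an assumption chosen to match the 1420 figure (it is what WireGuard budgets for: a 40-byte outer IPv6 header, 8 bytes of UDP and 32 bytes of its own framing); other tunnels cost different amounts. The function names are invented for illustration.

```shell
# Usable MTU inside a tunnel: outer MTU ($1) minus encapsulation overhead ($2).
inner_mtu() { echo $(( $1 - $2 )); }

# Largest value for ping -s that fits an MTU ($1):
# subtract the 20-byte IPv4 header and the 8-byte ICMP header.
max_ping_payload() { echo $(( $1 - 20 - 8 )); }

inner_mtu 1500 80        # prints 1420 (80 is an assumed overhead)
max_ping_payload 1420    # prints 1392
```

So on a 1420-byte tunnel, ping -M do -s 1392 should be the largest probe that gets through, and -s 1393 the first that doesn't.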
why small worked and large didn't
The handshake works because SYN and ACK packets are tiny, well under any MTU you'll meet. The health check works because its response is tiny too. The full payload doesn't work because TCP, having seen no reason to think otherwise, builds segments sized to fill the 1500-byte MTU it found on its local interface. One of those full-sized packets hits the tunnel, is too big to pass with DF set, and should trigger path MTU discovery: the router in the middle sends back an ICMP "fragmentation needed" message telling the sender to use smaller packets.
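To put numbers on that, a small sketch. It assumes IPv4 and TCP headers of 20 bytes each with no options (timestamps would shave a few more bytes off each segment), and tcp_mss and largest_packet are names invented here.

```shell
# MSS for a given MTU ($1): subtract 20 bytes of IPv4 and 20 bytes of TCP header.
tcp_mss() { echo $(( $1 - 40 )); }

# Largest IP packet a sender emits for a message of $1 bytes with MSS $2.
largest_packet() (
    bytes=$1 mss=$2
    if [ "$bytes" -gt "$mss" ]; then
        bytes=$mss               # anything bigger is split into MSS-sized segments
    fi
    echo $(( bytes + 40 ))       # add the headers back
)

mss=$(tcp_mss 1500)                     # 1460, derived from the local interface
largest_packet 200 "$mss"               # prints 240: a health check fits anywhere
largest_packet 4096 "$mss"              # prints 1500: too big for a 1420-byte tunnel
largest_packet 4096 "$(tcp_mss 1420)"   # prints 1420: fits once the MSS is right
```

The threshold is the whole story: any response that fills a segment produces a 1500-byte packet, and everything smaller sails through.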
Should. The ICMP was being dropped. Somewhere on the path, a firewall or a misconfigured box was swallowing ICMP type 3 code 4, which is exactly the message PMTUD depends on. So the sender never learned to shrink its packets. It just kept retransmitting the same too-large segment into a void, which is precisely the hang I was watching. This is the classic PMTUD black hole, and it's nasty because everything looks healthy. The connection is up. There's no error. Data simply stops.
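If you'd rather see the black hole than infer it, watch for the ICMP on the sender while reproducing the hang. A sketch, assuming tcpdump is installed and you have root; watch_frag_needed is an illustrative name, and the interface to pass in is whichever one faces the path.

```shell
# pcap filter matching ICMP destination unreachable (type 3) with code 4,
# i.e. "fragmentation needed and DF set".
FRAG_NEEDED_FILTER='icmp[icmptype] == 3 and icmp[icmpcode] == 4'

# Illustrative helper: print matching packets seen on interface $1.
watch_frag_needed() {
    tcpdump -n -i "$1" "$FRAG_NEEDED_FILTER"
}
```

Running watch_frag_needed eth0 (eth0 being a placeholder) and seeing nothing while the large request keeps retransmitting is the signature: oversized packets go out, and no message comes back to say why.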
the fix, and the better fix
The immediate fix was to set the MTU correctly on the tunnel interfaces so the stack knew the real number from the start:
ip link set dev tun0 mtu 1420
That stops the oversized segments being built in the first place, so you never depend on PMTUD to rescue you. For TCP specifically you can also clamp the maximum segment size on the router at the tunnel boundary. The rule rewrites the MSS option in SYN packets passing through it, so both ends agree on smaller segments during the handshake:
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
    -j TCPMSS --clamp-mss-to-pmtu
MSS clamping is the belt-and-braces option for tunnels, because it fixes the segment size up front rather than relying on an ICMP message that may never arrive. And the real underlying fix, which I raised separately, was to stop dropping the ICMP. Blanket-blocking ICMP feels secure and is actively harmful, because you break path MTU discovery and create exactly this kind of silent hang. "Frag needed" is not an attack, it's the network trying to help you.
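If the box doing the dropping happens to be a Linux host filtering with iptables (an assumption; it could just as well be a vendor firewall with its own syntax), the exception could look something like this. -I inserts at the top of the chain, so the rule lands ahead of any existing drop.

```shell
# Accept ICMP type 3 code 4 addressed to this host...
iptables -I INPUT -p icmp --icmp-type fragmentation-needed -j ACCEPT
# ...and let it through when this host is routing for others.
iptables -I FORWARD -p icmp --icmp-type fragmentation-needed -j ACCEPT
```

That is the narrowest possible hole: one ICMP type and code, nothing else about the ICMP policy changes.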
the lesson, again
I keep relearning this one. Any time a connection establishes fine and then stalls on larger transfers, and especially if a tunnel or a VPN is anywhere on the path, check the MTU before you check anything else. The ping-with-DF trick will tell you in thirty seconds what the application logs will never admit. Small packets fine, big packets gone, is not a subtle signature. It's the MTU. It's always the MTU, and one day I'll check it first.