multi-wan failover that actually fails over


I had multi-WAN failover configured for ages, felt smug about it, and then the primary line had a bad afternoon and the failover did nothing. The internet "worked" and also didn't: pages half-loaded, calls dropped, everything timed out and retried. The backup line sat there, fully functional, untouched. So much for resilience.

The problem was the definition of "down". My failover was watching link state on the WAN interface. If the cable went dead or the modem dropped sync, it would switch. But the line hadn't gone down in that sense. It was up, the interface had carrier, the gateway answered ARP. It was just losing packets and adding latency by the bucketload. As far as my router was concerned the primary was perfectly healthy, because I'd told it to check the wrong thing.

Link state tells you the cable is plugged in. It tells you nothing about whether packets reach the internet. Those are very different questions, and a degraded line answers yes to the first and no to the second.


The fix was an active health check that actually tests reachability, not carrier. On pfSense this is gateway monitoring, set with the Monitor IP field in each gateway's settings (System > Routing > Gateways). Instead of pinging the ISP's own gateway, which stays up even when the network behind it is on fire, point the monitor at something neutral and far enough out to be meaningful. I use public resolvers, 1.1.1.1 on the primary and 8.8.8.8 on the backup, so the two checks aren't testing the same upstream.

The thresholds matter as much as the target. The defaults were too forgiving for a flaky line, so I tightened them:

Probe interval:      1s
Loss thresholds:     10% / 20%      (warning / down)
Latency thresholds:  300ms / 500ms  (warning / down)

Now a gateway gets marked down on sustained loss or latency, not just on a dead cable. The next time the primary started shedding packets, the monitor saw the loss climb past the threshold, marked it down, and the failover policy moved traffic onto the backup. When the primary recovered and stayed healthy for the probe window, traffic moved back.
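To make the threshold logic concrete, here is a minimal sketch of the decision in Python. It is not how pfSense implements gateway monitoring, just the same idea in miniature: keep a sliding window of recent probes, work out loss and average latency, and compare against the warning and down thresholds from above. The GatewayMonitor class, the window size, and the choice of "strictly greater than" for the comparison are all my own assumptions for illustration.

```python
from collections import deque
from typing import Deque, Optional

# Thresholds taken from the settings above.
LOSS_WARN, LOSS_DOWN = 10.0, 20.0    # percent
LAT_WARN, LAT_DOWN = 300.0, 500.0    # milliseconds


class GatewayMonitor:
    """Hypothetical helper: classifies a gateway from recent probe results."""

    def __init__(self, window: int = 60) -> None:
        # Each entry is a round-trip time in ms, or None for a lost probe.
        # The window size is an assumption, not a pfSense default.
        self.results: Deque[Optional[float]] = deque(maxlen=window)

    def record(self, rtt_ms: Optional[float]) -> None:
        self.results.append(rtt_ms)

    def loss_pct(self) -> float:
        if not self.results:
            return 0.0
        lost = sum(1 for r in self.results if r is None)
        return 100.0 * lost / len(self.results)

    def avg_latency_ms(self) -> float:
        answered = [r for r in self.results if r is not None]
        return sum(answered) / len(answered) if answered else 0.0

    def status(self) -> str:
        loss, lat = self.loss_pct(), self.avg_latency_ms()
        # "Past the threshold" is read here as strictly greater than.
        if loss > LOSS_DOWN or lat > LAT_DOWN:
            return "down"
        if loss > LOSS_WARN or lat > LAT_WARN:
            return "warning"
        return "online"


mon = GatewayMonitor(window=10)
for _ in range(10):
    mon.record(20.0)          # ten clean probes at 20 ms
print(mon.status())           # online
for _ in range(3):
    mon.record(None)          # three lost probes push out three good ones
print(mon.status())           # down (3 of the last 10 lost = 30%)
```

The window is what makes the loss "sustained": a single dropped probe out of ten sits at 10% and changes nothing, while a line that keeps dropping them crosses 20% within a few seconds at a 1s interval. Recovery works the same way in reverse, since the bad results have to age out of the window before the status returns to online.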

If you're doing this with plain Linux and policy routing rather than pfSense, the same principle applies: don't trust ip link. Run a real probe, something like a periodic ping or a small script checking loss against a public host, and flip your default route's metric or swap a routing rule based on the result. The mechanism varies, the lesson doesn't.
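For the plain Linux case, a rough sketch of such a script might look like the following. It assumes the iputils ping found on most Linux distributions (for the -I, -W and -q flags and the summary format) and iproute2's "ip route replace". The gateway addresses, interface names and function names are placeholders I made up, and it swaps the default route outright rather than adjusting metrics, which is just one of the options. It also assumes each probe target has its own static host route pinned to its WAN, otherwise the primary's probe stops working once the default route has moved to the backup.

```python
import re
import subprocess
from typing import List, Optional, Tuple

# Hypothetical values: (gateway, interface, probe target) for each line.
PRIMARY = ("192.0.2.1", "eth0", "1.1.1.1")
BACKUP = ("198.51.100.1", "eth1", "8.8.8.8")

LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")
RTT_RE = re.compile(r"= [\d.]+/([\d.]+)/")   # min/avg/max/... -> avg


def parse_ping(output: str) -> Tuple[float, Optional[float]]:
    """Extract (loss percent, average RTT in ms) from ping's summary."""
    loss = LOSS_RE.search(output)
    rtt = RTT_RE.search(output)
    # No summary at all is treated as total loss.
    loss_pct = float(loss.group(1)) if loss else 100.0
    avg_ms = float(rtt.group(1)) if rtt else None
    return loss_pct, avg_ms


def probe(target: str, iface: str, count: int = 10) -> Tuple[float, Optional[float]]:
    """Ping the target out of one interface and parse the result."""
    proc = subprocess.run(
        ["ping", "-q", "-c", str(count), "-W", "1", "-I", iface, target],
        capture_output=True, text=True, check=False,
    )
    return parse_ping(proc.stdout)


def is_healthy(loss_pct: float, avg_ms: Optional[float],
               max_loss: float = 20.0, max_ms: float = 500.0) -> bool:
    """Healthy means replies came back, within the loss and latency limits."""
    return avg_ms is not None and loss_pct <= max_loss and avg_ms <= max_ms


def pick_route(primary_ok: bool) -> List[str]:
    """Build the command that points the default route at the chosen line."""
    gateway, iface, _ = PRIMARY if primary_ok else BACKUP
    return ["ip", "route", "replace", "default", "via", gateway, "dev", iface]


def main() -> None:
    # Needs root to change routes; run it from cron or a systemd timer.
    _, iface, target = PRIMARY
    loss_pct, avg_ms = probe(target, iface)
    subprocess.run(pick_route(is_healthy(loss_pct, avg_ms)), check=True)
```

Nothing here calls main() on its own, so the parsing and decision logic can be tried safely first, for example by feeding parse_ping a saved ping summary. A real version would want some hysteresis before failing back, so a flapping primary doesn't drag the route back and forth every run.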

The lesson being: failover that triggers on link state is failover that handles the failure mode you'll almost never see, while ignoring the one you'll actually get. Lines rarely die cleanly. They degrade, they flap, they lose ten percent and limp. If your health check can't see that, you don't have failover, you have a backup line and a false sense of security. Test what you actually care about, which is "can packets get out", and set the thresholds tight enough to notice when the answer turns to no.