Ramblings of an aging IT geek
networking

multi-wan failover that actually fails over

Why a second WAN link only counts as failover once you stop trusting link-state and start probing real targets.

I had "multi-WAN" for about a year before I had failover. The two are not the same thing, and the gap between them is where you find out at 23:00 on a Sunday that the secondary link was never actually carrying traffic.

The trap is link-state. My first attempt watched the WAN interface and switched over when the link went down. Trouble is, the link almost never goes down. The fibre ONT stays happily lit while the ISP's upstream falls over somewhere I can't see. Carrier present, no packets. From the router's point of view everything is fine, and it cheerfully keeps blackholing every connection out of a dead pipe.

The fix is to stop asking the interface and start asking the internet. On OPNsense I set the gateway monitoring to probe real, well-distributed addresses (1.1.1.1 on the primary, 9.9.9.9 on the secondary, deliberately different so a single resolver outage doesn't take both monitors down), with packet-loss and latency thresholds rather than a binary up/down. When loss on the primary stays over the threshold for a few seconds, the gateway is marked down and the policy route shifts to the backup. Carrier state never enters into it.
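To make the threshold idea concrete, here is a minimal sketch in Python of a monitor that decides up/down from recent probe results rather than from carrier. It is not how OPNsense or dpinger implement it. The class name, the window size and both thresholds are assumptions picked for illustration, not defaults from any product.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class GatewayMonitor:
    """Toy loss/latency monitor. All numbers are illustrative assumptions."""
    window: int = 10                     # how many recent probes to consider
    loss_threshold: float = 0.2          # lost fraction at which the gateway is down
    latency_threshold_ms: float = 500.0  # average RTT at which the gateway is down
    samples: Deque[Optional[float]] = field(default_factory=deque)

    def record(self, rtt_ms: Optional[float]) -> None:
        """Store one probe result. None means the probe got no reply."""
        self.samples.append(rtt_ms)
        while len(self.samples) > self.window:
            self.samples.popleft()

    def is_up(self) -> bool:
        """Judge the gateway on loss and latency over the window, never on carrier."""
        if not self.samples:
            return True  # no evidence yet, so assume up
        replies = [s for s in self.samples if s is not None]
        lost = len(self.samples) - len(replies)
        if not replies or lost / len(self.samples) >= self.loss_threshold:
            return False
        return sum(replies) / len(replies) < self.latency_threshold_ms


def pick_gateway(primary: GatewayMonitor, backup: GatewayMonitor) -> str:
    """Prefer the primary; use the backup only when the primary is judged down."""
    if primary.is_up():
        return "primary"
    if backup.is_up():
        return "backup"
    return "primary"  # both look dead, so there is nothing better to choose
```

The point of the window is that one dropped probe doesn't flap the route, while a sustained run of losses does. Note that nothing in here ever looks at the interface.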

Test it properly: don't unplug the cable, that's the easy case. Null-route the monitor target upstream, or just have the ISP fall over on its own (mine obliges roughly monthly). If traffic keeps flowing, you have failover. If it doesn't, you have two WAN links and a false sense of security.
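"Traffic keeps flowing" is easier to judge with a number attached. The sketch below is one way to get that number, under some assumptions: it runs on a LAN client behind the router, it opens a fresh TCP connection to a target of your choosing once a second, and it reports the longest stretch of failed attempts. The function names are made up for this example, and the target should be something other than the monitor address you are null-routing.

```python
import socket
import time
from typing import List, Optional, Tuple

Sample = Tuple[float, bool]  # (seconds since start, connection succeeded)


def try_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refusals and unreachable networks
        return False


def watch(host: str, port: int, duration_s: float, interval_s: float = 1.0) -> List[Sample]:
    """Attempt a new connection every interval_s seconds for duration_s seconds."""
    samples: List[Sample] = []
    start = time.monotonic()
    while time.monotonic() - start < duration_s:
        stamp = time.monotonic() - start
        samples.append((stamp, try_connect(host, port)))
        time.sleep(interval_s)
    return samples


def longest_outage(samples: List[Sample]) -> float:
    """Longest time in seconds from the first failure of a run to the next success."""
    longest = 0.0
    outage_start: Optional[float] = None
    for stamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = stamp
        elif ok and outage_start is not None:
            longest = max(longest, stamp - outage_start)
            outage_start = None
    if outage_start is not None:
        # Never recovered: count the outage up to the last sample taken.
        longest = max(longest, samples[-1][0] - outage_start)
    return longest
```

Start something like `longest_outage(watch("example.com", 443, 120))` just before breaking the primary. A short gap that ends on its own suggests the failover worked. A gap that runs to the end of the test means you are back to two WAN links and a false sense of security.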