multi-wan failover that actually fails over

I had two WANs for the best part of a year before I found out that only one of them worked. Not that the second was misconfigured; it had carried traffic fine during the initial test. The problem was subtler and far more embarrassing: I'd built a failover that watched the wrong thing.

The setup is the usual homelab story. A fibre line as primary, a 4G connection as backup, both terminated on an OPNsense box doing gateway-group failover. The logic, as I'd configured it, was "if the primary interface goes down, switch to backup". Which sounds correct and is the default mental model. It is also wrong, because an interface being up tells you almost nothing about whether packets reach the internet.

the day the test lied

I only learned this because the fibre had a genuinely odd fault. The line stayed up, the PPPoE session stayed up, the interface showed a happy green light, and yet nothing past the ISP's first hop was reachable. Some routing problem on their side. From OPNsense's point of view the primary gateway was perfectly fine, so it never failed over, and everything behind it sat there timing out for the better part of an afternoon while I assumed the outage was wider than it was.

The lesson landed hard. "Is the link up" and "can I reach the internet" are different questions, and a failover that only answers the first is decoration.

monitoring the path, not the cable

The fix in OPNsense is to give each gateway a monitor IP, an address out on the actual internet that the dpinger daemon pings continuously. The trick is choosing it well:

Don't use the ISP's own gateway. That's the thing that lied to me. It'll happily answer while the path beyond it is broken.
Don't use 8.8.8.8 for both WANs. If Google has a wobble, or your route to it has a wobble, both gateways look dead at once and you've achieved nothing.
Pick a different, stable, far-away target per WAN. I use one of the Quad9 addresses for one and a Cloudflare one for the other, so a single upstream outage can't blind both monitors simultaneously.
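
Those three rules are easy to check mechanically. Here is a minimal Python sketch of such a check; the WAN names, the gateway addresses (taken from documentation ranges) and the choice of 9.9.9.9 and 1.1.1.1 are illustrative assumptions, not my actual config.

```python
from collections import Counter

# Illustrative values only: the gateway addresses come from documentation
# ranges, and the WAN names are made up for this sketch.
WANS = {
    "fibre": {"gateway": "192.0.2.1", "monitor": "9.9.9.9"},
    "lte": {"gateway": "198.51.100.1", "monitor": "1.1.1.1"},
}


def check_monitor_plan(wans):
    """Return a list of human-readable problems with a monitor IP plan."""
    problems = []
    # Count how many WANs point at each monitor address.
    counts = Counter(cfg["monitor"] for cfg in wans.values())
    for name, cfg in sorted(wans.items()):
        if cfg["monitor"] == cfg["gateway"]:
            problems.append(f"{name}: monitor is the ISP gateway itself")
        if counts[cfg["monitor"]] > 1:
            problems.append(f"{name}: monitor {cfg['monitor']} is shared with another WAN")
    return problems


print(check_monitor_plan(WANS) or "plan looks sane")
```

It only catches the two mechanical mistakes. Whether a target is stable and far enough away is still a judgment call.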

There's a static-route subtlety that bites everyone here. If you tell dpinger to monitor 9.9.9.9 via the backup WAN, you must pin a static route for 9.9.9.9 out of that interface. Otherwise the monitor traffic follows the default route (the primary) and you're once again testing the wrong path. OPNsense mostly handles this for you, but verify it with a traceroute from the right source interface rather than trusting the GUI.
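
The traceroute check can be scripted. The Python sketch below builds the command and reads hop 1 from the output. It assumes a traceroute that accepts -n (no DNS) and -s (source address), as the FreeBSD one does, and it assumes the first hop you see is the backup WAN's gateway, which may not hold on every 4G or PPPoE link. The addresses and the sample output are made up for illustration.

```python
import re


def build_traceroute_cmd(target, source_ip):
    """Build a traceroute command that originates from a given local address.

    -n skips reverse DNS, -s sets the source address of the probes.
    """
    return ["traceroute", "-n", "-s", source_ip, target]


def first_hop(traceroute_output):
    """Return the address of hop 1 from `traceroute -n` output, or None."""
    for line in traceroute_output.splitlines():
        match = re.match(r"\s*1\s+(\S+)", line)
        if match:
            hop = match.group(1)
            # A "*" means hop 1 did not answer, so there is no address.
            return None if hop == "*" else hop
    return None


def leaves_via(traceroute_output, expected_gateway):
    """True if the first hop is the gateway we expected to leave through."""
    return first_hop(traceroute_output) == expected_gateway


# Hypothetical output, shaped like `traceroute -n` on the backup WAN.
SAMPLE = """traceroute to 9.9.9.9 (9.9.9.9), 64 hops max, 40 byte packets
 1  198.51.100.1  38.120 ms  35.904 ms  36.411 ms
 2  203.0.113.9  41.002 ms  40.517 ms  42.380 ms
"""

print(build_traceroute_cmd("9.9.9.9", "198.51.100.7"))
print(leaves_via(SAMPLE, "198.51.100.1"))

# On the firewall itself the real output could be captured with:
#   subprocess.run(build_traceroute_cmd(...), capture_output=True, text=True).stdout
```

If the first hop turns out to be the primary's gateway, the monitor is riding the default route and the static route is missing.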

With that in place the behaviour finally matches the intent. dpinger watches latency and loss to a real destination on each WAN, and when loss on the primary crosses the threshold the gateway group demotes it and traffic moves to the 4G line within a few seconds. I tested it the honest way this time, by unplugging the fibre at the ONT and watching a continuous ping to an external host blip and recover, rather than by reading a status page.
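
The decision itself is simple enough to sketch in Python. This is not dpinger's or OPNsense's actual code, only the shape of the logic as I understand it: measure loss over a window of recent probes, then use the most-preferred gateway that is still under the threshold. The 20% threshold, the tier numbers and the names are arbitrary assumptions for the example.

```python
def loss_percent(replies):
    """Percentage of lost probes in a window; replies is a list of booleans."""
    if not replies:
        return 100.0  # no data is treated as fully lost
    lost = sum(1 for ok in replies if not ok)
    return 100.0 * lost / len(replies)


def pick_gateway(group, windows, loss_threshold=20.0):
    """Pick the lowest-tier gateway whose recent loss is under the threshold.

    group:   list of (tier, name) pairs, lower tier is preferred
    windows: name -> list of recent probe results (True means a reply came back)
    """
    for _tier, name in sorted(group):
        if loss_percent(windows.get(name, [])) < loss_threshold:
            return name
    return None  # every gateway is over the threshold


GROUP = [(1, "fibre"), (2, "lte")]

healthy = {"fibre": [True] * 10, "lte": [True] * 10}
broken_path = {"fibre": [True] * 4 + [False] * 6, "lte": [True] * 10}

print(pick_gateway(GROUP, healthy))      # fibre
print(pick_gateway(GROUP, broken_path))  # lte
```

Note that the interface state never appears in it. The only input is whether probes to a real destination came back, which is the whole point.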

The wider point is one I keep relearning in different forms. Redundancy you haven't broken on purpose isn't redundancy, it's a theory. The interface light is the easiest thing to monitor and the least meaningful. Monitor the thing you actually care about, the path, and test it by causing the failure, not by admiring the green light.