I have two internet lines into the house, a primary and a cheaper backup, and for the better part of a year I had been quietly proud of my failover setup. Then the primary line did the worst possible thing, which was to not go down, and I discovered that my failover didn't fail over.
The primary didn't drop. The link stayed up, the router still had a default route through it, the interface was green. But upstream, somewhere past my modem, something was broken: packets went out and most never came back. From my router's point of view everything was fine. The cable was plugged in, the link light was on, the route was valid. So it cheerfully kept sending all my traffic into a hole.
That's the core lesson. Link state is not connectivity. A WAN interface being "up" tells you the physical and link layers are happy, and tells you nothing about whether you can actually reach the internet. My failover logic was watching the wrong thing. It was watching the cable, not the connection.
The fix is an active health check that proves real reachability, end to end, and crucially does so out of each WAN independently. On my OPNsense box this is the gateway monitoring feature, which pings a target through a specific gateway and watches latency and packet loss. The important detail is the target. Your own ISP's gateway is the wrong choice, because it stays reachable even when the path beyond it is broken, which is exactly the failure I had. You ping something well past it, a public resolver like 1.1.1.1 or 8.8.8.8, pinned to go out of that specific WAN. When loss on the primary crosses a threshold, the gateway is marked down and the routes shift to the backup, link light be damned.
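To make the idea concrete, here is a minimal sketch of a reachability check in Python. It is not how OPNsense implements gateway monitoring; it swaps the ICMP ping for a TCP connect so it runs without root, and the names `tcp_probe`, `loss_fraction`, `gateway_is_down`, the port, and the 20% threshold are all assumptions of mine. Binding to a source address only forces the probe out of one WAN if your routing policy sends that source out of that WAN, so treat it as an illustration of the principle rather than a drop-in monitor.

```python
import socket
from typing import Callable


def tcp_probe(target: str, source_ip: str, port: int = 53, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to target succeeds from source_ip.

    source_ip is assumed to be the local address of one specific WAN.
    """
    try:
        with socket.create_connection(
            (target, port), timeout=timeout, source_address=(source_ip, 0)
        ):
            return True
    except OSError:
        # Covers timeouts, refused connections and unusable source addresses.
        return False


def loss_fraction(probe: Callable[[], bool], count: int = 10) -> float:
    """Run the probe `count` times and return the fraction that failed."""
    if count <= 0:
        raise ValueError("count must be positive")
    failures = sum(1 for _ in range(count) if not probe())
    return failures / count


def gateway_is_down(loss: float, threshold: float = 0.2) -> bool:
    """Judge the gateway by measured loss, never by link state."""
    return loss >= threshold
```

A caller would wrap the probe for one WAN, for example `loss_fraction(lambda: tcp_probe("1.1.1.1", wan1_ip))` with `wan1_ip` standing in for that WAN's local address, and feed the result to `gateway_is_down`.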
A few things I learned tuning it:
- Bind the monitor target to the gateway rather than letting it follow the routing table, or the check will helpfully succeed by going out the other WAN and tell you nothing.
- Pick a threshold with hysteresis. A single dropped ping is not an outage. I went with sustained loss over several seconds before failing over, and a delay before failing back, so a flapping line doesn't flap my whole network with it.
- Test it for real. Don't unplug the cable, because that's the failure mode that already worked. Block the monitor target upstream, or null-route it, and watch whether failover actually triggers. Mine didn't, the first time, which is the entire reason this post exists.
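The hysteresis point is the easiest one to get wrong, so here is a rough sketch of the logic in Python. It assumes one health sample per second, and the class name `FailoverState` and the default counts (5 bad samples to fail over, 60 good samples to fail back) are illustrative values of mine, not settings taken from OPNsense.

```python
from dataclasses import dataclass


@dataclass
class FailoverState:
    fail_after: int = 5      # consecutive bad samples before failing over
    recover_after: int = 60  # consecutive good samples before failing back
    on_backup: bool = False
    bad_streak: int = 0
    good_streak: int = 0

    def observe(self, primary_healthy: bool) -> bool:
        """Record one health sample; return True while traffic should use the backup."""
        if primary_healthy:
            self.good_streak += 1
            self.bad_streak = 0
        else:
            self.bad_streak += 1
            self.good_streak = 0

        if not self.on_backup and self.bad_streak >= self.fail_after:
            self.on_backup = True   # sustained loss: fail over
        elif self.on_backup and self.good_streak >= self.recover_after:
            self.on_backup = False  # sustained health: fail back
        return self.on_backup
```

Because each streak resets on the opposite result, a single dropped ping never triggers failover, and a line that alternates between good and bad never accumulates enough consecutive good samples to pull traffic back prematurely.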
After all that it now does the thing it always claimed to do. When the primary goes grey rather than dark, the box notices within a few seconds, swings everything onto the backup, and swings it back once the primary has been healthy for a minute. The link light, it turns out, was never the point.