I had a backup WAN for months that I was quietly proud of, right up until the primary actually failed and the backup didn't take over. The link hadn't gone down in the way I'd planned for. The interface stayed up, the PPPoE session stayed up, and as far as the router was concerned everything was fine. It just couldn't pass traffic past the ISP's first hop.
The mistake was using link state as my health signal. "Is the cable plugged in" is not the same question as "can I reach the internet", and the gap between them is exactly where the annoying failures live. A modem that's powered but not syncing, an upstream outage at the ISP, a routing fault three hops away: the interface stays green through all of it.
The fix was to probe real targets instead. OPNsense's gateway monitoring lets you set a monitor IP per gateway, so I point each WAN at something well outside my ISP: a couple of public resolvers on different networks. The dpinger daemon then watches latency and loss to those targets continuously. When the probes stop coming back, the gateway is marked down and the failover policy actually triggers, regardless of what the link layer claims.
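To make the idea concrete, here is a minimal sketch of probe-based health checking in Python. It is not how dpinger is implemented, and the thresholds, window size, and gateway names are illustrative assumptions rather than OPNsense defaults. It only shows the shape of the decision: keep a sliding window of recent probe results per gateway, derive loss and latency from it, and let the failover choice depend on that instead of on link state.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional


@dataclass
class ProbeWindow:
    """Sliding window of recent probe results for one gateway.

    Each entry is a round-trip time in milliseconds, or None when the
    probe got no reply. All thresholds here are illustrative assumptions.
    """
    size: int = 20
    max_loss_pct: float = 20.0
    max_latency_ms: float = 500.0
    results: Deque[Optional[float]] = field(default_factory=deque)

    def record(self, rtt_ms: Optional[float]) -> None:
        # Append the newest result and drop the oldest beyond the window.
        self.results.append(rtt_ms)
        while len(self.results) > self.size:
            self.results.popleft()

    def loss_pct(self) -> float:
        if not self.results:
            return 0.0
        lost = sum(1 for r in self.results if r is None)
        return 100.0 * lost / len(self.results)

    def avg_latency_ms(self) -> Optional[float]:
        replies = [r for r in self.results if r is not None]
        return sum(replies) / len(replies) if replies else None

    def is_up(self) -> bool:
        if not self.results:
            return True  # no evidence yet, so do not fail over
        if self.loss_pct() >= self.max_loss_pct:
            return False
        avg = self.avg_latency_ms()
        return avg is not None and avg <= self.max_latency_ms


def pick_gateway(priority: List[str],
                 windows: Dict[str, ProbeWindow]) -> Optional[str]:
    """Return the first gateway in priority order whose probes look healthy."""
    for name in priority:
        if windows[name].is_up():
            return name
    return None
```

Note that link state never appears in the decision. A gateway whose interface is up but whose probes go unanswered falls out of the running as soon as its loss crosses the threshold, which is exactly the case that link-state monitoring misses.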
Two details mattered. Pick monitor targets that aren't going to vanish and aren't on a path you share with the thing you're testing, or the probes will mislead you: a target that goes away on its own produces a false positive, marking a healthy gateway down. And test it by pulling the upstream, not the cable, because pulling the cable only proves the easy case you already had working. The first time I yanked the primary's modem power and watched traffic shift across without dropping my SSH session, it finally felt earned.
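The target-selection advice can be turned into a quick sanity check. The sketch below is my own illustration, not an OPNsense feature: the function name, the gateway names, and the ISP prefixes are hypothetical, and the addresses come from the reserved documentation ranges. It flags a monitor target that sits inside the ISP's own address space for that WAN, and a target shared between gateways, since one outage at a shared target would flag every gateway that uses it.

```python
import ipaddress
from typing import Dict, List


def check_monitor_targets(
    targets: Dict[str, str],
    isp_prefixes: Dict[str, List[str]],
) -> List[str]:
    """Return a list of problems found; an empty list means none.

    targets maps gateway name -> monitor IP.
    isp_prefixes maps gateway name -> prefixes known to belong to that ISP.
    Both are supplied by hand here (hypothetical inputs).
    """
    problems: List[str] = []
    seen: Dict[str, str] = {}
    for gateway, target in targets.items():
        addr = ipaddress.ip_address(target)
        # A target inside the ISP's own network may keep answering
        # while everything beyond the ISP is unreachable.
        for prefix in isp_prefixes.get(gateway, []):
            if addr in ipaddress.ip_network(prefix):
                problems.append(
                    f"{gateway}: {target} is inside ISP prefix {prefix}"
                )
        # The same target on two gateways ties their fates together.
        if target in seen:
            problems.append(
                f"{gateway}: {target} is also the monitor for {seen[target]}"
            )
        else:
            seen[target] = gateway
    return problems
```

For example, with `targets = {"WAN_PRIMARY": "192.0.2.1", "WAN_BACKUP": "203.0.113.53"}` and `192.0.2.0/24` listed as the primary ISP's prefix, the check reports the primary's target and leaves the backup's alone. It cannot tell you whether a target will stay online, so that part remains a judgment call.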