I have two WAN connections at home, a proper fibre line and a cheap backup, and for ages my "failover" only worked when the primary failed in the one specific way I'd tested: someone unplugging the cable. Link goes down, router switches to backup, very tidy. The problem is that real outages almost never look like that.
What actually happens is the line stays up at layer 2, the modem keeps a happy green light, and yet nothing gets to the internet. The PPPoE session is fine, the link is fine, and packets quietly vanish somewhere upstream. Link-state failover sees a live interface and cheerfully keeps routing into the void.
The fix is to stop trusting the link and start trusting reachability. On my edge box that means an active probe against something stable, and demoting the route when the probe stops answering. The shape of it:
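```shell
# ping two unrelated targets out of the primary uplink (-I binds to the
# interface; ppp0 is a placeholder for whatever your primary WAN is)
# if both go quiet, the primary is down regardless of link state
fping -q -c 3 -p 200 -I ppp0 1.1.1.1 9.9.9.9
```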
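One detail the comment glosses over: fping's exit status is non-zero as soon as any single target is unreachable. Keying failover on that exit code would trip on one remote outage, which is exactly what the second target is meant to prevent. The sketch below is a minimal POSIX sh version that counts how many targets answered instead. It assumes the primary WAN is an interface called `ppp0` (a placeholder) and uses a hypothetical helper name, `probe_primary`. fping's `-a` prints only the targets that replied, and `-I` keeps the probes on the primary even after the default route has moved to the backup.

```shell
#!/bin/sh
# Placeholder name for the primary WAN interface; override via the environment.
PRIMARY_IF="${PRIMARY_IF:-ppp0}"

# probe_primary (hypothetical helper): exit 0 if at least one target
# answers via the primary interface, exit 1 only when every target is silent.
probe_primary() {
    # -a: print only targets that replied, one per line
    # -r 2 / -t 500: two retries, 500 ms initial timeout per target
    # -I: bind probes to the primary interface regardless of the default route
    # stderr is discarded; fping's own exit code is deliberately ignored.
    alive=$(fping -a -r 2 -t 500 -I "$PRIMARY_IF" 1.1.1.1 9.9.9.9 2>/dev/null | grep -c .)
    [ "$alive" -gt 0 ]
}
```

Calling it as `probe_primary || echo "primary down"` is enough to drop it into a loop or cron job. Binding with `-I` relies on SO_BINDTODEVICE, so it typically needs root or CAP_NET_RAW.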
Probe two unrelated destinations so a single remote outage doesn't trigger a needless flap, run it every few seconds, and require a couple of consecutive failures before switching so you're not bouncing on one dropped packet. Equally, require a few consecutive successes before failing back, because flapping between links is worse than sitting on the backup for an extra minute.
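The switching rule above is a small hysteresis state machine. Here is a minimal POSIX sh sketch, assuming each probe interval has already been reduced to a single `up` or `down` result for the primary. The thresholds are assumptions, since the rule only says "a couple" and "a few". The function and variable names are hypothetical.

```shell
#!/bin/sh
# Assumed thresholds: fail over quickly, fail back slowly.
FAIL_THRESHOLD=2      # consecutive failures before moving to the backup
RECOVER_THRESHOLD=3   # consecutive successes before returning to the primary

state=primary   # uplink that should currently carry the default route
fails=0         # consecutive "down" results
oks=0           # consecutive "up" results

# step RESULT: feed one probe result ("up" or "down") and update $state.
# It updates globals instead of printing, so call it directly, not via $(...).
step() {
    if [ "$1" = up ]; then
        fails=0
        oks=$((oks + 1))
    else
        oks=0
        fails=$((fails + 1))
    fi
    if [ "$state" = primary ] && [ "$fails" -ge "$FAIL_THRESHOLD" ]; then
        state=backup
    elif [ "$state" = backup ] && [ "$oks" -ge "$RECOVER_THRESHOLD" ]; then
        state=primary
    fi
}
```

A driver loop would run the probe every few seconds, call `step up` or `step down`, and touch the routing table only when `$state` differs from its previous value. Each success resets the failure count and each failure resets the success count, so a single stray packet on either side never moves the route.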
Since I made that change the failover does what the word actually promises. The fibre had one of its silent half-hours a fortnight ago, the kind where the light's still green, and the only way I knew was the backup's higher latency showing up in a graph. Nobody in the house noticed, which is the entire point.