Ramblings of an aging IT geek

Multi-WAN Failover That Actually Fails Over

Getting two WAN links to fail over cleanly at home, and why "the second link is up" isn't the same as "the internet works".

I have two internet connections at home: a fast fibre line, and a cheaper backup over a different provider. For about a year the "failover" was a lie I told myself. The backup was plugged in, the router knew it existed, and the one time the fibre actually went down, nothing failed over at all. Everything just stopped, and I sat there watching a load of green status lights insist it was all fine.

The lesson, learned the slow way: link state is not connectivity. My router saw the fibre interface as up because the PPPoE session through the ONT was still alive. The problem was three hops upstream at the provider, somewhere I couldn't see. From the router's point of view the link was perfect. From mine, nothing loaded. Failover that only triggers when an interface goes physically dark will miss most of the outages you actually care about.

The fix is to stop trusting the interface and start testing the path. I run a recurring health check on each WAN that pings a few well-known, reliable hosts through that specific link, and only marks the link healthy if it can reach the wider internet. Not the gateway, not the provider's DNS: the actual far side. If the probes fail for long enough, that link's route gets withdrawn and traffic moves to the other link.

config interface 'wan'
    option proto 'pppoe'

config interface 'wanb'
    option proto 'dhcp'
    option metric '20'

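# health check, per link
config rule
    option interface 'wan'
    # probe targets on the far side of the provider
    list track_ip '1.1.1.1'
    list track_ip '8.8.8.8'
    # how many targets must answer for a check to pass
    option reliability '1'
    # pings sent to each target per check
    option count '3'
    # consecutive good checks before the link is trusted again
    option up '3'
    # consecutive failed checks before the link is declared dead
    option down '3'
    # seconds between checks
    option interval '5'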
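
To make the moving parts concrete, here is roughly what one health check amounts to, sketched in Python rather than in whatever the router actually runs. It assumes a Linux-style ping that accepts -I, -c and -W. The names ping_via and check_passes are made up for illustration, not part of any real tool.

```python
import subprocess
from typing import Callable, Iterable


def ping_via(interface: str, host: str, count: int = 3, timeout_s: int = 2) -> bool:
    """Ping host out of one specific interface; True if any reply came back.

    Assumes a Linux-style ping (-I interface, -c count, -W per-reply timeout).
    """
    cmd = ["ping", "-I", interface, "-c", str(count), "-W", str(timeout_s), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0  # ping exits non-zero when nothing answered


def check_passes(
    interface: str,
    targets: Iterable[str],
    reliability: int = 1,
    probe: Callable[[str, str], bool] = ping_via,
) -> bool:
    """One health check: pass if at least `reliability` targets answered."""
    answered = sum(1 for host in targets if probe(interface, host))
    return answered >= reliability
```

Calling check_passes("pppoe-wan", ["1.1.1.1", "8.8.8.8"]) would run one check against the fibre link, where "pppoe-wan" is a placeholder for whatever that interface is called on your router. The probe argument is only there so the logic can be exercised without touching the network.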

A few things I got wrong before it behaved:

  • Pick stable, unrelated probe targets. If both links probe the same host and that host has a sulk, both links go "down" at once and you've built a coordinated outage. Use a couple of independent, boringly reliable IPs.
  • Tune the up/down counts. Too twitchy and a few dropped packets flap you back and forth, which is worse than just being down. I want a handful of consecutive failures before I declare a link dead, and a clean recovery window before I trust it again.
  • Mind the asymmetry. The backup is slower, so it's tempting to fail back the instant the fibre returns. I don't: the primary has to stay healthy for a minute first, otherwise a flapping primary drags everything along with it.
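
The up/down counting in the last two points is a small state machine, and it is easier to reason about written out. This is a minimal sketch, not the router's implementation; LinkTracker and its field names are hypothetical. The default up_after of 12 is just arithmetic for this example: at one check every 5 seconds, 12 clean checks is about a minute.

```python
from dataclasses import dataclass


@dataclass
class LinkTracker:
    """Tracks one WAN link and only flips state after a run of identical results."""

    down_after: int = 3   # consecutive failed checks before declaring the link dead
    up_after: int = 12    # consecutive good checks before trusting it again
    healthy: bool = True
    _streak: int = 0      # current run of results that disagree with `healthy`

    def record(self, check_ok: bool) -> bool:
        """Feed in one check result; returns the (possibly updated) link state."""
        if check_ok == self.healthy:
            self._streak = 0  # result agrees with the current state, so reset the run
            return self.healthy
        self._streak += 1
        needed = self.up_after if check_ok else self.down_after
        if self._streak >= needed:
            self.healthy = check_ok
            self._streak = 0
        return self.healthy
```

Because any result that agrees with the current state resets the streak, a couple of dropped probes never add up to an outage, and a primary that comes back for ten seconds and dies again never gets trusted.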

The real test is the one nobody enjoys: physically unplug the primary mid-call and time how long until things recover. Mine now drops connections briefly and re-establishes on the backup within a few seconds, which for a video call is a hiccup rather than a disaster. Then plug it back in and confirm it comes home cleanly.
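
If you'd rather have a number than a feeling, something like the following can run on a laptop while you pull the cable. It is a rough sketch under a few assumptions: it measures how long new TCP connections fail, not how an existing call behaves, it assumes 1.1.1.1 accepts connections on port 443, and the function names are invented for this example.

```python
import socket
import time
from typing import List, Optional, Tuple


def can_connect(host: str = "1.1.1.1", port: int = 443, timeout_s: float = 1.0) -> bool:
    """True if a fresh TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def longest_outage(samples: List[Tuple[float, bool]]) -> float:
    """Longest gap in seconds from the first failed sample to the next good one.

    An outage still in progress at the end of the samples is not counted.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    return longest


def sample_for(duration_s: float, every_s: float = 0.5) -> List[Tuple[float, bool]]:
    """Poll can_connect() for duration_s seconds; pull the cable while this runs."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), can_connect()))
        time.sleep(every_s)
    return samples
```

Running print(longest_outage(sample_for(60))) and unplugging the primary a few seconds in gives the failover time to within roughly the polling interval plus the connection timeout.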

It's not glamorous and nobody will ever notice it working, which is rather the point. The day the fibre genuinely fails and I find out from a graph rather than from the family asking why the wifi's broken, the afternoon spent on probe targets will have paid for itself.