Ramblings of an aging IT geek

Making multi-WAN failover do the one thing it promises

Why a "working" multi-WAN setup quietly didn't fail over until I tested it properly, and the health-check and gateway-monitoring changes that fixed it.


I had multi-WAN failover configured for about a year before I learned it didn't work. Two connections, a fibre line and a 4G backup, a router that proudly listed both as gateways, a smug little feeling that I was covered. Then the fibre actually went down one evening, and so did everything else, and the backup sat there doing nothing whilst I rebooted things in the dark like an amateur.

The lesson, which I keep having to relearn about anything labelled "failover": configured is not the same as tested. A failover path you've never deliberately broken is a path you don't have.

Why it didn't fail over

The default failover logic was watching for link state. As far as the router was concerned, the fibre WAN port still had a perfectly good link to the ONT, because it did. The ONT was up. The PPPoE session was up. It was everything past that, somewhere upstream, that had died. Link-state monitoring sees a cable, not the internet, and a cable was exactly what I had.

What you actually want is a health check that proves reachability, not link. Ping something reliable out of each WAN, repeatedly, and treat sustained loss as "this connection is down" regardless of what the physical port thinks.

# per-gateway monitoring, conceptually
monitor wan_fibre target 1.1.1.1 interval 1s loss-threshold 20%
monitor wan_4g    target 9.9.9.9 interval 1s loss-threshold 20%
# route via wan_fibre while healthy; withdraw on threshold breach

The targets matter. Don't ping your ISP's own gateway, because that can stay up whilst their upstream is on fire, which is precisely the failure I'd hit. Pick a couple of stable public resolvers on different networks so one provider's bad day doesn't fool both checks. And ping out of the specific interface, not just "to the internet", or you'll happily health-check WAN1 over WAN2 and learn nothing.
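To make the idea concrete, here is a minimal sketch in Python of a per-interface health check. It is an illustration under assumptions, not what any particular router runs: it assumes a Linux box with the iputils `ping` (whose `-I`, `-c` and `-W` flags bind the interface, set the probe count and set the reply timeout), the interface name `eth0` is hypothetical, and the ten-probe window and "at or above 20% loss means down" rule are my own reading of the conceptual config above.

```python
import subprocess
from collections import deque


def ping_command(interface: str, target: str, timeout_s: int = 1) -> list[str]:
    """Build a single-probe ping bound to one interface (Linux iputils flags)."""
    return ["ping", "-I", interface, "-c", "1", "-W", str(timeout_s), target]


def probe(interface: str, target: str) -> bool:
    """Send one probe; True means a reply came back via that interface."""
    result = subprocess.run(
        ping_command(interface, target),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class LossWindow:
    """Track the last `size` probe results and report the loss fraction."""

    def __init__(self, size: int = 10, loss_threshold: float = 0.2):
        self.results = deque(maxlen=size)
        self.loss_threshold = loss_threshold

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    def loss(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def is_down(self) -> bool:
        # Only judge a full window, so one early miss is not treated as an outage.
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.loss() >= self.loss_threshold


# Hypothetical usage, once a second per WAN:
#   fibre = LossWindow()
#   fibre.record(probe("eth0", "1.1.1.1"))
#   if fibre.is_down(): withdraw the route via that gateway
```

The point of keeping a window rather than reacting to a single lost probe is the "sustained loss" part: one dropped ping is noise, a fifth of them going missing is a connection you shouldn't be routing over.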

The bits that bite afterwards

Getting the detection right is half of it. The other half is what happens at the moment of switching.

Existing connections drop. Your public IP changes, so anything stateful (SSH sessions, long downloads, a VPN tunnel) gets cut and has to re-establish, because the far end is still talking to an address you no longer have. That's unavoidable on a NAT failover; just know it's coming and don't panic when the failover "breaks" your live session. It didn't; it did its job.

DNS can pin you to the dead path. If a client has cached an answer, or is still holding a connection that was set up over the failed link, it can keep timing out after the switch. Give it a nudge: flush the cache or restart the client.

And watch the flap: if a link is marginal rather than dead, a tight threshold will see it bounce up and down every few seconds, which is worse than either state. I added a short hold-down so a recovered link has to stay healthy for a bit before traffic moves back.
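The hold-down logic is easier to reason about written out. This is a small sketch, assuming a 30-second hold-down (a figure I've picked for illustration, not one from any router's defaults) and a hypothetical `HoldDown` class fed one health verdict per check. It fails away from the primary immediately, but only fails back once the link has been continuously healthy for the full hold-down.

```python
class HoldDown:
    """Decide whether the primary WAN may carry traffic, with a recovery hold-down."""

    def __init__(self, hold_down_s: float = 30.0):
        self.hold_down_s = hold_down_s
        self.active = True          # primary currently carries traffic
        self.healthy_since = None   # when the current healthy streak began

    def update(self, healthy: bool, now: float) -> bool:
        """Feed one health verdict; return True if the primary should be used."""
        if not healthy:
            # Fail away at once and throw away any recovery streak.
            self.active = False
            self.healthy_since = None
        elif not self.active:
            # Recovering: start or continue the healthy streak.
            if self.healthy_since is None:
                self.healthy_since = now
            if now - self.healthy_since >= self.hold_down_s:
                self.active = True
        return self.active
```

The asymmetry is deliberate: leaving a bad link should be quick, whereas returning to one that has only just recovered can afford to wait. A single unhealthy result during the hold-down resets the clock, which is what stops a marginal link from dragging traffic back and forth.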

Then, and this is the part everyone skips, I tested it. I physically unplugged the fibre at the wall whilst a continuous ping ran. A handful of dropped packets, then traffic carried on over 4G. I plugged it back in, watched it sit out the hold-down, then return. That ten-minute test was the only thing that turned "I think I have failover" into "I have failover".
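If you'd rather have a number than "a handful of dropped packets", the outage can be measured from the ping results. A minimal sketch, assuming you've already collected each probe as a `(timestamp_in_seconds, got_reply)` pair; the `longest_outage` helper is hypothetical and the collection step is left out.

```python
def longest_outage(samples: list[tuple[float, bool]]) -> float:
    """Longest gap in seconds between two replies with at least one loss between them."""
    longest = 0.0
    last_ok = None          # timestamp of the most recent successful reply
    lost_since_ok = False   # whether any probe has been lost since then
    for timestamp, ok in samples:
        if ok:
            if last_ok is not None and lost_since_ok:
                longest = max(longest, timestamp - last_ok)
            last_ok = timestamp
            lost_since_ok = False
        else:
            lost_since_ok = True
    return longest
```

Note that it only counts outages that ended: losses before the first reply, or a run of losses that never recovers, return nothing useful, and the latter is a failed test rather than a long failover.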

It's not glamorous and nobody will ever see it work except me, on the one evening a year it matters. But that's rather the definition of infrastructure: invisible when it's doing its job, and only ever noticed when you skipped the testing.