building multi-WAN failover that fails over when you're not watching

I'll lead with the conclusion, because it took me three attempts to earn it: most multi-WAN failover setups don't fail over. They fail over when you yank the cable, because that drops the link and the router notices immediately. They do nothing useful when the real-world outage happens, which is almost never a yanked cable and almost always an upstream that's technically "up" but dropping packets, returning garbage, or sitting at 90% loss while the interface stays cheerfully green.

I have two WANs at home: a primary fibre line and a slower secondary on a different physical medium and a different provider, which is the part that matters. Two connections from the same provider down the same duct fail together, and then you've bought yourself a more expensive single point of failure. The goal was simple to state: if the primary stops being usable, traffic should be on the secondary within seconds, and it should come back to the primary on its own once the primary is genuinely healthy again. Without me. At 3am. While I'm asleep.

why link state is the wrong signal

The first version watched link state. If the interface goes down, switch. This is the version everyone builds first and it's almost useless, because the failure mode I actually experience is the primary staying "up" while delivering nothing. The cable's connected, the PPPoE session is alive, the router is happy, and not a single packet is reaching the wider internet. Link state says everything is fine. Link state is lying.

What you actually need is reachability: can I get an answer back from somewhere real, out on the far side of this WAN, right now? That's an active health check, not a passive interface flag.
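To make "active health check" concrete, here is a minimal sketch in Python rather than in RouterOS script. It is not what runs on my router: it swaps the ping for a TCP connect, because sending ICMP from Python needs raw sockets, and it assumes policy routing already sends traffic from `source_ip` out of the WAN being tested. The name `reachable_via` and its parameters are made up for illustration.

```python
import socket


def reachable_via(source_ip: str, target: tuple[str, int], timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to `target` succeeds from `source_ip`.

    Assumption: source-based (policy) routing maps `source_ip` to the WAN
    under test, so binding the local address steers the probe down that link.
    """
    try:
        # source_address binds the local end of the connection before connecting.
        with socket.create_connection(target, timeout=timeout,
                                      source_address=(source_ip, 0)):
            return True
    except OSError:
        # Timeouts, refusals, unreachable networks and bind failures all
        # count as "no answer came back", which is the only thing we ask.
        return False
```

The point is what the function ignores: it never looks at the interface state, only at whether something on the far side answered.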

health checks that don't lie

The second version pinged through each WAN to a target on the public internet. Better, but still naive, because a single ping target is a single point of failure of its own. The day that target had a wobble, my router decided my perfectly healthy primary was down and flapped traffic onto the slow line for no reason. A failover system that triggers on false positives is arguably worse than no failover at all, because it introduces outages instead of preventing them.

So the working version does a few things that, together, behave themselves:

It health-checks each WAN against multiple independent, reliable targets, bound to that specific WAN's source so the probe is forced down the link being tested. A WAN is only declared down when several targets fail together, which distinguishes "my link is broken" from "one host on the internet is having a bad day."

It uses a small number of consecutive failures before switching, and a larger number of consecutive successes before switching back. Asymmetric on purpose. You want to react quickly to a real outage but you do not want to flap back to a link that's only intermittently recovering. Quick to leave, slow to return.

It distinguishes loss from latency. A link that's up but at 40% packet loss is, for most things, worse than the slower backup. The check fails the primary on sustained loss, not just on total death.
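The multi-target and loss rules can be sketched as a single per-round verdict. This is a Python model of the decision, not the RouterOS script, and the thresholds (`max_loss=0.2`, `min_bad_targets=2`) are placeholder assumptions rather than the values I run. `ProbeResult` and `round_is_healthy` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    target: str    # label for the external host that was probed
    sent: int      # probes sent through this WAN in the round
    received: int  # replies that came back

    @property
    def loss(self) -> float:
        # Treat "nothing sent" as total loss rather than dividing by zero.
        return 1.0 if self.sent == 0 else 1.0 - self.received / self.sent


def round_is_healthy(results: list[ProbeResult],
                     max_loss: float = 0.2,
                     min_bad_targets: int = 2) -> bool:
    """One check round for one WAN.

    A target counts as bad on sustained loss, not only on total silence.
    The WAN counts as bad only when several targets are bad together, so
    one misbehaving host cannot condemn a healthy link.
    """
    bad = [r for r in results if r.loss > max_loss]
    return len(bad) < min_bad_targets
```

With three targets, one dead host leaves the round healthy, while 40% loss to all three fails it even though every target is still technically answering.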

In RouterOS terms the heart of it is a pair of default routes with different distances, plus a recurring check that pings out through each WAN and adjusts a route distance based on the result:

/ip route
add dst-address=0.0.0.0/0 gateway=wan1-gw distance=1 check-gateway=ping comment="primary"
add dst-address=0.0.0.0/0 gateway=wan2-gw distance=2 check-gateway=ping comment="backup"

check-gateway=ping gets you the basic version for free, but it pings the gateway, which is too local to catch an upstream-but-dead failure. The real logic lives in a scheduled script that probes proper external targets bound to each WAN's source address, counts failures, and only then changes the route distances or disables the primary route. The route table is the mechanism; the script is the brain that decides when to touch it.
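The split between brain and mechanism can be sketched like this. It is a Python illustration, not my actual script: `plan_route_change` is a hypothetical helper that only produces the command text, and pushing the primary to one above the backup's distance is one assumed way of demoting it (disabling the route is the other option mentioned above).

```python
from typing import Optional

PRIMARY_DISTANCE = 1  # matches the "primary" route above
BACKUP_DISTANCE = 2   # matches the "backup" route above


def plan_route_change(current_distance: int, primary_healthy: bool) -> Optional[str]:
    """Return the RouterOS command to apply, or None if nothing should change.

    A healthy primary keeps distance 1. An unhealthy primary is pushed to one
    above the backup so the backup route wins without deleting anything.
    """
    wanted = PRIMARY_DISTANCE if primary_healthy else BACKUP_DISTANCE + 1
    if wanted == current_distance:
        return None  # already in the right state: leave the route table alone
    return f'/ip route set [find comment="primary"] distance={wanted}'
```

Returning `None` when nothing needs to change matters more than it looks: the route table only gets touched on a real state change, not on every check interval.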

the bit everyone forgets: NAT and conntrack

Here's the gotcha that ate an evening. When you fail over, your source IP changes, because you're now egressing a different WAN. Every existing connection was NATed out of the old WAN's address, and the far end has no idea your IP just changed. Those connections don't migrate. They die. A long-lived SSH session, a download, a video call: they all drop at the moment of failover and have to re-establish.

There's no magic that fixes this; you can't keep a TCP connection alive across a source IP change. What you can do is make sure new connections immediately use the right path and that the conntrack table doesn't keep stubbornly routing replies back out a dead link. So part of the failover action is flushing the relevant connection-tracking entries, so nothing lingers trying to use a WAN that's gone. Accept that a failover is a brief, noticeable hiccup, not a seamless glide, and design the rest of your expectations around that.
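"Relevant" is doing real work in that sentence, so here is a toy Python model of the selection. On the router the real table lives under `/ip firewall connection`. The `TrackedConn` fields and the example addresses (documentation ranges) are invented for illustration and are not RouterOS's actual property names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackedConn:
    proto: str
    dst: str         # remote endpoint as "ip:port"
    nat_src_ip: str  # public address this connection was NATed out of


def split_on_failover(table: list[TrackedConn],
                      dead_wan_ip: str) -> tuple[list[TrackedConn], list[TrackedConn]]:
    """Split the table into (stale, keep) when the WAN owning dead_wan_ip fails.

    Stale entries were NATed out of the dead WAN's address. The far end will
    never accept them from a new source IP, so they are flushed rather than
    left to time out while pointing at a dead link.
    """
    stale = [c for c in table if c.nat_src_ip == dead_wan_ip]
    keep = [c for c in table if c.nat_src_ip != dead_wan_ip]
    return stale, keep


table = [
    TrackedConn("tcp", "192.0.2.50:22", "203.0.113.10"),    # SSH via primary
    TrackedConn("udp", "192.0.2.60:3478", "203.0.113.10"),  # call via primary
    TrackedConn("tcp", "192.0.2.70:443", "198.51.100.20"),  # already on backup
]
stale, keep = split_on_failover(table, dead_wan_ip="203.0.113.10")
```

Here the SSH session and the call land in `stale` and get dropped. The connection that was already on the backup is left alone.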

testing it like you mean it

This is the part people skip and it's the only part that proves anything. Pulling the cable is not a test, because it only ever exercises the easy failure. I tested the hard ones:

I blocked the health-check targets at the primary's far edge so the link stayed up but probes failed, simulating the upstream-but-dead case. Failover triggered correctly, on probe failure, not link state. That's the whole point, proven.

I introduced deliberate packet loss on the primary and confirmed it failed over on sustained loss rather than waiting for total death.

I flapped the primary up and down rapidly and confirmed the asymmetric counters stopped it from thrashing traffic back and forth. It rode out the noise and only returned once the primary was solidly back.
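The flap test is easy to reproduce on paper. Below is a small Python model of the quick-to-leave, slow-to-return counters, replayed against a synthetic trace. The counts (3 failures to leave, 10 successes to return) and the names `FailoverState` and `count_switches` are assumptions for illustration, not my production values.

```python
from typing import Iterable, Optional


class FailoverState:
    """Quick to leave the primary, slow to return to it."""

    def __init__(self, fail_after: int = 3, recover_after: int = 10) -> None:
        self.fail_after = fail_after
        self.recover_after = recover_after
        self.on_primary = True
        self._streak = 0  # consecutive rounds that contradict the current state

    def update(self, round_healthy: bool) -> bool:
        """Feed one check round for the primary; return True if traffic is on it."""
        contradicts = round_healthy != self.on_primary
        self._streak = self._streak + 1 if contradicts else 0
        needed = self.fail_after if self.on_primary else self.recover_after
        if self._streak >= needed:
            self.on_primary = not self.on_primary
            self._streak = 0
        return self.on_primary


def count_switches(trace: Iterable[bool], state: Optional[FailoverState] = None) -> int:
    """Replay a trace of healthy/unhealthy rounds and count path changes."""
    if state is None:
        state = FailoverState()
    switches = 0
    previous = state.on_primary
    for healthy in trace:
        current = state.update(healthy)
        if current != previous:
            switches += 1
        previous = current
    return switches


# A real outage, then five short false recoveries, then a solid recovery.
flappy = [False] * 3 + ([True] * 4 + [False] * 3) * 5 + [True] * 10
asymmetric = count_switches(flappy)                       # 2
symmetric = count_switches(flappy, FailoverState(3, 3))   # 12
```

On this trace the asymmetric counters switch twice, once out and once back. A symmetric 3/3 setup bounces twelve times across the same sequence.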

And I ran the whole thing for a fortnight before trusting it, watching the logs for false positives, because a failover system you haven't watched misbehave is one you don't actually understand yet.

was it worth it

Yes, and more than I expected. The primary has had two genuine wobbles since I finished this, both of the up-but-dead variety that the naive version would have slept straight through, and both times the house carried on over the backup with nothing more than a brief stutter and a line in the log. Nobody else noticed, which is the highest compliment you can pay infrastructure.

The honest summary: link-state failover is theatre, reachability-based failover with multiple targets and asymmetric counters is the real thing, and the difference between them is precisely the difference between a setup that works in the demo and a setup that works at 3am when nobody's looking. The second one is the only one worth building.