Making multi-WAN fail over before anyone notices

I had two internet connections for the better part of a year and, embarrassingly, no working failover between them. I had configured failover, which is not the same thing. The day the primary line wobbled, the router cheerfully kept routing everything down the dead link because, as far as it was concerned, the interface was up. Up is not the same as working, and that gap is where most "multi-WAN" setups quietly fail.

So over the break I rebuilt it properly. The goal was simple and unsexy: when the primary line stops carrying traffic, switch to the backup before anyone in the house has time to say "is the internet broken?"

Up is not the same as reachable

The core mistake in my first attempt was trusting link state. A cable modem or fibre ONT will happily present an up interface while the connection upstream of it is dead. The carrier's gear is fine; your path to the actual internet isn't. Link state tells you about the cable to the box on the wall, nothing more.

So the failover has to be driven by an active health check, not by the interface. Mine pings a couple of stable, well-distributed targets out of each WAN, and marks a link down only when several probes in a row fail across more than one target. Multiple targets matter: if you health-check a single IP and that one host has a bad day, you'll flap your WAN for no reason.
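
To make that rule concrete, here is a minimal Python sketch of the decision logic only. It is not how OPNsense implements monitoring; the class name `LinkHealth`, the parameters `fail_streak` and `min_targets_failing`, and the choice of targets are all assumptions made up for illustration. The idea is just that a link counts as down when at least two targets have each missed several probes in a row.

```python
from collections import defaultdict


class LinkHealth:
    """Tracks consecutive probe failures per target for one WAN link.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, targets, fail_streak=5, min_targets_failing=2):
        self.targets = list(targets)
        self.fail_streak = fail_streak
        self.min_targets_failing = min_targets_failing
        self.streaks = defaultdict(int)  # target -> consecutive failures

    def record(self, target, ok):
        """Record one probe result; a success resets that target's streak."""
        if ok:
            self.streaks[target] = 0
        else:
            self.streaks[target] += 1

    def is_down(self):
        """Down only when enough distinct targets are failing in a row."""
        failing = sum(
            1 for t in self.targets if self.streaks[t] >= self.fail_streak
        )
        return failing >= self.min_targets_failing


wan1 = LinkHealth(["1.1.1.1", "9.9.9.9"], fail_streak=3)

# One target having a bad day is not enough to mark the link down.
for _ in range(10):
    wan1.record("1.1.1.1", ok=False)
    wan1.record("9.9.9.9", ok=True)
single_target_verdict = wan1.is_down()

# Both targets failing several probes in a row is.
for _ in range(3):
    wan1.record("9.9.9.9", ok=False)
both_targets_verdict = wan1.is_down()

print(single_target_verdict, both_targets_verdict)
```

Resetting the streak on any success is what provides the debounce: a single lost probe never accumulates into a failover.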

I'm running this on an OPNsense box, and its built-in gateway monitoring does most of the heavy lifting once you stop trusting the defaults. The settings that actually mattered:

# Per-gateway monitoring, conceptually
wan1.monitor_ip = 1.1.1.1     # WAN1 probes this
wan2.monitor_ip = 9.9.9.9     # WAN2 probes a different target
probe_interval  = 1s
loss_threshold  = 20%         # mark down above this sustained loss
latency_high    = 500ms       # and flag high latency separately
down_after      = 5 failures  # debounce so a blip doesn't flap

The two settings that earn their keep are the debounce and the per-gateway monitor IP. Debounce stops a one-second hiccup triggering a failover. Separate monitor IPs mean one unlucky host can't fail both gateways at once, and checking more than one target per link, as above, is what keeps that host from declaring a perfectly good link dead.

Failover is the easy half; failback is the trap

Switching to the backup is straightforward once the health check is honest. The part that bit me was failback. When the primary recovers, you do not want to slam everything back the instant the first probe succeeds, because the line that just came back is frequently still flaky for a minute or two. So failback gets its own, longer dwell: the primary has to stay healthy for a sustained window before traffic returns to it. The thresholds are deliberately asymmetric: quick to leave a bad link, slow to trust it again.
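
The asymmetry is easier to see as a tiny state machine. The following Python sketch is an illustration under assumptions, not OPNsense's implementation: `FailoverController`, `fail_after` and `recover_after` are hypothetical names, and the dwell values are placeholders. It leaves the primary after a few bad checks and returns only after a much longer unbroken run of good ones.

```python
class FailoverController:
    """Chooses the active WAN with asymmetric dwell times.

    Hypothetical sketch: the names and the default dwell values are
    assumptions, not OPNsense settings.
    """

    def __init__(self, fail_after=5, recover_after=120):
        self.fail_after = fail_after        # bad primary checks before leaving
        self.recover_after = recover_after  # good primary checks before returning
        self.active = "primary"
        self._bad = 0
        self._good = 0

    def tick(self, primary_healthy):
        """Feed one health result for the primary; returns the active link."""
        if primary_healthy:
            self._good += 1
            self._bad = 0
        else:
            self._bad += 1
            self._good = 0  # any failure restarts the failback window

        if self.active == "primary" and self._bad >= self.fail_after:
            self.active = "backup"
        elif self.active == "backup" and self._good >= self.recover_after:
            self.active = "primary"
        return self.active


ctl = FailoverController(fail_after=3, recover_after=10)

# Three bad checks in a row: leave the primary quickly.
after_outage = [ctl.tick(False) for _ in range(3)][-1]

# Nine good checks, one blip, nine more: never ten in a row, so stay put.
flaky = [True] * 9 + [False] + [True] * 9
during_flaky = [ctl.tick(ok) for ok in flaky][-1]

# The tenth consecutive good check finally earns the primary its traffic back.
after_stable = ctl.tick(True)

print(after_outage, during_flaky, after_stable)
```

Because any failed check resets the good-run counter, a primary that comes back flaky keeps restarting its own probation instead of bouncing traffic back and forth.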

There's also the matter of existing connections. A long-lived SSH session or a download that was pinned to WAN1 doesn't magically move; it dies when WAN1 dies. That's fine, because it reconnects over WAN2. What you're protecting is new connections and the experience of the household, not the sanctity of any individual TCP stream. I set policy routing so that latency-sensitive things (video calls) prefer the better link, and bulk traffic can ride whichever is up.
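
As a rough picture of that policy, here is a small Python sketch of the choice made for each new connection. It is purely illustrative: the `Gateway` type, the `pick_gateway` function, the traffic class names and the latency figures are all invented for the example, and "better" is assumed here to mean lowest measured latency.

```python
from dataclasses import dataclass


@dataclass
class Gateway:
    name: str
    healthy: bool
    latency_ms: float


def pick_gateway(traffic_class, gateways):
    """Return the gateway name for a new connection, or None if all are down.

    "interactive" traffic takes the lowest-latency healthy link; anything
    else takes the first healthy link in the configured order.
    """
    healthy = [g for g in gateways if g.healthy]
    if not healthy:
        return None
    if traffic_class == "interactive":
        return min(healthy, key=lambda g: g.latency_ms).name
    return healthy[0].name


# Made-up numbers, only to exercise the function.
links = [
    Gateway("WAN1", healthy=True, latency_ms=38.0),
    Gateway("WAN2", healthy=True, latency_ms=12.0),
]
call_link = pick_gateway("interactive", links)  # lower latency wins
bulk_link = pick_gateway("bulk", links)         # first healthy link

links[0].healthy = False
bulk_after_failure = pick_gateway("bulk", links)

print(call_link, bulk_link, bulk_after_failure)
```

The decision only applies to new connections; nothing here tries to migrate a stream that was already pinned to a link.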

Testing it for real

The single most useful thing I did was test the failure deliberately, repeatedly, while watching. Not "unplug it and see", but actually pulling the primary at known times and timing the cutover with a continuous ping running in another window:

ping -i 0.2 1.1.1.1
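
If you would rather have a number than an impression, the cutover can be computed from timestamped probe results. The Python sketch below assumes you have already turned the ping run into a list of `(timestamp, replied)` pairs; that parsing step is left out, `longest_outage` is a hypothetical helper, and the data is synthetic rather than anything I measured.

```python
def longest_outage(samples):
    """Return the longest gap in seconds between a lost probe and the next reply.

    `samples` is a time-ordered list of (timestamp_seconds, replied) pairs.
    An outage still in progress at the end is measured up to the last sample.
    """
    longest = 0.0
    outage_start = None
    for ts, replied in samples:
        if not replied and outage_start is None:
            outage_start = ts  # first lost probe of a new outage
        elif replied and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    if outage_start is not None and samples:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


# Synthetic data: one probe every 0.2 s, replies lost from t=2.0 until t=4.4.
samples = [(i / 5, not (10 <= i < 22)) for i in range(40)]
print(f"longest outage: {longest_outage(samples):.1f}s")
```

Measuring from the first lost probe to the first reply afterwards slightly overstates the gap, by up to one probe interval, which is the safe direction to be wrong in.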

With the config above, a hard primary failure costs me roughly two to three seconds of dropped packets before the backup takes over. A video call survives it with a brief stutter. Crucially, nobody has to do anything, and nobody phones me. The first time it failed over for real, mid-January as it turned out, I only knew because I check the logs, not because anything broke.

That's the bar I was aiming for. Failover that you find out about from a graph, not from a tap on the shoulder. The config was the small part. Deciding to actually distrust "up", and then sitting there yanking cables until I believed it, was the part that made it real.