Ramblings of an aging IT geek
networking

Multi-WAN Failover That Actually Fails Over

Building a second WAN with real, tested failover on OPNsense, and the three quiet ways my first attempt would have failed me when it mattered.

Network cables in a patch panel

I have run a "failover" setup before that did not fail over. The backup link was there, the config looked right, and the one time the primary actually went down, the whole house lost the internet anyway. That experience taught me the only thing worth saying about redundancy up front: if you have not watched it fail and recover, you do not have failover, you have a second bill.

So this time I did it properly, on OPNsense, with a fibre primary and a 5G modem as backup, and I tested it until I trusted it.

The naive version, and why it lies

The obvious setup is a gateway group. You add both WAN gateways, set the fibre to Tier 1 and the 5G to Tier 2, and tell OPNsense to use the group instead of a single gateway in the LAN firewall rules that send traffic out. On paper, when Tier 1 dies, traffic shifts to Tier 2.
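As a rough mental model of what a gateway group does, here is a sketch in Python. It is not OPNsense's actual code, and the `Gateway` class, the `pick_gateway` function and the gateway names are hypothetical, made up for illustration. The group hands traffic to the lowest-numbered tier that still has a member marked up:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gateway:
    name: str
    tier: int   # lower number = preferred
    up: bool    # whatever the router currently believes about this gateway


def pick_gateway(group: list[Gateway]) -> Optional[Gateway]:
    """Return the preferred live gateway, or None if every member is down."""
    live = [gw for gw in group if gw.up]
    if not live:
        return None
    return min(live, key=lambda gw: gw.tier)


group = [Gateway("FIBRE", tier=1, up=True), Gateway("BACKUP_5G", tier=2, up=True)]
print(pick_gateway(group).name)   # FIBRE

group[0].up = False               # Tier 1 is marked down
print(pick_gateway(group).name)   # BACKUP_5G
```

The selection logic is trivial. Everything hinges on how that `up` flag gets set.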

This works in exactly one scenario: the cable falls out. The router sees the interface go down, the gateway goes down with it, and failover triggers cleanly.

The trouble is that in the real world links rarely die that obligingly. The far more common failure is the link staying up while the path beyond it rots. The fibre ONT still shows a carrier, the interface is up, the gateway is "online", and yet nothing actually reaches the internet because something upstream at the provider has fallen over. A naive setup is blind to this, because it only watches the interface, not the path.

Monitoring the path, not the wire

The fix is to give each gateway a monitor IP, a host out on the wider internet that OPNsense pings continuously. Point the fibre gateway's monitor at one well-known public resolver and the 5G gateway's monitor at a different one, so a single distant outage cannot mark both links down at once.

Now the decision is based on reachability, not link state. If the fibre interface is up but the pings stop returning, the measured loss and latency climb past their thresholds, the gateway is marked down, and the group fails over. This single change is the difference between failover that works and failover that only works in the demo.

A rack of networking equipment in a datacentre

I tuned the thresholds to be patient rather than twitchy. A backup like a metered 5G modem is not somewhere you want to live by accident, so I made it take a sustained problem, not a single dropped packet, to trigger a switch.

# Gateway monitoring, roughly what I settled on
Monitor IP:        9.9.9.9 (fibre) / 1.1.1.1 (5G)
Probe interval:    1s
Loss threshold:    20% (high)
Latency threshold: 500ms (high)
Time period:       60s

Loss and latency are judged across that sixty-second window rather than probe by probe, so it takes genuinely bad, sustained behaviour before it acts. That sounds slow, but a flapping WAN that bounces between two links every few seconds is far worse than one that waits and then commits.
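To make the effect of a window concrete, here is a minimal simulation in Python. It is a toy, not how OPNsense's monitor is implemented. `PathMonitor` and its parameter names are hypothetical, and it assumes one probe per second so that 60 probes stand in for the 60s time period. It averages loss and latency over the most recent probes and compares them to the thresholds listed above:

```python
from collections import deque
from typing import Optional


class PathMonitor:
    """Toy model: judge a gateway on a sliding window of probe results."""

    def __init__(self, window: int = 60, loss_high: float = 0.20,
                 latency_high_ms: float = 500.0) -> None:
        # Only the newest `window` results are kept; older ones fall off.
        self.results: deque = deque(maxlen=window)
        self.loss_high = loss_high
        self.latency_high_ms = latency_high_ms

    def record(self, rtt_ms: Optional[float]) -> None:
        """Store one probe result; None means the reply never came back."""
        self.results.append(rtt_ms)

    def is_down(self) -> bool:
        if not self.results:
            return False
        replies = [r for r in self.results if r is not None]
        lost = len(self.results) - len(replies)
        loss = lost / len(self.results)
        avg_latency = sum(replies) / len(replies) if replies else float("inf")
        return loss >= self.loss_high or avg_latency >= self.latency_high_ms


mon = PathMonitor()
for _ in range(60):
    mon.record(12.0)      # a healthy minute
mon.record(None)          # one dropped packet
print(mon.is_down())      # False

for _ in range(30):
    mon.record(None)      # half a minute of nothing coming back
print(mon.is_down())      # True
```

A single lost probe barely moves the average, which is the point: only a sustained problem is allowed to push traffic onto the metered link.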

The three quiet failures

Once the basic group worked, I went looking for the ways it would still let me down. There were three, and they are the interesting part.

The first is DNS. If your LAN clients are pointed at a resolver that only answers over the primary link, failover at the routing layer does nothing, because name resolution still dies. I moved DNS onto the router with Unbound, so the resolver itself follows the failover and clients never know which WAN answered.

The second is stateful connections. When a link drops, every connection that was riding it is dead, but the firewall's state table still thinks they are alive. A long-lived SSH session or a VPN tunnel will hang rather than reconnect, because its packets are being sent into a void the state table believes is fine. I set OPNsense to flush the states tied to the dead gateway on failover, which forces those connections to give up and re-establish over the surviving link. Painful for that one session, correct for everything that follows.
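Conceptually, flushing states is just a filter over the state table. The sketch below is a Python illustration only. The `State` record, the `flush_states` function, the gateway names and the addresses are all invented for the example, and a real firewall tracks far more per connection:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    src: str       # LAN client
    dst: str       # remote host
    gateway: str   # WAN the connection was established over


def flush_states(table: list[State], dead_gateway: str) -> list[State]:
    """Drop every tracked connection that was pinned to the failed gateway."""
    return [s for s in table if s.gateway != dead_gateway]


table = [
    State("192.168.1.20:51122", "203.0.113.10:22", "FIBRE"),    # long-lived SSH
    State("192.168.1.31:40210", "198.51.100.7:443", "FIBRE"),
    State("192.168.1.40:53811", "198.51.100.9:443", "BACKUP_5G"),
]

table = flush_states(table, dead_gateway="FIBRE")
print(len(table))   # 1
```

Once the stale entries are gone, the next packet from the SSH client no longer matches an existing state, so it is evaluated afresh and goes out via whichever gateway the group currently prefers.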

The third is the return path. If you run anything inbound, a VPN endpoint, a server you reach from outside, the reply has to leave by the same WAN the request arrived on, or the far end sees an asymmetric route and drops it. Reply-to on the WAN rules handles this, and OPNsense mostly sets it for you, but it is worth checking rather than assuming.
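The rule reply-to enforces can be stated in two lines. This Python sketch is purely illustrative, with hypothetical function and gateway names, and compares where a reply leaves with and without it:

```python
def reply_wan(arrived_on: str, default_wan: str, reply_to: bool) -> str:
    """Pick the WAN a reply leaves by."""
    # With reply-to, the reply is pinned to the WAN the request came in on.
    # Without it, the reply simply follows the current default route.
    return arrived_on if reply_to else default_wan


def is_symmetric(arrived_on: str, default_wan: str, reply_to: bool) -> bool:
    """True when the reply leaves by the same WAN the request arrived on."""
    return reply_wan(arrived_on, default_wan, reply_to) == arrived_on


# A VPN client connects in via the 5G address while fibre is the default route.
print(is_symmetric("BACKUP_5G", "FIBRE", reply_to=False))   # False
print(is_symmetric("BACKUP_5G", "FIBRE", reply_to=True))    # True
```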

There is a fourth that sits underneath all of them, which is that your two links should not share a single point of failure you have forgotten about. Mine did, briefly. Both the fibre router and the 5G modem were plugged into the same little switch, which was plugged into the same UPS. That was fine until the day I needed to reboot the switch and discovered I had hung both WANs off one box and one power feed. Redundancy has a habit of quietly collapsing back to one if you stop looking at it, so I now trace the dependency chain all the way to the wall.
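Tracing the chain is easy to do on paper, but it can also be written down and checked. The Python sketch below assumes a hand-maintained map of what each device depends on. The device names are hypothetical, loosely modelled on the mistake described above, and every dependency is treated as required (no redundant feeds). Anything that appears upstream of both links is a shared single point of failure:

```python
def upstream(node: str, feeds: dict[str, list[str]]) -> set[str]:
    """Everything `node` depends on, directly or indirectly."""
    seen: set[str] = set()
    stack = list(feeds.get(node, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(feeds.get(dep, []))
    return seen


# Hypothetical wiring: each key depends on the devices in its list.
depends_on = {
    "fibre_router": ["small_switch"],
    "5g_modem": ["small_switch"],
    "small_switch": ["ups"],
    "ups": ["wall_socket"],
}

shared = upstream("fibre_router", depends_on) & upstream("5g_modem", depends_on)
print(sorted(shared))   # ['small_switch', 'ups', 'wall_socket']
```

An empty result is the goal. In practice something like the wall supply will usually remain, and the useful part is seeing the list and deciding which entries you are willing to live with.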

Testing it like you mean it

The whole point of this post is that none of the above counts until you have proven it. So I tested it the only honest way: by physically pulling the fibre while watching a continuous ping from a laptop on the LAN.

The result was a handful of dropped packets, a pause of around a minute as the thresholds did their job, and then the stream resumed over 5G without me touching anything. Plugging the fibre back in failed back the same way, a little more eagerly because I let recovery trigger faster than failure.
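If you save the ping output to a file, the outage can be measured rather than eyeballed. The Python sketch below assumes Linux-style `ping` output at the default one-second interval, so each missing `icmp_seq` number is roughly one second offline. The `longest_gap` function is a hypothetical helper, and the sample lines are made up for illustration, not taken from my log:

```python
import re

# Only count real replies; error lines can also carry an icmp_seq number.
REPLY = re.compile(r"bytes from .*icmp_seq=(\d+)")


def longest_gap(log: str) -> int:
    """Largest run of consecutive missing icmp_seq numbers in a ping log."""
    seqs = [int(m.group(1)) for m in REPLY.finditer(log)]
    gaps = [b - a - 1 for a, b in zip(seqs, seqs[1:])]
    return max(gaps, default=0)


# Invented sample: replies 3, 4 and 5 never arrived.
sample = """\
64 bytes from 9.9.9.9: icmp_seq=1 ttl=57 time=11.8 ms
64 bytes from 9.9.9.9: icmp_seq=2 ttl=57 time=12.1 ms
64 bytes from 9.9.9.9: icmp_seq=6 ttl=55 time=48.3 ms
64 bytes from 9.9.9.9: icmp_seq=7 ttl=55 time=47.9 ms
"""
print(longest_gap(sample))   # 3
```

It ignores sequence-number wraparound on very long runs, which is fine for a test that lasts a few minutes.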

That ping log is the only documentation I trust. Configs lie, dashboards lie, "online" lies. A continuous ping through a real outage tells you the truth, and now when the fibre genuinely drops one wet Tuesday, I will not be the last to know. The house will just stay online, and I will find out from a log rather than from everyone shouting up the stairs.