Ramblings of an aging IT geek
networking

two connections, one that works, and the gap between them

Setting up a second internet connection for failover and discovering that the hard part is not the routing but detecting that the primary link has actually died.

Patch cables running into a network rack

I added a second internet connection so that when the primary went down, which it does, the house would limp along on the backup without me having to do anything. This is the sort of project that sounds like an afternoon and turns into a fortnight, not because the routing is hard, but because the question "is the primary link actually down?" is much harder to answer than I expected.

The naive version is easy and wrong. You give the router two WAN interfaces, set the primary as the default route with a low metric and the backup with a higher one, and trust the kernel to use the backup when the primary disappears. This works beautifully in exactly one scenario: someone unplugs the primary modem. The link goes down, the route is withdrawn, traffic shifts. Lovely.
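Spelled out, the naive setup is two default routes that differ only in metric. A minimal sketch, assuming Linux with iproute2; the interface names and gateway addresses are placeholders taken from the documentation ranges, not real values. It prints the commands unless APPLY=1 is set, since applying them needs root:

```shell
#!/bin/sh
# Hypothetical values: substitute your own interfaces and gateways.
PRIMARY_IF="${PRIMARY_IF:-eth0}"
PRIMARY_GW="${PRIMARY_GW:-192.0.2.1}"
BACKUP_IF="${BACKUP_IF:-eth1}"
BACKUP_GW="${BACKUP_GW:-198.51.100.1}"

# Print each command by default; run it for real only when APPLY=1 (needs root).
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

naive_routes() {
  # Lower metric wins, so the primary carries traffic while its route exists.
  run ip route replace default via "$PRIMARY_GW" dev "$PRIMARY_IF" metric 100
  # The backup is used only once the primary route has been withdrawn.
  run ip route replace default via "$BACKUP_GW" dev "$BACKUP_IF" metric 200
}

naive_routes
```

Nothing in this checks whether the primary can reach anything. The kernel only switches when the metric-100 route is removed, and that happens only when the link itself goes down.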

That is almost never how internet connections fail.

The failure I actually see is far nastier. The cable is connected, the modem shows a happy light, the interface is up, the route is present, and yet not a single packet reaches the wider internet. The line has dropped somewhere upstream, at the exchange, at the ISP, anywhere past the bit I can see. From the router's point of view the link is perfectly healthy. From the point of view of anyone trying to load a web page, it is dead.

A failover that only watches the link state will sit there pointing at a dead primary forever, because the link state is fine. The router is technically correct and completely useless, which is the worst combination.

The gateway box that has to decide which connection is really alive

The fix is to stop trusting the interface and start testing reachability. The router actively pings something out on the real internet, through each WAN in turn, and decides a connection is up only if the pings come back. Not the ISP's gateway, which can answer whilst the wider line is down, but several stable, independent hosts well outside the ISP's network. Public DNS resolvers are the usual choice because they are built to be highly available, they answer pings, and they are run by people who are not my ISP.

The logic has to be a little careful or it will flap. One lost ping is not an outage, it is Tuesday. So you require several consecutive failures across multiple targets before declaring a link dead, and several consecutive successes before declaring it alive again. Otherwise a moment of packet loss bounces you onto the backup and back, repeatedly, which is more disruptive than the brief loss would have been. The shape of it, stripped down:

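WAN_IF="${WAN_IF:-eth0}"   # interface under test; eth0 is only a placeholder
TARGETS="1.1.1.1 8.8.8.8 9.9.9.9"
FAILS=0
for t in $TARGETS; do
  # -I forces the probe out of this WAN, -c 1 sends one ping, -W 2 waits up to 2 s
  ping -I "$WAN_IF" -c 1 -W 2 "$t" >/dev/null 2>&1 || FAILS=$((FAILS+1))
done
# FAILS is now the number of unreachable targets for this poll;
# only act after this has failed enough times in a row, not on one bad poll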
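
The "enough times in a row" part can be sketched as a small state machine fed one poll result at a time. This is an illustration rather than my exact script: the thresholds and the names record_poll, STATE, BAD_RUN and GOOD_RUN are made up for the example, and what counts as a failed poll (here, every target unreachable) is a policy choice left to the caller:

```shell
#!/bin/sh
# Hypothetical thresholds: consecutive bad polls before "down",
# consecutive good polls before "up" again.
DOWN_AFTER=3
UP_AFTER=5

STATE=up      # current verdict for this WAN
BAD_RUN=0     # consecutive failed polls so far
GOOD_RUN=0    # consecutive successful polls so far

# record_poll ok|fail : feed in one poll result and update STATE.
record_poll() {
  if [ "$1" = ok ]; then
    GOOD_RUN=$((GOOD_RUN + 1)); BAD_RUN=0
    if [ "$STATE" = down ] && [ "$GOOD_RUN" -ge "$UP_AFTER" ]; then STATE=up; fi
  else
    BAD_RUN=$((BAD_RUN + 1)); GOOD_RUN=0
    if [ "$STATE" = up ] && [ "$BAD_RUN" -ge "$DOWN_AFTER" ]; then STATE=down; fi
  fi
}

# Example wiring after a probe loop that counted FAILS out of three targets:
#   if [ "$FAILS" -ge 3 ]; then record_poll fail; else record_poll ok; fi
```

A single good poll resets the failure run and a single bad poll resets the success run, so an isolated lost ping never changes the verdict in either direction.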

Pinging through a specific interface matters more than it looks. If you just ping a target without binding the probe to an interface, the kernel sends it out via the current default route, which tells you nothing about the backup. You have to force each probe out of the interface you are testing, so you are genuinely asking "can this connection reach the internet" and not "can the connection I am already using reach the internet".

Failing back is its own trap

Getting onto the backup is half the job. Getting back is the other half, and the half people forget. When the primary recovers you want to return to it, because it is faster, or has the higher data allowance, or is simply the one you pay more for. But you do not want to return the instant it twitches back to life, because a primary that is flapping up and down will drag every connection in the house with it as you bounce between WANs. So failback waits: the primary has to pass its health check cleanly for a sustained period, a couple of minutes say, before traffic moves back. Stability over speed.
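The waiting can be sketched as a hold-down clock that restarts on any failed health check. This is a minimal illustration, not a definitive implementation: HOLD_SECS, HEALTHY_SINCE and both function names are hypothetical, and the optional timestamp arguments exist only so the logic can be exercised without waiting:

```shell
#!/bin/sh
HOLD_SECS=120        # "a couple of minutes"; tune to taste
HEALTHY_SINCE=""     # epoch seconds of the first good poll in the current clean run

# note_primary_poll ok|fail [now] : track how long the primary has been clean.
note_primary_poll() {
  now="${2:-$(date +%s)}"
  if [ "$1" = ok ]; then
    # Remember only the start of the clean run; later good polls keep it.
    [ -n "$HEALTHY_SINCE" ] || HEALTHY_SINCE="$now"
  else
    HEALTHY_SINCE=""   # any failure restarts the clock
  fi
}

# failback_ready [now] : succeeds only after HOLD_SECS of unbroken health.
failback_ready() {
  now="${1:-$(date +%s)}"
  [ -n "$HEALTHY_SINCE" ] && [ $((now - HEALTHY_SINCE)) -ge "$HOLD_SECS" ]
}
```

The monitoring loop would call note_primary_poll after each primary health check, and move traffic back only once failback_ready succeeds.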

There is also the matter of existing connections. When you fail over, anything that was mid-flight, an ssh session, a download, a video call, breaks, because its packets were leaving from one public address and now leave from another, so the far end no longer recognises them as part of the same conversation. There is no avoiding this without proper session-aware kit I do not have at home. So I accept it. The failover's job is to make new connections work within seconds, not to keep old ones alive. The call drops, you redial, and the redial goes straight out the backup. Annoying, but the alternative is the whole house being offline until I notice and intervene.

NAT and the small print

Two details bit me that have nothing to do with the failover logic itself and everything to do with the plumbing around it. The first is NAT. Each WAN has its own public address, so the firewall needs a masquerade rule per WAN, applied to whichever interface the traffic actually leaves by. Get this wrong and failover "works" in that the route changes, but the return packets never find their way home because they were sourced from the wrong address. The rule has to follow the outbound interface, not be pinned to the primary.
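As a sketch of what "follow the outbound interface" means, here is one masquerade rule per WAN, each keyed on the interface the packet leaves by. It assumes iptables (an nftables ruleset would express the same idea), and the interface names are placeholders. As written it prints the commands; set APPLY=1 to run them, which needs root:

```shell
#!/bin/sh
# Hypothetical interface names; list every WAN the router can leave by.
WAN_IFS="${WAN_IFS:-eth0 eth1}"

# Print each command by default; run it for real only when APPLY=1 (needs root).
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

masquerade_rules() {
  for wan in $WAN_IFS; do
    # -o matches the interface the packet actually leaves by, so the source
    # address always belongs to the WAN that is carrying the traffic.
    run iptables -t nat -A POSTROUTING -o "$wan" -j MASQUERADE
  done
}

masquerade_rules
```

Because each rule matches on the outgoing interface, nothing in the firewall needs to change when the default route moves from one WAN to the other.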

The second is DNS. It is tempting to point everything at the primary ISP's resolvers, and on the day the primary's reachability dies but its resolvers still answer, you get a household where DNS works but nothing loads, which is its own special flavour of confusing. I run a local resolver that forwards to the public ones I already use as probe targets, so name resolution rides the same health-checked path as everything else. If a WAN can reach the internet, it can resolve names. If it cannot, the resolver fails over with the routing rather than independently of it.

Neither of these is hard. They are just the kind of thing that does not show up when you test by pulling a cable in daylight with everything warm, and absolutely shows up at eleven at night when the line drops for real and half the failover works.

The test that finally convinced me it worked was not unplugging a cable. Anyone can survive an unplugged cable. I blocked the probe targets at the primary modem so the link stayed up whilst reachability died, exactly the silent failure that fooled my first attempt. A few seconds later the router quietly moved everything to the backup, and when I lifted the block and waited, it moved back. That is the behaviour I wanted: not failover when the cable is pulled, but failover when the connection lies to me about being alive.