Ramblings of an aging IT geek

one router, a thousand grounded flights

What an airline grounded by a single failed router has to teach the rest of us about single points of failure and the failover we never test.

Southwest Airlines spent the back half of last week in a hole. It started on 20 July, and over the days that followed they cancelled or delayed thousands of flights, stranded passengers across the country, and made the sort of headlines no operations team ever wants. The proximate cause, as it filtered out, was a single failed network router. One box, and a major airline could not reliably board aeroplanes.

I have no inside knowledge of their systems and I'm not going to pretend I do. But the public shape of it is familiar enough that I think it's worth sitting with, because the failure mode is one almost all of us are carrying somewhere, whether we admit it or not.

It's tempting, from the outside, to be smug about this. An airline grounded by a router sounds like a punchline, the sort of thing that could only happen to an organisation with creaking legacy systems and a fear of change. Maybe some of that is fair. But I've worked on enough infrastructure to know that the distance between "their single point of failure" and "my single point of failure" is one missed dependency and a year of incremental change. Nobody designs these things in. They accrete. And the larger and older a system gets, the more places they have to hide.

The single point of failure you forgot you had

The thing about a single point of failure is rarely that nobody knew it existed. It's that everybody assumed something else covered it. There's a piece of kit, or a service, or a region, that the whole edifice quietly leans on, and the architecture diagram shows it with a reassuring little "x2" next to it, and at some point the redundancy stopped being real and nobody noticed because nothing failed.

What turns a dead router into a multi-day national news story is not the router. Routers die. It's that the recovery didn't go the way the runbook promised. Either the failover didn't trigger, or it triggered and the secondary couldn't take the load, or the systems came back but in an order that left them confused about the state of the world, and reconciling that took days. The hardware failure is the spark. The duration of the fire is a software and process story.

Failover you haven't tested is not failover

Here is the uncomfortable bit, and it applies to everyone reading this who has a "highly available" anything in production.

If you have not deliberately killed the primary and watched the secondary take over, under realistic load, recently, then you do not have a failover. You have a hypothesis about a failover. The two feel identical right up until the moment the primary dies on its own schedule rather than yours, and you discover the secondary's certificate expired in March, or it can't actually handle peak traffic, or the thing that's meant to promote it has itself been quietly broken for a quarter.

I say this with the full weight of having been wrong about it personally. We had a database pair I'd have sworn would fail over cleanly. The first time it had to, in anger, the replica was forty minutes behind because a replication setting had drifted, and "forty minutes behind" during an incident is its own second incident. Nobody had tested it under real conditions in months. We assumed. The assumption was the bug.
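The failure modes in that story (an expired certificate, a replica far behind, a failover nobody has exercised in months) are all things a scheduled job can check. What follows is a minimal sketch in Python, not a description of any real system: the `SecondaryStatus` fields, the `failover_findings` name and every threshold are assumptions made up for illustration, and how you collect the numbers depends entirely on your own stack.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class SecondaryStatus:
    """Snapshot of a standby, gathered however your monitoring gathers it."""
    cert_expires_at: datetime     # expiry of the certificate the standby serves
    replication_lag: timedelta    # how far the replica is behind the primary
    last_failover_test: datetime  # last time the primary was killed on purpose


def failover_findings(status, now,
                      max_lag=timedelta(seconds=30),
                      min_cert_life=timedelta(days=14),
                      max_test_age=timedelta(days=90)):
    """Return the reasons this standby should not be trusted yet.

    The default thresholds are placeholders, not recommendations.
    """
    findings = []
    if status.cert_expires_at - now < min_cert_life:
        findings.append("certificate expired or expiring soon")
    if status.replication_lag > max_lag:
        findings.append(f"replica is {status.replication_lag} behind")
    if now - status.last_failover_test > max_test_age:
        findings.append("failover not exercised recently")
    return findings


# Illustrative values only: a standby with all three problems at once.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
status = SecondaryStatus(
    cert_expires_at=datetime(2024, 3, 1, tzinfo=timezone.utc),
    replication_lag=timedelta(minutes=40),
    last_failover_test=datetime(2023, 11, 1, tzinfo=timezone.utc),
)
for finding in failover_findings(status, now):
    print(finding)
```

An empty list from a check like this is not proof the failover works. It only rules out the cheap, embarrassing reasons for it not to, and the real test is still killing the primary.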

The deeper problem is that untested failover doesn't just sit there inert, waiting. It quietly rots. The configuration drifts. Dependencies get added that the secondary doesn't know about. Capacity planning that was sound a year ago no longer reflects the traffic you actually carry. Every one of those changes is reasonable in isolation, and not one of them flags itself as having broken your redundancy, because nothing exercises the redundancy to notice. So the gap between "what we think will happen on failover" and "what actually happens" widens steadily and invisibly, and the only instrument that ever reads it is a real outage. That is a terrible time to take the measurement.
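Configuration drift is the easiest part of that rot to catch mechanically. As a rough sketch, assuming you can export each node's settings as a flat dictionary, a comparison like the one below would flag differences nobody signed off on. The function name, the `ABSENT` marker and the example settings are hypothetical, chosen only to show the shape of the check.

```python
ABSENT = "<absent>"  # marker for a setting that exists on one node only


def config_drift(primary, secondary, expected_to_differ=frozenset()):
    """Return {setting: (primary_value, secondary_value)} for unexpected differences.

    Settings listed in `expected_to_differ` (node names, addresses and the
    like) are skipped, because they are supposed to be different.
    """
    drift = {}
    for key in sorted(primary.keys() | secondary.keys()):
        if key in expected_to_differ:
            continue
        p = primary.get(key, ABSENT)
        s = secondary.get(key, ABSENT)
        if p != s:
            drift[key] = (p, s)
    return drift


# Made-up settings for a made-up database pair.
primary = {"node_name": "db-a", "max_connections": 500, "sync_mode": "on"}
secondary = {"node_name": "db-b", "max_connections": 200}

for setting, (p, s) in config_drift(primary, secondary, {"node_name"}).items():
    print(f"{setting}: primary={p} secondary={s}")
```

Run on a schedule, something like this turns "the configuration drifts" from a thing you discover during an outage into a line in a daily report. It says nothing about capacity or missing dependencies, which still need the real exercise.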

Recovery is the part nobody rehearses

The other lesson the airline story drives home is that getting the failed thing working again is only the start. Their flights didn't resume the instant the router was replaced. Bringing a large, stateful system back up after a hard failure is a process in its own right, and it's the process people rehearse least, because it only matters on the worst day they have.

A few things I've started insisting on, all of which are boring and none of which are clever:

  • Test the failover on a schedule, not on faith. Pick a quiet window, kill the primary on purpose, and watch. If you're nervous about doing it in production, that nervousness is the finding. It means you don't trust it, which means it isn't really redundancy.
  • Write down the recovery order, not just the recovery steps. When everything comes back at once, the order matters. Which service needs which other service healthy before it can sensibly start? Get that wrong and you spend an hour watching things crash-loop while they wait for each other.
  • Know your actual blast radius. "It's just a router" is exactly the sentence that precedes a bad week. The interesting question is never what the box does on a normal day; it's what stops working when it's gone, and whether anyone has traced that dependency chain all the way out.
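The last two items on that list are both questions about a dependency graph, and once the graph is written down they are cheap to answer. Here is a minimal sketch using Python's standard `graphlib` module (Python 3.9 or later). The service names and the `NEEDS` map are invented for illustration and describe nobody's real systems; `recovery_order` and `blast_radius` are hypothetical helper names.

```python
from graphlib import TopologicalSorter

# Hypothetical map: service -> the services it needs healthy before it starts.
NEEDS = {
    "core-router": set(),
    "database": {"core-router"},
    "auth": {"database"},
    "booking": {"database", "auth"},
    "check-in": {"auth"},
    "status-page": set(),
}


def recovery_order(needs):
    """A start order in which every service comes after its dependencies.

    TopologicalSorter raises graphlib.CycleError if two services wait on
    each other, which is worth knowing before the outage rather than during.
    """
    return list(TopologicalSorter(needs).static_order())


def blast_radius(needs, failed):
    """Every service that depends on `failed`, directly or transitively."""
    dependents = {}
    for service, deps in needs.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)
    affected, stack = set(), [failed]
    while stack:
        for service in dependents.get(stack.pop(), ()):
            if service not in affected:
                affected.add(service)
                stack.append(service)
    return affected


print(recovery_order(NEEDS))
print(sorted(blast_radius(NEEDS, "core-router")))
```

The code is the trivial part. The value is in the map itself: it is only as good as the last time someone checked it against reality, which is the same drift problem as everything else here.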

The part that isn't technical

There's a final thing, and it's about honesty. Almost every one of these large outages, when the write-up eventually appears, contains a redundancy that existed on paper and not in practice. The diagram was right once. Then a config drifted, or a dependency was added, or someone made a sensible-looking change under deadline, and the safety net developed a hole that nobody had reason to look at because the net had never been needed.

The cloud outages that fill my feed every few weeks are not, mostly, stories about exotic failures. They're stories about ordinary failures meeting recovery paths that were never properly exercised. That's oddly reassuring, because it means the fix is within reach. It's just dull, and it competes for time against everything that ships features, and it only ever pays off on the day you hoped would never come.

So: this week's reminder, courtesy of a single dead router and a great many grounded passengers, is to go and break your own primary on purpose, in a window of your choosing, while you still have the luxury of choosing. Far better you find the hole in the net than the universe finds it for you at three on a Saturday morning.