Ramblings of an aging IT geek

a power blip grounded an airline, and we keep not learning

The British Airways bank-holiday IT meltdown was a power and failover failure, not a freak event, and the lessons are the boring ones we always skip.

Over the late May bank holiday weekend, British Airways grounded its entire operation at Heathrow and Gatwick. Tens of thousands of passengers were stranded and flights were cancelled right across the weekend, the lot. The cause, as it came out over the following days, was not a cyberattack and not some exotic software bug. It was a power supply problem at a data centre, followed by the failover not doing what failover is for. A power blip took down a flag carrier.

I want to be careful here, because it's very easy to sit in a comfortable chair and dunk on people who were having an extremely bad weekend. I've been on the wrong end of an outage at three in the morning and it is no fun at all. So this isn't a dunk. But it is the same lesson we keep being handed and keep declining to read, so let's read it.

Resilience is a property of the whole system, not a box you buy

The detail that matters is that there reportedly was redundant power, and a backup, and the failover itself is what caused or compounded the damage. That's the part everyone should sit with. The expensive redundant kit was in place. It just didn't behave the way the diagram on someone's wall said it would.

This is the single most common shape of a serious outage. It is almost never "we had no redundancy". It is "we had redundancy and it had never actually been exercised under real failure, so when the real failure came, the failover was the thing that broke". A backup generator you've never load-tested is a decoration. A standby database you've never failed over to is a liability with a comforting name.
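
To make "exercised under real failure" concrete, here is a toy sketch of what a failover drill checks. Everything in it is hypothetical: the Node class, the route function and the can_take_load flag are stand-ins I've made up for illustration, not anything from BA's setup or any real product. The point is only the shape of the test: take the primary down on purpose, send real traffic, and see who answers.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one side of a redundant pair."""
    name: str
    up: bool = True
    can_take_load: bool = True  # what the diagram claims; the drill tests it

    def serve(self, request: str) -> str:
        if not self.up or not self.can_take_load:
            raise RuntimeError(f"{self.name} cannot serve {request}")
        return f"{self.name} served {request}"


def route(primary: Node, secondary: Node, request: str) -> str:
    """Send the request to the primary, falling back to the secondary."""
    try:
        return primary.serve(request)
    except RuntimeError:
        return secondary.serve(request)


def failover_drill(primary: Node, secondary: Node, probes: int = 3) -> bool:
    """Deliberately take the primary down and check the secondary copes."""
    primary.up = False  # the "pull the plug" step, done on purpose
    try:
        for i in range(probes):
            reply = route(primary, secondary, f"probe-{i}")
            if not reply.startswith(secondary.name):
                return False
        return True
    except RuntimeError:
        return False  # nobody answered: the secondary was a decoration
    finally:
        primary.up = True  # always restore the primary after the drill


if __name__ == "__main__":
    print(failover_drill(Node("primary"), Node("secondary")))
    print(failover_drill(Node("primary"), Node("secondary", can_take_load=False)))
```

The second call is the interesting one: the secondary exists, it's on the diagram, and the drill still fails. You only find that out by running it.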

The boring lessons, again

None of what follows is clever. That's rather the point. The clever stuff is mostly solved. It's the boring stuff that takes airlines off the board.

  • Test your failover by actually failing over. Pull the plug, on purpose, on a schedule, when you're watching. If you have never once cut the power to the primary and watched the secondary take the load, you do not have a secondary. You have a hope.
  • Recovery is the hard part, not detection. Lots of the damage in these big outages happens during the recovery, when systems come back in the wrong order, or all at once, or with stale state, and the thundering herd of everything reconnecting at the same instant finishes off whatever the original fault started.
  • Complexity is the enemy of recovery. The more bespoke the failover dance, the more steps there are to go wrong at the worst possible moment, with the most tired people, under the most pressure. Simple, well-rehearsed, slightly dumb recovery beats clever fragile recovery every single time.
  • Most of this is organisational, not technical. The reason these tests don't happen is rarely that nobody knows how. It's that deliberately breaking production to prove it heals is terrifying, it needs a maintenance window, it needs sign-off, and it's much easier to assume the redundancy works and get on with shipping features. Until a bank holiday Saturday, when it doesn't.
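
Two of the recovery problems in that list, wrong start order and the thundering herd, have well-worn and suitably dumb mitigations: bring services up in dependency order, and have clients wait a random, growing interval before reconnecting. A minimal sketch follows, assuming Python 3.9 or later for graphlib. The service names and the base and cap values are made up for illustration; this is not a description of how any airline does it.

```python
from __future__ import annotations

import random
from graphlib import TopologicalSorter


def start_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return services ordered so each one starts after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0,
                  rng: random.Random | None = None) -> float:
    """Full-jitter backoff: a random wait between 0 and min(cap, base * 2**attempt)."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Hypothetical services; each maps to the set of things it needs first.
    services = {
        "database": set(),
        "message_bus": set(),
        "booking_api": {"database", "message_bus"},
        "check_in": {"booking_api"},
    }
    print(start_order(services))

    # Each reconnect attempt waits a random time under a doubling ceiling,
    # so clients that all failed at the same instant do not all return at once.
    for attempt in range(5):
        print(attempt, round(backoff_delay(attempt), 2))
```

The randomness is the important part. Plain exponential backoff without jitter just moves the stampede to a later, equally synchronised moment.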

Why I'm writing this down, again

I've written some version of this post after some version of this outage more times than I'd like. AWS has fallen over and taken a chunk of the web with it. Other cloud providers have had region-wide bad days. Now it's an airline's own data centre and a power fault. The headline brand changes; the post-mortem barely does.

The uncomfortable bit, for me, is that my own homelab fails the test I'm preaching. I have backups I last restored from months ago, and a "redundant" setup I've never deliberately broken to check. So I'll do the thing I always say after these: I'll go and pull a plug I'm scared to pull, on a quiet evening, while I'm watching, and find out whether the diagram on my wall is true. That, not the new clustered whatever, is the upgrade that actually keeps you flying.