Ramblings of an aging IT geek

The book that taught me how organisations fail before software does

How Richard Cook's How Complex Systems Fail and the wider resilience-engineering literature changed the way I run incidents, write postmortems, and think about blame.

The book that's been rattling around my head for the last few months isn't really a book. It's a four-page paper, Richard Cook's How Complex Systems Fail, eighteen numbered points and not a wasted word, and it has done more to change how I run an incident than any runbook I've ever written. I came to it the way most engineers do, via the resilience-engineering crowd and a few thoughtful conference talks, and once you've read it you can't un-see it.

Cook is an anaesthesiologist. The paper is about operating theatres and intensive care, where the systems are people, machines, drugs and procedures, and where failure kills. None of it mentions software. All of it is about software.

The bit that rewired me

The point that lodged hardest is number three: catastrophe requires multiple failures; single point failures are not enough. Complex systems run with latent faults present at all times. Those faults don't take you down individually, because the system is full of defences: redundancies, safeguards, the operators' own vigilance. Disaster happens when several of those latent faults line up at once, and that alignment is, by its nature, unforeseeable. You cannot enumerate it in advance because if you could, you'd already have fixed it.

I read that and thought about every postmortem I'd ever sat in where we hunted for the root cause, the one line of code or the one bad config, as if the system had a single throat to choke. There is no root cause. There's a particular, never-to-be-repeated alignment of holes in the cheese, and the moment you write "root cause: human error" you've stopped learning and started filing.
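
For the geeks who like a number attached to "you cannot enumerate it in advance", here is a back-of-envelope sketch. It is my own illustration, not anything from Cook's paper, and it assumes a hypothetical system where any two, three or four latent faults could in principle line up. The function name and the fault counts are made up for the example.

```python
from math import comb


def alignments(latent_faults: int, max_size: int) -> int:
    """Count the distinct combinations of 2..max_size latent faults.

    Assumption (for illustration only): every combination of faults is a
    candidate alignment that someone would have to review in advance.
    """
    return sum(comb(latent_faults, k) for k in range(2, max_size + 1))


if __name__ == "__main__":
    # Hypothetical fault counts, chosen only to show how fast the space grows.
    for n in (10, 40, 100):
        print(f"{n} latent faults: {alignments(n, 4)} possible alignments")
```

With 40 latent faults that is already a little over a hundred thousand candidate alignments, and a real system neither tells you how many latent faults it has nor limits them to four at a time. Hunting for the one root cause is picking a single member of that space after the fact.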

Why "human error" is the wrong place to stop

Cook is brutal on this. Point twelve, roughly: human practitioners are the adaptable element of the system, the ones constantly creating safety by making thousands of small adjustments nobody records. We only notice them when one adjustment, in hindsight, looks like the cause. The operator who ran the command that took prod down was also the operator who, on a hundred other days, made a judgement call that quietly prevented an outage you never heard about because it never happened.

So blaming the person at the keyboard isn't just unkind, it's analytically useless. It tells you nothing you can act on, because the next person in that seat, with the same tooling and the same pressure and the same misleading dashboard, will make the same call. The interesting question is never "who ran the command" but "what made running that command, at that moment, look like the reasonable thing to do?" That's where the fixable system lives.

What it changed in practice

A few concrete things have shifted in how I work, and none of them are tools.

  • I write postmortems in the past tense and the third person, describing what was known at the time, not what we know now. Hindsight makes everyone look stupid and teaches nothing.
  • I've stopped letting "we'll add more monitoring" be the default action item. Cook's point about defences is that adding a new one also adds a new way to be surprised: more alarms, more things to ignore, a higher baseline of noise that hides the next signal. More is not free.
  • I treat the people who were on call during an incident as the experts to be interviewed, not the suspects to be questioned. They were there. They know things the timeline doesn't.

There's a related strand I've been reading alongside it, the work that came out of the same community on how teams actually maintain safety, and the through-line is the same: safety is not the absence of failures, it's a thing people actively create, every day, mostly invisibly, and your job as someone running systems is to make that work easier rather than to catch people out when it occasionally fails.

I'll be honest that this hasn't made my on-call quieter. Pagers still go at three in the morning. What it has done is change the conversation the next day. We used to ask "whose fault was it?" Now we ask "what did the system look like from inside the incident, and why did the wrong thing seem right?" The first question makes people defensive and teaches you nothing. The second one occasionally teaches you something worth a sleepless night.

It's four pages. You can read it over a coffee. I'd suggest you do, and then quietly hand it to whoever in your organisation still writes "root cause: human error" at the top of the document.