I went back to a short paper this week that I'd skimmed years ago and clearly hadn't really read: Richard Cook's "How Complex Systems Fail". It's only a few pages, eighteen numbered points, and I think it's quietly rewired how I look at every outage I've ever been part of.
The line that stuck was that catastrophe requires multiple failures, that single-point failures are not enough on their own. I'd known that in a vague way. Seeing it stated plainly made me realise how much of my own instinct had been to hunt for the one thing that broke: the bad commit, the wrong config, the person who pushed on a Friday. There's almost never a single thing. There's a system that was running closer to the edge than anyone realised, and a small nudge that happened to land in the wrong place.
The paper also makes the point that the people working in the system are continuously creating safety, not just occasionally creating risk. That reframes the postmortem entirely. The question stops being "who let this happen" and becomes "what were the defences that usually catch this, and why did they all line up wrong this time".
I'm not going to pretend one short paper made me a better engineer overnight. But it gave me a vocabulary for something I'd felt for years and couldn't name, and the next time I'm in a blameless postmortem trying to keep it actually blameless, I'll have eighteen good reasons to hand. Worth the half hour. Possibly worth a re-read every January.