Ramblings of an aging IT geek
personal

the safety book that made me kinder about outages

How Sidney Dekker's Field Guide to Understanding Human Error changed the way I run incident reviews and think about blame in complex systems.

A coffee and a stack of books, a quiet Saturday read

Most of the books that change how I think about systems are about computers. This one is not, and it has done more to change how I run an incident review than anything technical I have read. Sidney Dekker's The Field Guide to Understanding Human Error is about aviation, surgery, and the messy places where humans operate complicated machines, but every page of it applies to a 3am page and a postmortem doc.

The central idea is deceptively simple, and once you have it you cannot imagine running a blameless review without it. There are two ways to look at human error. The old view treats error as the cause: the operator made a mistake, the operator was careless, find the bad apple and remove it. The new view treats error as a symptom: the operator did something that made complete sense given the information they had, the pressure they were under, and the system they were sitting inside. The job is not to find who was stupid. It is to understand why a reasonable person did a reasonable thing that turned out badly.

hindsight is the enemy

The part that genuinely shifted my thinking is what Dekker says about hindsight. After an outage, we know how it ended. We know which alert mattered and which warning was real. So we look back at the engineer who ignored that warning and ask, incredulously, how they could possibly have missed it.

But they did not have our hindsight. At the time, that warning was one of forty, most of them noise, on a dashboard they had learned to half-ignore precisely because it cried wolf so often. The "obvious" signal was only obvious once you knew the ending. This is hindsight bias, and Dekker spends a good part of the book on it. Learning to spot it has changed how I read every timeline. The question is never "why did they ignore the alert". It is "what made ignoring the alert the sensible thing to do in the moment", because answering that tells you something you can actually fix.

A wide open landscape, the long view you need after a hard week

blame is a comfort, not a cause

The other thing the book takes apart, gently but completely, is how satisfying blame is. Finding the person responsible feels like closure. It lets everyone else relax, because the problem has a name and the name is not theirs. And it is almost always wrong, not because nobody made a mistake, but because the mistake was made possible by the system. Remove that one person and the next reasonable human in the same seat, under the same pressure, with the same misleading dashboard, will make a version of the same mistake. You have fixed nothing. You have just used up some goodwill.

I think about this every time I write up an incident now. If the conclusion of my postmortem is "person should have been more careful", I have not finished the analysis. That is a place to start asking why, not a place to stop. What did the system make easy that should have been hard? What did it make hard that should have been easy? Where was the person set up to fail and then blamed for falling?

It is not a long book and it is not a difficult one, though it is the kind that makes you slightly uncomfortable about reviews you ran years ago. If you are anywhere near on-call, or you write up failures, or you sit in the meetings where someone decides whose fault it was, read it. It will make you better at the job and, I think, a bit kinder while you do it. That is not a bad thing to get out of a Saturday afternoon.