Ramblings of an aging IT geek
debugging

the code was fine, i was wrong

A short note on a debugging session where the actual fault was a thing I had assumed to be true and never checked, not the code I spent two hours staring at.


I lost most of an afternoon to a service that "obviously" had a bug, only to find the code was doing exactly what I told it to. The fault was upstream of the code entirely: it was in something I had decided was true and never bothered to confirm.

The symptom was a request occasionally returning stale data. I assumed the cache was misbehaving, because that is what caches do, so I went straight at the cache. I read the eviction logic three times. I added logging. I wrote a test that should have reproduced it and stubbornly would not.

The thing I had assumed, and never checked, was that there was one instance of the service. There were two. A deploy months earlier had scaled it up, and one of the two had an older config pointing at a cache that no longer received writes. Requests that landed on the good instance were correct, requests that landed on the other were stale whenever the data had changed since, and which one I got depended entirely on which instance the load balancer happened to pick.
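To make the failure concrete, here is a toy model of it in Python. It is a sketch, not the real service: the `Instance` class, the dict-backed caches, the key `user:1` and the round-robin balancer are all stand-ins I made up for illustration, and I am assuming round-robin only because it makes the pattern easy to see.

```python
import itertools


class Instance:
    """One copy of the service, reading from whichever cache its config names."""

    def __init__(self, name, cache):
        self.name = name
        self.cache = cache

    def get(self, key):
        return self.cache.get(key)


# Two caches. Both start out in sync, but writes only reach the live one.
live_cache = {"user:1": "v1"}
orphaned_cache = {"user:1": "v1"}

instances = [
    Instance("a", live_cache),      # current config
    Instance("b", orphaned_cache),  # older config, cache no longer written to
]

# Hypothetical load balancer: plain round-robin over the instances.
balancer = itertools.cycle(instances)


def write(key, value):
    # Writes go to the live cache only, as in the story above.
    live_cache[key] = value


def read(key):
    # Each read is served by whichever instance the balancer picks next.
    instance = next(balancer)
    return instance.name, instance.get(key)


write("user:1", "v2")
results = [read("user:1") for _ in range(4)]

# A test that talks to a single instance never sees the problem.
single_instance_result = Instance("test", live_cache).get("user:1")

for name, value in results:
    print(f"instance {name} returned {value}")
```

The single-instance read at the end is the test I kept writing: it exercises the cache logic correctly and passes every time, because the stale path only exists once a second instance is in the picture.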

No amount of reading the cache code would have found that, because the code was right. My mental model was wrong. I had a fact in my head, "there is one of these", that had quietly stopped being true while I was not looking.

I now try to write down the assumptions before I touch the code. Not the clever ones, the boring ones. How many instances. Which config they actually loaded. What the load balancer is doing. Half my hardest bugs have turned out to be a thing I was so sure of that I never thought to look at it.
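If it helps to see the habit as code, this is roughly the shape of it. Again a sketch under assumptions: `Assumption`, `check` and the fact names `instance_count` and `distinct_configs` are hypothetical, and the `observed` dict is filled in by hand here. In real life those numbers have to come from actually looking at the running system, which is the entire point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assumption:
    name: str
    expected: object


def check(assumptions, observed):
    """Return (name, expected, actual) for every assumption that does not hold.

    An assumption with no observed value is reported too, since "never
    checked" is exactly the failure this is meant to catch.
    """
    failures = []
    for assumption in assumptions:
        actual = observed.get(assumption.name, "<not checked>")
        if actual != assumption.expected:
            failures.append((assumption.name, assumption.expected, actual))
    return failures


# The boring assumptions, written down before touching the code.
assumptions = [
    Assumption("instance_count", 1),
    Assumption("distinct_configs", 1),
]

# What was actually there once I looked (values taken from the story above).
observed = {"instance_count": 2, "distinct_configs": 2}

failures = check(assumptions, observed)
for name, expected, actual in failures:
    print(f"assumed {name} = {expected}, found {actual}")
```

The code is trivial on purpose. The value is in being forced to fill in `observed` from the system rather than from memory.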