Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

The Off-By-One That Three of Us Approved

How a classic off-by-one error slipped past two reviewers and me, and what it took to finally see it.

A bug on a terminal screen

The bug was a single character. The cost was a week of intermittent, maddening reports that a batch job was occasionally skipping the last record, and only sometimes, and only in production. It had been reviewed by two people plus me, and all three of us signed it off without a flicker of doubt.

Here's the offending loop, near enough:

for i in range(0, len(items) - 1):
    process(items[i])

Read it quickly and it looks fine. It looks like exactly the sort of careful boundary handling you'd want. The - 1 reads as defensive, as if the author was being thoughtful about the edge. That's the trap. There was nothing to defend against. range(0, len(items)) was correct, and the - 1 quietly dropped the final item every single time.

So why didn't it fail loudly? Because most of our test fixtures and most real batches ended on a record that didn't matter much, or got picked up by the next run, or was a duplicate of something already processed. The failure was real and constant, but its visible effect was intermittent. That gap between "the bug always happens" and "someone notices" is where these things hide.

Lines of code on a monitor

What got it past review is more interesting than the bug itself. The - 1 looked deliberate. None of us asked "why is that there", because it pattern-matched to caution, and you don't tend to challenge code that looks careful. We were reviewing for plausibility, not for correctness. Those are not the same thing, and I'd quietly conflated them for years.

I found it in the end by doing the least clever thing available. I stopped reading the code and started counting. I added a log line for the length of items and a log line for the number of times process was called, ran it against a known batch, and the two numbers were off by exactly one. After that it was obvious, the way these always are once the count gives you somewhere to look.

The fix was deleting two characters. The lesson cost rather more.

What I changed afterwards wasn't a process, it was a habit. When I see an arithmetic adjustment in a loop bound, a - 1, a + 1, an off-by-something that's meant to be clever, I now stop and ask what it's for. If the answer isn't immediate, that's the line to check, not skip. The careful-looking code is exactly the code that gets waved through, which makes it the best place for a bug to live.