Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the off-by-one that three of us approved

A paging bug that dropped one record at the boundary of every page made it through three approvals because the diff looked obviously correct, and obvious is exactly where these hide.

A terminal showing a list of records with one quietly missing

Support flagged that a customer was "missing some records" from an export. Not all of them. Not a clear chunk. Just, here and there, one would be absent. The customer had spotted it because they reconciled against their own system and the totals were off by a handful. We had not spotted it at all, because a handful of missing rows out of tens of thousands looks exactly like nothing on a dashboard.

The export paged through the data, a thousand rows at a time, and wrote each page to the file before fetching the next. Standard stuff. The kind of loop you have written a hundred times and stopped reading carefully years ago. Which is the problem.

Here is the shape of it, simplified:

offset = 0
page_size = 1000
while True:
    rows = fetch(offset, page_size)
    if not rows:
        break
    write(rows)
    offset += len(rows) + 1   # the bug

That + 1 is the whole story. Someone, at some point, had convinced themselves the offset needed to skip past the last row they had just read, so it would not be fetched again on the next page. It does not. OFFSET is already the count of rows to skip, so advancing by the page size lands you precisely on the next unseen row. The extra one skips a record at every single page boundary. One missing row per thousand, invisible in aggregate, infuriating to whoever was reconciling.

A close-up of the diff that looked completely fine

What stuck with me is not the bug. It is that three of us approved the change that introduced it, me included. I went back and looked at the original review. The diff was small, the variable names were sensible, the intent was clear, and the + 1 had a little comment next to it explaining what it was for. That comment is exactly why it sailed through. It told a tidy, plausible story, and we all read the story instead of the arithmetic.

That is the trap with off-by-ones generally. They live in code that looks obviously right, and "obviously right" is precisely the code a reviewer skims. Nobody catches the boundary bug in the gnarly forty-line function, because everyone reads that one slowly and suspiciously. They catch it in the four-line loop that any fool can see is correct, except it is not, and the confidence is the camouflage.

The fix was deleting one character. The lesson cost rather more than one character. We added a test that exports a dataset spanning several page boundaries and asserts the row count comes back exactly whole, which would have failed instantly on the buggy version and now guards the boundary directly rather than trusting anyone to eyeball it. And I have a new reviewing habit, which is that whenever I see a + 1 or a - 1 next to a comment explaining why it is needed, I treat the comment as a confession and check the maths myself. More often than I would like, the comment is the bug wearing a high-vis jacket.