Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the fencepost that three of us read and none of us saw

A pagination off-by-one that dropped one record per page, passed two reviewers and a test suite, and only surfaced when a customer counted their own rows and found one missing.

A terminal showing a paginated query with an off-by-one boundary

The bug report was the kind that makes you defensive before you have even read it properly: "your export is missing records." We export a lot of records, accurately, for a long time, and my first reaction was that the customer had miscounted. They had not miscounted. They had counted very carefully, which is more than I can say for the code, or for the three of us who reviewed it.

The export was paginated. Pull a page of results, write them out, advance, pull the next page, until there are no more. Standard, dull, the sort of loop you have written a hundred times and stopped reading because you already know what it says. That last part is the whole problem.

what the customer saw

They had 2,500 rows in their account. The export contained 2,475. Twenty-five missing, and the page size was 100. Twenty-five pages, one record short on each. That arithmetic is the kind that tells you exactly where the bug lives before you have even opened the file: it is not random corruption, it is not a timeout, it is a boundary that is off by one, repeated once per page.

Here is the offending loop, lightly disguised:

offset = 0
limit = 100
while True:
    rows = db.query(
        "SELECT * FROM records WHERE account_id = %s "
        "ORDER BY id LIMIT %s OFFSET %s",
        account_id, limit, offset,
    )
    if not rows:
        break
    write_rows(rows[1:])          # <- here
    offset += limit

There it is. rows[1:]. The first record of every page, silently dropped.

A close-up diff highlighting the rows[1:] slice

How does a slice like that get written in the first place? It was a leftover. An earlier version of the function fetched one extra row at the front as a sentinel to detect a boundary condition, and trimmed it off before writing. That logic was removed in a refactor. The trim was not. The rows[1:] outlived the reason it existed, which is the most ordinary way bugs are born: not by being written wrong, but by being correct, then having their context quietly removed around them while they stay put.

why three of us missed it

This is the part I actually want to talk about, because the fix is one character and the lesson is not.

Two people reviewed that change and approved it, and I was one of them. We are not careless. We read the diff. But a diff shows you what changed, and rows[1:] did not change in that PR, it was already there, sitting innocently in a line of context. Review draws your eye to the green and red. The bug was in the white. You can review a hundred diffs and never once question a line that nobody touched, because the entire frame of the exercise is "what is different here," not "is this whole function correct."

And the tests passed, which made it worse, because passing tests feel like proof. The test fixture had nine rows and a page size of ten, so everything fit on a single page. Drop the first row of the only page and you are down to eight, which the assertion happily accepted because the assertion checked that some rows came back and that they were well-formed, not that all of them did. The off-by-one needed more than one page to show itself, and the test never used more than one page. The bug was invisible at the exact scale the tests ran at, and obvious at the scale a customer ran at.

# the test that "passed"
rows = export_account(account_with_nine_records)
assert len(rows) > 0          # <- says nothing useful
assert all(valid(r) for r in rows)

the fix, and the actual fix

The one-character fix is to delete the slice and write rows. Done in seconds.

The fix that matters took longer and is duller. I made the test span multiple pages, with a row count that is not a clean multiple of the page size, and I asserted the exact count and the exact set of IDs, not a vague "more than zero." A pagination test that fits on one page is not testing pagination, it is testing a single query with extra ceremony. The whole point of paging is the seams between pages, and a fixture smaller than a page has no seams to get wrong.

What I took from it: off-by-one errors love boundaries that your tests never reach, and they love lines of code that your reviews never look at because nobody changed them this time. Counting is a perfectly good debugging tool, often better than a debugger. When 2,475 wants to be 2,500 and your page size is 100, the bug has already told you its shape. You just have to be willing to listen to the customer who counted before you assume they were wrong.