Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the off-by-one three people signed off on

A pagination bug that dropped one record per page, traced to an inclusive-versus-exclusive range mismatch that passed code review because the test data was too small to expose it.

A terminal showing a paginated query result

The bug report was a single line: "the export is missing rows". Not all of them, which would have been easy. Just some. A customer with a few thousand records had run an export, counted the output, and found it short by a couple of dozen. The number changed depending on how many records they had, which is the kind of detail that should have told me immediately what was going on, and didn't.

The export paginated. It pulled records in batches of fifty, processed each batch, asked for the next, and stopped when a batch came back empty. Standard stuff, written months earlier, reviewed by three people including me, running happily in production ever since. The pagination used an offset and a limit, and computed the offset for each page from the page number.

def page_bounds(page, size=50):
    start = page * size
    end = start + size
    return start, end  # passed to a slice: records[start:end]

Stare at that for a second. It is correct. records[0:50], then records[50:100], and so on, clean exclusive upper bounds, no overlap, no gaps. The slice version was fine. The problem was that this code didn't slice a list. It built a SQL WHERE clause, and somewhere along the way someone had translated the exclusive end into an inclusive BETWEEN.

-- what the slice-shaped code intended:  start <= id < end
-- what the query actually did:          BETWEEN start AND end-? 
WHERE id BETWEEN :start AND :end

BETWEEN in SQL is inclusive on both ends. So page zero asked for ids 0 to 50, and page one asked for ids 50 to 100. Id 50 was fetched twice, once on each page. The dedup step downstream then quietly threw one copy away. One record per page boundary, swallowed without complaint. With a hundred records you lose one or two and nobody notices in testing. With three thousand records you lose sixty, and a customer counts them.

It survived review for the most ordinary reason there is: the test fixture had thirty rows. Thirty rows is less than one page. The pagination code never ran its second iteration in any test, so the boundary that contained the bug was never exercised. Three of us read the code, agreed the arithmetic looked right, and none of us noticed that "looks right for a list slice" and "is right for an inclusive SQL range" are different claims. The review checked the shape of the code, not the semantics of the boundary, because the only place the boundary mattered was a value of end that the tests never reached.

The fix was one word, BETWEEN start AND end became id >= start AND id < end, and the export came out whole. But the fix I actually cared about was the test. The fixture now has enough rows to force at least three pages, and there's an explicit assertion that every source record appears in the output exactly once. Off-by-ones don't hide from a test that crosses the boundary. They only hide from tests too small to reach it, which, going by the size of most fixtures I've written, is most of them.