Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the off-by-one three of us read and approved

A pagination bug that dropped exactly one record per page, how it sailed through two approving reviewers, and why the test that should have caught it was looking at the wrong thing.

A terminal with a paginated query and a quietly missing row

A support ticket said a customer's export was "missing a few rows". Not all of them, not half, just a few, scattered. That word "few" is what made it interesting, because random loss usually means something timing-related and frightening, whereas a fixed loss usually means arithmetic, which is at least tractable. This turned out to be arithmetic.

The export paginated through results in pages of 100. The query looked like this, more or less:

offset := page * pageSize
rows := db.Query("SELECT ... ORDER BY id LIMIT ? OFFSET ?", pageSize, offset)

Looks fine. It is fine, mostly. The bug was not here. It was in the loop that decided when to stop fetching pages, which checked len(rows) > pageSize to decide whether another page existed. It never could be greater than the page size, the query had a LIMIT of exactly pageSize. So the "is there a next page" check used a fence-post one off from reality, and on the boundary where a page came back exactly full with more data waiting behind it, the loop stopped one page early. Not one row. One page. The "few" the customer saw were the records that happened to fall at the tail of an exactly-full final fetch.

A close-up of the pagination loop and the comparison that lied

So how did this get through review? Two of us approved it, and I was one of the two. I have thought about that more than I have thought about the fix, because the fix took five minutes and the review failure is the part that will happen again.

The honest answer is that we both read the query, nodded at the query because the query was correct, and skimmed the loop because loops that fetch pages all look the same. The bug lived in the gap between two correct-looking things. The query was right. The loop shape was right. The comparison inside the loop was wrong, and it was wrong in a way that only mattered on an exact boundary, which is precisely the case a human eye glides over and an exhausted reviewer assumes someone else checked.

The test made it worse, not better. There was a test, and it passed, because it used a fixture of 250 rows. 250 is not a multiple of 100. The off-by-one only bites when the total is an exact multiple of the page size, so the one fixture we had was the one case guaranteed never to trigger it. A green test gave everyone, reviewers included, permission to stop thinking.

The fix was the boring correct comparison:

// fetch the page, then ask: did we get a full page back?
// if so, there may be more. if not, we're done.
if len(rows) < pageSize {
    break
}

And the real fix, the one that matters: a test with exactly pageSize and exactly 2 * pageSize rows, the boundaries where pagination arithmetic actually lives. We had been testing that pagination returned data. We had never tested that it returned all the data on the one count that exposes the edge.

I no longer trust a pagination test that does not use a multiple of the page size. And I have stopped approving loops by their shape. The shape is never the bug. The bug is always the comparison three lines in that everyone trusts because the shape around it looks so familiar.