The bug was one character wide and it got past four people, me included. It sat in production for the better part of a month before anyone noticed, and the only reason it surfaced at all was that a customer's dataset happened to land on an exact boundary.
The code was a pagination helper. Fetch a page of results, and if there might be more, fetch the next one. The condition for "there might be more" was the problem:
for offset := 0; offset <= total; offset += pageSize {
	rows := fetch(offset, pageSize)
	process(rows)
}
Look at the <=. When total is an exact multiple of pageSize, that loop runs one extra time, with an offset that sits right at the end of the data. Usually that extra fetch returned an empty page and process did nothing, so it was invisible. One wasted query was never going to get spotted in a review.
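To make the extra iteration concrete, here is a minimal sketch that lists the offsets each bound visits. It is not the production code: offsetsVisited is a hypothetical helper and the numbers are made up for illustration.

```go
package main

import "fmt"

// offsetsVisited is a hypothetical helper (not from the original code).
// It returns the offsets the pagination loop would request. With inclusive
// set it behaves like the buggy "offset <= total" bound; otherwise it
// behaves like "offset < total". It assumes pageSize > 0.
func offsetsVisited(total, pageSize int, inclusive bool) []int {
	var offsets []int
	for offset := 0; offset < total || (inclusive && offset == total); offset += pageSize {
		offsets = append(offsets, offset)
	}
	return offsets
}

func main() {
	// 20 rows in pages of 10 is an exact multiple, so <= asks for a third, empty page.
	fmt.Println(offsetsVisited(20, 10, true))  // [0 10 20]
	fmt.Println(offsetsVisited(20, 10, false)) // [0 10]

	// 25 rows in pages of 10 is not a multiple, so both bounds agree.
	fmt.Println(offsetsVisited(25, 10, true))  // [0 10 20]
	fmt.Println(offsetsVisited(25, 10, false)) // [0 10 20]
}
```

The two bounds only disagree when total lands exactly on a page boundary, which is why ordinary data never showed the problem.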
But this particular process wasn't a no-op on empty input. It wrote a summary record, and that summary assumed at least one row. On the empty trailing page it wrote a record with a count of zero and a null where there should have been an id, and that null tripped a constraint downstream in a job that ran nightly. So the symptom turned up twelve hours later, in a different system, with no obvious connection to the pagination code. Lovely.
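Roughly, the summary step behaved like the sketch below. The Row and Summary types, the FirstID field, and the choice of the first row's id are all assumptions for illustration; the real record looked different, but the failure had the same shape.

```go
package main

import "fmt"

// Row and Summary are hypothetical shapes invented for this sketch.
type Row struct{ ID int }

type Summary struct {
	Count   int
	FirstID *int // nil stands in for the null that broke the downstream constraint
}

// summarize mimics a summary step that was written with non-empty pages
// in mind. On empty input it still returns a record, with Count 0 and no id.
func summarize(rows []Row) Summary {
	s := Summary{Count: len(rows)}
	if len(rows) > 0 {
		s.FirstID = &rows[0].ID
	}
	return s
}

func main() {
	empty := summarize(nil)
	fmt.Println(empty.Count, empty.FirstID == nil) // 0 true

	full := summarize([]Row{{ID: 7}, {ID: 8}})
	fmt.Println(full.Count, *full.FirstID) // 2 7
}
```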
The thing that gets me is how it survived review. We all read <= and our brains autocorrected it to "loop until we've covered everything", which is exactly the intent, just expressed wrongly. An off-by-one is the kind of bug your eyes slide over because the code looks like it's doing the right thing. It reads correctly even when it isn't.
The fix was a single character, <= to <, plus a guard so an empty page never wrote a summary at all. But the real fix was the test. We didn't have a case where the total was an exact multiple of the page size, so the boundary was never exercised. I added that case, then one either side of it, because boundaries come in threes: one under, exactly on, one over.
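A sketch of the corrected loop and the three boundary cases, under some assumptions: the in-memory slice, the Row type, and the paginate wrapper are stand-ins I've invented here so the example runs on its own, and the guard simply skips empty pages rather than calling the real process.

```go
package main

import "fmt"

// Row and the in-memory data slice stand in for the real storage layer;
// both are assumptions made for this sketch.
type Row struct{ ID int }

// fetch returns up to pageSize rows starting at offset, or an empty
// result when offset is at or past the end of the data.
func fetch(data []Row, offset, pageSize int) []Row {
	if offset >= len(data) {
		return nil
	}
	end := offset + pageSize
	if end > len(data) {
		end = len(data)
	}
	return data[offset:end]
}

// paginate walks the data with the corrected bound (offset < total) and
// returns the size of each page that would be handed to process.
func paginate(data []Row, pageSize int) []int {
	var pageSizes []int
	total := len(data)
	for offset := 0; offset < total; offset += pageSize {
		rows := fetch(data, offset, pageSize)
		if len(rows) == 0 {
			continue // guard: an empty page never reaches the summary step
		}
		pageSizes = append(pageSizes, len(rows))
	}
	return pageSizes
}

func main() {
	const pageSize = 10
	// Boundaries come in threes: one under, exactly on, one over.
	for _, total := range []int{19, 20, 21} {
		data := make([]Row, total)
		fmt.Println(total, paginate(data, pageSize))
	}
}
```

With a page size of 10 this prints page sizes of 10 and 9 for 19 rows, 10 and 10 for 20 rows, and 10, 10 and 1 for 21 rows. The middle case is the one the old suite never had.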
I've stopped trusting myself to spot off-by-ones by reading. They're a category of bug that's cheap to test and expensive to eyeball, so now when I see a loop with a bound and an index I write the boundary test first and let the machine tell me. Reviewers are good at "is this the right approach". They're rubbish at counting, and so am I.