Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

three days for a missing lock

A flaky failure that only showed up under load turned out to be a classic check-then-act race, and the three days it took to find were mostly spent doubting the evidence.

A terminal showing a bug

It took three days. The fix was four lines. That ratio is the whole story of debugging, really, and I'm writing it down partly as a postmortem and partly as therapy.

The symptom was a job that occasionally processed the same record twice. Not often. Maybe one run in a few hundred, and only on the busy service, never on staging where I could actually watch it. That "never on staging" should have told me something immediately, because the difference between staging and production wasn't the code, it was the concurrency. Staging ran one worker. Production ran twelve.

A close up of source code on a screen

I spent the first day looking in the wrong place entirely. I was convinced it was the queue redelivering messages, because that's the obvious culprit and I'd been bitten by it before. I added logging, I checked ack timings, I read the broker docs more carefully than I'd ever wanted to. The queue was innocent. Each message was delivered exactly once. The duplication was happening inside our own code, after delivery, which is a much more uncomfortable thing to discover.

Day two was spent not believing my own logs. I'd added a log line that printed the record id and the worker id at the moment of claiming a record, and there it was in black and white: two different workers, same record id, timestamps a few milliseconds apart. The pattern was textbook. We checked whether a record was claimed, then claimed it, in two separate statements. Between the check and the act, another worker did the same check, saw the same "not claimed", and off they both went. Classic check-then-act, the oldest race in the book, and I'd written it myself eighteen months ago without a second thought.

The reason it took a third day is that I didn't trust the simple explanation. It felt too easy. So I went looking for something cleverer, some subtle reordering or a caching layer lying to me, and there wasn't one. The boring answer was the right answer. The select and the update needed to be one atomic operation, so I moved the claim into a single statement that updates the row to "claimed by me" only where it's currently unclaimed, and uses the affected-row count to decide who won. One writer wins, everyone else gets zero rows back and moves on.

UPDATE jobs
SET claimed_by = :worker, claimed_at = now()
WHERE id = :id AND claimed_by IS NULL;

No duplicates since. The lesson I keep relearning is that races hide in the gap between two statements that look like one thought in your head. The database doesn't share your head. And the second lesson, the one I actually want to remember, is to suspect the simple explanation first and only reach for clever once boring has genuinely been ruled out. I reached for clever on day one and it cost me two.