Three days for a bug that only happened when nobody was looking

[Image: a terminal showing interleaved log lines from two workers]

The report was vague in the way the worst ones always are: "sometimes a job runs twice." Not often. Not reproducibly. Never in staging, where we had two workers and a polite trickle of traffic. Only in production, where we had twenty workers and a queue that occasionally got hammered. That gap between environments is usually the whole story, and it was here too.

The job processor did the sensible-looking thing. It pulled a job, checked a status column to see if the job was already running, and if not, marked it running and got on with it. Read the status, decide, write the status. Stated like that, the bug is obvious. Stated in actual code, spread across a repository method and a service method and a helper, it was invisible to me for two days.

job = repo.get(job_id)
if job.status != "running":      # check
    repo.set_status(job_id, "running")  # act
    process(job)

The trouble is the gap between the check and the act. Two workers pull the same job within a few milliseconds of each other. Both read pending. Both decide it's theirs. Both set it to running. Both process it. Nothing is locked, nothing is atomic, and the database is perfectly happy to let two transactions both win a race that neither of them knew they were in. Under low concurrency the window is so small you basically never hit it. Under load you hit it just often enough to ruin a morning.
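
To make the window visible, here is a minimal Python sketch of that interleaving. It is not the production code: the `jobs` dict, the `processed_by` list, and the `worker` function are hypothetical stand-ins for the table and the processor, and a `threading.Barrier` forces the unlucky timing that real load only produces by chance.

```python
import threading

jobs = {42: "pending"}                  # hypothetical stand-in for the jobs table
processed_by = []                       # records which workers ran the job
both_have_read = threading.Barrier(2)   # forces both reads to happen before either write


def worker(name, job_id):
    status = jobs[job_id]               # check: read the status
    both_have_read.wait()               # both workers have now seen "pending"
    if status != "running":
        jobs[job_id] = "running"        # act: write the status
        processed_by.append(name)       # stands in for process(job)


threads = [
    threading.Thread(target=worker, args=(f"worker-{i}", 42)) for i in (1, 2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(processed_by))  # ['worker-1', 'worker-2']
```

Both workers end up in `processed_by`, because each made its decision from a read that was already stale by the time it wrote.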

[Image: two columns of code with the check-then-act gap highlighted]

I lost most of the first two days to the wrong layer entirely. I was convinced the queue was redelivering messages, because that's a famous way to get double processing, so I spent an embarrassing amount of time reading broker docs and tuning visibility timeouts. The queue was fine. The queue was delivering each job exactly once. It was my own code that was racing two workers against a shared row, and no amount of broker tuning was going to fix a check-then-act gap in the application.

The fix was small and slightly humbling. Make the claim atomic and let the database arbitrate, which is the one thing it's genuinely good at:

UPDATE jobs SET status = 'running', worker = :id
WHERE id = :job_id AND status = 'pending';

If that statement updates a row, you own the job. If it updates zero rows, somebody else got there first and you move on, no harm done. One round trip, no window, no two workers both convinced they're alone. The conditional update is the lock, and the database does the hard part.
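
Here is a minimal sketch of that claim in Python, assuming SQLite through the standard `sqlite3` module purely so it runs anywhere. The table layout and the `try_claim` helper are illustrative assumptions, not the real schema; the only point is that the affected-row count tells you who won.

```python
import sqlite3

# Assumed, simplified schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL, worker TEXT)"
)
conn.execute("INSERT INTO jobs (id, status) VALUES (42, 'pending')")
conn.commit()


def try_claim(conn, job_id, worker_id):
    """Attempt to claim a job; return True only if this worker won it."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'running', worker = :id "
        "WHERE id = :job_id AND status = 'pending'",
        {"id": worker_id, "job_id": job_id},
    )
    conn.commit()
    # One affected row means we own the job; zero means someone beat us to it.
    return cur.rowcount == 1


first = try_claim(conn, 42, "worker-1")
second = try_claim(conn, 42, "worker-2")
owner = conn.execute("SELECT worker FROM jobs WHERE id = 42").fetchone()[0]

print(first, second, owner)  # True False worker-1
```

The second call matches zero rows because the status is no longer 'pending', so worker-2 simply moves on to the next job.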

The lesson I keep relearning: any time I find myself reading a value, thinking about it, and then writing a value back, I should assume someone else is doing the exact same thing at the exact same instant, because under real load they are. Staging will never show you this. Staging is too quiet to be honest. Production is where you find out whether your "check, then act" was actually one operation or two.