the cron job that ran twice and told no one

A bug on a terminal screen

Some records were being processed twice. Not many, and not every night, which is the worst kind of bug because it lets you convince yourself you imagined it. Invoices with a duplicated line. A counter that occasionally jumped by two. The job logged a clean, successful run every single time, so the logs were no help at all. They told the truth: the job had run successfully. They just didn't mention it had run successfully on two machines at once.

Code on a screen during debugging

The job had been moved between hosts months earlier, during some migration nobody fully remembered. The crontab was duly added to the new host. The crontab was not removed from the old one, because the old host was meant to be decommissioned and never quite was. So there it sat, still alive, still booting, still running the nightly batch at 02:00 against the same shared database the new host used. Two crons, two hosts, one database, no lock.

Most nights they didn't actually collide on the same records, because the job grabbed a batch and the timing happened to interleave harmlessly. But when both started within the same second and both SELECTed the same pending rows before either had marked them done, both processed them. No error, no warning, two perfectly successful runs. The double-processing was a race, which is why it was intermittent and why it looked like a ghost.

Finding it was embarrassingly simple once I stopped trusting the application logs and looked at the database. The audit table had two rows for the same batch, microseconds apart, stamped with two different hostnames. I'd been staring at one host's logs the whole time. The second hostname in that column was the entire mystery solved in a single line.

The immediate fix was to crontab -r on the zombie host and then, properly, to actually decommission it so it couldn't rise again. But the real fix is to stop relying on "there's only one of me" as a correctness guarantee, because there is never reliably only one of you. The job now takes an advisory lock before it does anything:

SELECT pg_try_advisory_lock(42);  -- returns false if someone else holds it

If the lock isn't granted, the job logs "another instance is running" and exits cleanly. Now it does not matter how many hosts fire it off, how many stray crontabs survive the next migration, or how cleverly two starts interleave. Exactly one wins the lock and does the work; the rest bow out. A batch job that mutates shared state should assume it has a twin, because one day, quietly, it will.