the cron job that ran twice and told nobody

A bug crawling across a terminal screen

A nightly batch job was, every so often, processing everything twice. Not every night. Maybe one night in five. The symptom was duplicate rows in a downstream table and a quietly inflated count that someone in finance, bless them, noticed before I did. The job itself logged a clean single run every time, which is what made it maddening.

The job is idempotent in theory, in the way that things are idempotent in theory and then are not. It reads a queue, processes the batch, writes results, marks the batch done. If it runs once, fine. If it runs twice with enough of a gap, the second run picks up records the first had not yet marked, and you get duplicates without either run ever seeing anything wrong. Each run is individually correct. There were just sometimes two of them.

A snippet of cron and log output

The logs from the application were no help, because from its point of view nothing was wrong: it ran, it succeeded. So I went up a level to cron itself and turned on the logging that should have been on from the start. On this host that meant making sure cron's own output went somewhere I could read it, and then waiting for a bad night.

The bad night came and the syslog told the story immediately:

Jan 22 02:00:01 box-a CRON[20413]: (deploy) CMD (/opt/app/run-nightly.sh)
Jan 22 02:00:01 box-b CRON[18890]: (deploy) CMD (/opt/app/run-nightly.sh)

Two hosts. Same command, same minute, same user. The job was not running twice on one machine. It was running once on each of two machines, both of which believed they were the one that should be doing it.

The cause was a migration I had half finished months earlier. We were moving the batch workload from box-a to box-b, and as part of that the crontab had been copied across to box-b so it was ready to take over. The cutover, the bit where you remove the entry from the old box, never happened. So both boxes sat there with an identical entry, firing at 02:00 every night. Most nights one finished fast enough that the queue was empty before the other got to the interesting part, and you saw nothing. On a slow night, both got a slice of real work, and you got duplicates.

The fix was the obvious one: pick a winner, remove the cron entry from the loser, done. But the more useful fix was admitting that "this job must only ever run in one place" was an assumption living entirely in my head, enforced by nothing. So the job now takes a lock before it does anything, a simple advisory lock in the database keyed on the job name. If a second instance starts, anywhere, it fails to take the lock and exits without touching a thing. flock on a shared filesystem would do the same job for something simpler.

The lesson I keep relearning is that the application logs tell you what the application did, not what the system did to it. When the app insists it behaved perfectly and the outcome says otherwise, the truth is usually a layer down: cron, systemd, the scheduler, the thing that decided to run your perfectly correct program one more time than you meant it to. And a mutex is cheap insurance against a migration you were certain you had finished.