the cron job that ran twice and said nothing

The symptom was duplicate rows. Not every night, maybe one night in ten, a batch job that aggregates the previous day's data would produce two identical runs a few seconds apart. No errors, no failed lock, nothing in the log to suggest anything had gone wrong. Just two sets of output where there should have been one.

My first assumption was that something upstream was triggering it twice, because that is always the comfortable assumption: the bug is somebody else's. I added a line to the script that logged the PID and the exact start time, and waited for it to happen again.
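The logging line was something along these lines. This is a sketch, not the exact line from the script: the `log_start` name and the log format are made up for illustration, and the hostname field is an extra that the original line did not have.

```shell
#!/bin/sh
# Hypothetical helper: print one line that identifies this particular run.
log_start() {
    # $1 = job name. $$ is the PID of the running shell, uname -n is the
    # host name, and date -u gives a UTC timestamp to the second.
    printf '%s host=%s pid=%s job=%s start\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(uname -n)" "$$" "$1"
}

log_start aggregate
```

Called first thing in the job, it leaves one line per run, so two runs show up as two lines that can be compared field by field.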

When it did, the two log lines were three seconds apart and had different PIDs. So these really were two separate processes, and it looked as though cron itself had started the job twice. Cron does not do that on its own, which sent me to the crontab to look for a duplicate entry I had fat-fingered. There wasn't one. The line was there exactly once.

The thing I had not clocked is that this box runs in a timezone that observes a clock change, and the job is scheduled for an hour that, on one specific morning of the year, happens twice. The crontab said 30 1 * * *. On the night the clocks went back, 01:30 local time occurs once before the change and once after, and Vixie cron on this distribution will happily fire the job at both. An hour of wall-clock time is replayed, and anything scheduled inside it runs again.
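To see the repeated hour concretely, the sketch below formats two instants exactly one hour apart under a zone rule whose clocks go back between them. The zone rule (UK-style, written as a POSIX TZ string) and the 2023 date are assumptions picked for illustration, since nothing above says which timezone the box is in. `date -d @…` assumes GNU or BusyBox date.

```shell
#!/bin/sh
# Assumed zone rule, for illustration only: GMT in winter, BST in summer,
# clocks go back on the last Sunday of October at 02:00 local time.
ZONE='GMT0BST,M3.5.0/1,M10.5.0'

# Two instants one hour apart on 2023-10-29, as UTC epoch seconds.
first=1698539400    # 00:30 UTC, summer time still in effect
second=1698543000   # 01:30 UTC, after the clocks went back

local_time() {
    # Render an epoch timestamp as local wall-clock time in the assumed zone.
    TZ="$ZONE" date -d "@$1" '+%H:%M %Z'
}

local_time "$first"    # 01:30 BST
local_time "$second"   # 01:30 GMT
```

Both lines read 01:30 on the wall clock, which is the minute a `30 1 * * *` entry matches.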

That would explain one duplicate a year, but not one night in ten, and certainly not a duplicate in July, nowhere near a DST boundary. So that was a red herring I had constructed for myself, and a good reminder to confirm a theory before getting attached to it.

The real cause was duller and entirely my fault. The job had been migrated to a new host, and the old host's crontab had been copied across with it, but the old host had not been decommissioned. Both machines were running the same job against the same database, on the same schedule, drifting a few seconds apart because their clocks were not perfectly synced. Two hosts, one job, no coordination. The "one night in ten" pattern was just the two of them occasionally lining up closely enough that I noticed the pair.

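The fix was thirty seconds of work: disable the crontab on the retired host and actually turn the thing off. The lesson was longer. The job had no notion that it might not be alone. A row in a shared table claiming the run would have made the double execution impossible regardless of how many hosts thought they owned the schedule. A flock gives the same guarantee only when every host locks the same file and the runs overlap in time: a lock file under /var/lock is local to one machine, so it stops a second copy on that host and nothing more. I added the flock regardless, as a cheap guard against overlapping runs, and the next migration, which will go the same way, will need the claim row or a lock on shared storage. Future me does not deserve to lose another evening to it.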

30 1 * * * /usr/bin/flock -n /var/lock/nightly.lock /opt/jobs/aggregate.sh
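For the cross-host case, the claim row is the option that does not depend on where a lock file lives. A minimal sketch, assuming the `sqlite3` command-line tool as a stand-in for whatever shared database the job really writes to; the `job_runs` table, the `claims.db` file and the `claim_run` function are all hypothetical names.

```shell
#!/bin/sh
# Sketch only: sqlite3 stands in for the real shared database.
# The database file and the job_runs table are hypothetical.
DB=${DB:-${TMPDIR:-/tmp}/claims.db}

claim_run() {
    # $1 = job name, $2 = run date (YYYY-MM-DD).
    # The primary key rejects a second INSERT for the same job and date,
    # so sqlite3 exits non-zero and only the first caller wins the claim.
    sqlite3 "$DB" "
        CREATE TABLE IF NOT EXISTS job_runs (
            job      TEXT NOT NULL,
            run_date TEXT NOT NULL,
            PRIMARY KEY (job, run_date)
        );
        INSERT INTO job_runs (job, run_date) VALUES ('$1', '$2');
    "
}

if claim_run aggregate "$(date +%Y-%m-%d)"; then
    echo "claimed today's run, proceeding"
else
    echo "run already claimed (or the claim failed), skipping" >&2
fi
```

The point of the design is that the database, which both hosts already share, does the arbitration: whichever host inserts second gets a constraint error instead of a second set of rows.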

Silent duplicate work is the worst kind of bug. Nothing crashes, nothing pages you, the data just quietly goes wrong in a way you only notice weeks later when a total doesn't add up. Make the job loud about being alone, and you never have to debug the silence.
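One way to make the lock itself loud: `flock -n` exits with status 1 and prints nothing when the lock is already held, so a skipped run leaves no trace unless the job says so. The sketch below takes the lock inside the script and writes a line to stderr when it loses. The `run_exclusively` name, the message format and the default lock path are assumptions for illustration, and it relies on the `flock` utility accepting a file descriptor number.

```shell
#!/bin/sh
# Sketch: take the lock inside the job and complain when it is already held.
# The default lock path is a hypothetical location for illustration.
LOCK_FILE=${LOCK_FILE:-${TMPDIR:-/tmp}/nightly.lock}

run_exclusively() {
    # Open the lock file on descriptor 9 for the duration of the subshell.
    # The lock is released when the subshell exits and the descriptor closes.
    (
        if ! flock -n 9; then
            echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) host=$(uname -n) pid=$$" \
                 "lock $LOCK_FILE is held, refusing to run: $*" >&2
            exit 1
        fi
        "$@"
    ) 9>"$LOCK_FILE"
}

run_exclusively echo "aggregating"
```

Cron mails or logs whatever a job writes to stderr on most setups, so the refusal ends up somewhere a person will eventually read it.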