Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the cron job that ran twice and never told me

A nightly batch job that occasionally double-processed records, traced to two hosts running the same crontab, and the simple lock that should have been there from day one.

A terminal showing a cron log and duplicate entries

The bug report was vague in the way the worst ones are: "some of the figures look doubled". Not all of them, not every night, just occasionally a record that had clearly been processed twice. No errors anywhere. The job exited zero every time. Whatever was wrong, it was wrong quietly, which is the kind I find hardest because there's nothing to grep for.

The job was a nightly aggregation, kicked off by cron at 02:00. First thing I did was stop trusting the assumption baked into the report, that it ran once. I added a line at the top of the script to log a timestamp, a hostname, and the PID to a file the script appended to rather than overwrote. Then I waited a couple of nights.

2017-07-09 02:00:01 host=batch-a pid=11874 starting aggregation
2017-07-09 02:00:01 host=batch-b pid=9921  starting aggregation

There it was. Two hosts. The job wasn't running twice on one machine, it was running once on each of two machines that, somewhere in a half-remembered HA effort, had both ended up with the same crontab. Most nights the timing was different enough that one finished and marked records done before the other started, so no harm. On the nights they overlapped, both read the same not-yet-processed batch and both aggregated it. Hence "occasionally", hence "doubled".

A terminal showing two hosts logging the same job start time

The real fault is older than the duplicate crontab, though. The job had no business being safe to run twice and assuming nobody would. A batch job that mutates shared state should hold a lock, full stop, whether or not you think two copies can ever start. The fix that actually closes the hole is flock, which is on every box already and turns this into a one-line wrapper:

flock -n /var/lock/aggregate.lock /usr/local/bin/aggregate.sh || exit 0

-n means non-blocking: if another process already holds the lock, the second one gives up immediately and exits cleanly rather than queueing up to do the damage a minute later. Because the lock is taken on a real file and released when the process dies, it survives crashes and doesn't leave a stale lockfile to clean up, which is the failure mode of the hand-rolled "touch a pidfile" approach everyone writes once and regrets.

I did fix the duplicate crontab too, obviously, the second host had no reason to be running it. But I left the flock in. The crontab problem was a mistake someone made and someone can make again; the lock is the thing that means it stops mattering. The lesson I keep paying for: "this only runs in one place" is a statement about today's configuration, not a property of the job. If running it twice would corrupt data, make running it twice impossible, and don't rely on the schedule to do it for you.