the overlapping cron job that ate its own tail

A terminal showing a tangled cron log

This one didn't announce itself as a cron problem at all. It announced itself as corrupted output from a sync job that had run perfectly well for the best part of a year. Files half-written, a few records mangled, no errors logged, and the job exiting zero as if nothing was wrong. The thing about jobs that have worked for a year is that you stop watching them, so by the time anyone noticed, it had been quietly producing the odd bad file for days.

The job ran every fifteen minutes. It pulled data from an upstream source into a working directory, transformed it, and wrote the result out. Fifteen minutes had been wildly generous when it was written. The job took ninety seconds. Plenty of headroom.

Except the dataset had grown. Slowly, the way these things do, a percent or two a week, nobody watching the runtime. Over a year ninety seconds had crept up to fourteen minutes on a good run, and on a bad run, when the upstream was slow, past fifteen. And the moment a run took longer than fifteen minutes, cron started the next copy while the first was still going. Two copies, sharing one working directory, reading and writing the same temporary files. They trampled each other. Hence the corruption, hence the "sometimes", hence a year of silence followed by a mess.

Apr 12 14:00:01 sync CRON[3110]: (svc) CMD (/usr/local/bin/sync.sh)
Apr 12 14:15:01 sync CRON[3998]: (svc) CMD (/usr/local/bin/sync.sh)
Apr 12 14:16:42 sync sync.sh[3110]: wrote 41203 records   # the 14:00 run, finishing late

The 14:00 run finished at 14:16, after the 14:15 run had already started. For a minute and a bit there were two of them in the same directory. That overlap was the whole bug.

Code on a monitor showing a shell script

There are two failures stacked on top of each other here, and they're worth separating because they need different fixes.

The first failure is that the job could overlap at all. A scheduled job that uses a shared working directory has no business running two copies, and "the interval is longer than the runtime" is not protection, it's an assumption that rots the moment the data grows. The fix is the same lock I reach for every time:

exec flock -n /var/lock/sync.lock /usr/local/bin/sync-impl.sh

-n for non-blocking, so if the previous run is still going the new one exits immediately and cleanly rather than queueing up to make things worse. A skipped run every now and then is fine; the next one fifteen minutes later catches up. Two runs in the same directory is not fine. flock makes the bad case impossible instead of merely unlikely.

The second failure is the one I'm crosser about: nobody knew. The job had been creeping towards the cliff for months and there was no signal. Cron will happily email you the output of a job if you let it, and most people, including past me, disable that the moment the noise gets annoying and then fly blind forever. So two things went on every scheduled job after this:

[email protected]
*/15 * * * * svc /usr/local/bin/sync.sh

MAILTO means if the job writes anything to stderr, or exits non-zero, someone hears about it. And inside the script, a crude runtime guard that complains when a run takes suspiciously long:

start=$(date +%s)
run_the_actual_work
elapsed=$(( $(date +%s) - start ))
if [ "$elapsed" -gt 600 ]; then
  echo "sync.sh took ${elapsed}s, approaching the 900s interval" >&2
fi

Ten minutes against a fifteen-minute interval is the point at which I want a nudge, not the point at which it breaks. That stderr line, via MAILTO, turns "the runtime is creeping" from something nobody measures into an email that lands months before the cliff.

A second screen of terminal output showing a timing log

The general lesson, again, because cron keeps teaching me the same one in new outfits: a schedule is a promise about when a job starts, not a promise that the previous one has finished. If two copies running at once would cause damage, lock so they can't. And give every scheduled job a mouth, via MAILTO and a sanity check or two, so that when it starts drifting towards trouble it can tell you, rather than spending a year quietly corrupting your data and waiting for a human to stumble across it.