Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the cron job that ran twice and told nobody

A nightly job that overlapped with itself, double-processed a queue, and left no error behind, plus the locking and timing fixes that stopped it recurring.

A terminal showing a cron schedule and logs

Some bugs announce themselves. This one was the opposite. There was no error, no failed exit code, no alert. Just a quiet report from a colleague that a handful of records had been processed twice, and a downstream system that had politely sent two emails to the same customer. No stack trace, no crash, nothing in the dashboard the colour red. Those are the worst ones, because the system is working exactly as designed, and the design is wrong.

The job in question was a nightly batch that drained a queue, did some work per item, and marked each item done. Simple. It had run every night for a year without complaint. So the first thing I did was distrust the word "every night without complaint", because a job that fails silently looks identical to a job that succeeds.

reading the logs that did exist

There were logs, just not error logs. The job logged a start line and a finish line with a timestamp, and once I lined up a week of them, the shape of the problem appeared. Most nights the job started, ran for eight or nine minutes, and finished. But on the night in question, there were two start lines a few minutes apart, and they overlapped.

03:00:01 batch: starting, 4120 items
03:00:01 batch: starting, 4120 items
03:08:44 batch: finished, 4120 processed
03:11:02 batch: finished, 4120 processed

Two starts. Both saw the same 4120 items, because they read the queue at almost the same moment, before either had marked anything done. Both processed the lot. Every item got handled twice, and from the job's own point of view, both runs were a complete success. The "processed twice" wasn't a bug in the work; it was two correct runs of the work, stacked on top of each other.

why did it start twice

This is where it got slightly silly. The crontab had one entry. I checked it three times. One line, 0 3 * * *, no duplication. So where did the second invocation come from?

The answer was a deployment. The job ran on two hosts behind a config that was meant to designate one of them as the "primary" that actually ran scheduled work. That config was a file on disk, and a deploy a few days earlier had, through a templating mistake, marked both hosts as primary. So both crons fired at 03:00, both jobs hit the same shared queue, and the system happily did everything twice. The crontab was innocent. The thing that decided whether to run was broken, and nothing checked it.

the real flaw: no mutual exclusion

Here is the uncomfortable part. The duplicate-primary config was the trigger, but it was not the bug. The bug was that the job had no defence against running twice at all. It assumed it was the only one. A batch job that drains a shared queue and assumes exclusivity is a job waiting for exactly this morning. The config mistake just supplied the occasion.

So the fix was not "make sure only one host is primary", although I fixed that too. The fix was to make the job safe to invoke twice, because sooner or later it would be, whether through a deploy, a manual run, a retry, or a cron daemon having a funny five minutes after a clock change.

The cheapest correct answer is a lock the job must hold to run. flock does this without any application code:

0 3 * * * /usr/bin/flock -n /var/lock/batch.lock /usr/local/bin/nightly-batch

-n means non-blocking: if the lock is already held, the second invocation gives up immediately rather than queuing behind the first. So even with two crons firing, only one job runs; the other notices the lock, exits, and that is the end of it. For the work on the records themselves, I also made the per-item update conditional, so that marking an already-done item does nothing rather than re-triggering the downstream side effect. Belt and braces, because the lock protects against overlap and the conditional protects against everything else.

the timing detail that nearly hid it

One more thing worth recording, because it cost me an hour. The two runs read the queue at "almost the same moment", and that almost matters. If the first job had finished marking items done before the second started, the second would have found an empty queue and done nothing, and I'd never have seen the bug. It was only because both started within the same second, before either had committed any progress, that they both saw the full backlog. Which means this had probably happened before, harmlessly, on nights when the timing was slightly kinder, and nobody noticed. The bug wasn't new. The timing that made it visible was new.

what I actually changed

I corrected the templating so only one host is marked primary, because two was simply wrong. But I treated that as the lesser fix. The flock wrapper and the idempotent updates are the ones that let me sleep, because they hold even when some future deploy gets the config wrong again, which it will. A job that drains a shared resource should never trust that it is alone. It almost always is. The cost of the one night it isn't is a customer getting two emails, a quiet bug report, and a morning spent reading timestamps. Cheap insurance, in the end, and I should have bought it a year ago.