Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the cron job that ran twice and told no one

A nightly job was firing twice and producing duplicate rows, and the cause was two crontabs quietly agreeing to do the same work.

A bug icon glowing over a dark terminal

The report came in as "the numbers are doubled on some days." Not every day. Some days. Which is the worst kind of bug report, because it rules out the easy explanation and hands you intermittency instead.

The job was a nightly aggregation: read yesterday's events, roll them up, write the totals. Idempotent it was not, because nobody expects a cron job to run twice, so running it twice cheerfully inserted a second set of rows. The output was visibly wrong, but only on the days it doubled, and at first I could find no pattern at all.

the timestamps gave it away

The breakthrough was logging. The job wrote a line on start and a line on finish, and when I lined up the bad days against the logs, there were two start lines, seconds apart. Two runs. Not a retry, not a loop in the code. Two genuine, separate invocations of the same script at almost the same instant.

Aug 13 02:00:01 host CRON: aggregate start
Aug 13 02:00:03 host CRON: aggregate start

Two starts, two seconds apart. Something was launching it twice.

A code listing showing crontab entries

The cause was almost insultingly simple once I went looking in the right place. The job lived in two crontabs. There was the entry someone had added to the system crontab during the original setup, and there was a second, near-identical entry in a user crontab added months later by someone migrating the schedule and not realising the old one was still live. Both fired at 02:00. Most nights they raced and one happened to win cleanly enough that the second saw the rows already present and did little harm. On the unlucky nights they overlapped, both read the same empty target, and both wrote.

# the obvious thing nobody had done: look in both places
sudo crontab -l
crontab -l -u batch
ls -la /etc/cron.d/

the fixes, in order of seriousness

I deleted the duplicate entry, which solved the immediate problem in about four seconds. But the real fix is not trusting cron to only run a thing once, because cron will happily run a thing as many times as it is told to and never mention it.

So the job now takes a lock before it does anything. flock is exactly the right tool and it is already on every box:

* 2 * * * /usr/bin/flock -n /tmp/aggregate.lock /opt/jobs/aggregate.sh

-n means "if you cannot get the lock, give up quietly rather than queue", which is what you want for a scheduled job. Now if two invocations collide, the second one finds the lock held, exits, and the world stays sane. I also made the aggregation itself idempotent, an upsert keyed on the day rather than a blind insert, so even a double run produces the right answer.

The thing I keep relearning is that "it only runs once" is an assumption, not a fact, and cron entries accumulate in places you forget to look. The duplicate had been there for months, doing damage quietly, because nothing ever shouted about it. A job that fails loudly is easy. A job that succeeds twice is the one that gets you.