The cron job that ran twice and told nobody

The bug report was three words long: "duplicate rows again." A nightly export job that pulled records, transformed them, and pushed them to a downstream system was occasionally producing each row twice. Not every night. Not predictably. Just often enough that someone downstream had built a deduplication step and stopped telling us, until the dedup itself broke and the duplicates leaked through.

That is the worst kind of bug. Intermittent, silent, and worked around by someone else for long enough that the institutional memory of it has evaporated. By the time it reached me, nobody could even say when it had started.

First, prove it is real

Before chasing causes I wanted the symptom pinned down. I added a job ID and a hostname to every row the export wrote, and a single log line at the very top of the script:

echo "$(date -Is) export start pid=$$ host=$(hostname) jobid=${JOBID:-none}" \
  >> /var/log/export/run.log

Two nights later, there it was. On the duplicate nights, two start lines, seconds apart, from the same host. The job was simply running twice. Not a logic bug in the export at all, just two invocations stepping on each other. Which raised the obvious question: who is starting it twice?
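Spotting those pairs by eye works for a few nights, but the same check can be scripted. The following is a minimal sketch, not what was actually run: it assumes GNU date (for -d), the log format written by the echo line above, and a hypothetical function name close_starts with an arbitrary default window of one hour.

```shell
# Print pairs of consecutive "export start" timestamps that are less than
# WINDOW seconds apart. close_starts and the 3600-second default are
# illustrative assumptions, not part of the original job.
close_starts() {
  log="$1"
  window="${2:-3600}"
  grep ' export start ' "$log" | {
    prev_ts=""
    prev_epoch=0
    while read -r ts rest; do
      # Convert the ISO 8601 timestamp to epoch seconds (GNU date).
      epoch=$(date -d "$ts" +%s) || continue
      if [ -n "$prev_ts" ] && [ $((epoch - prev_epoch)) -lt "$window" ]; then
        echo "$prev_ts $ts"
      fi
      prev_ts="$ts"
      prev_epoch="$epoch"
    done
  }
}
```

Calling close_starts /var/log/export/run.log would list each night on which two invocations started within the hour of each other; a clean night prints nothing.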

Following the trigger

My first assumption was a retry. Some wrapper script catching a non-zero exit and re-running. I read the wrapper. It was innocent. I read the systemd timer that I was sure triggered the job, found it clean, and very nearly stopped there, satisfied it was a one-off.

The thing that saved me was not trusting my own assumption about what started the job. I checked every scheduler on the box, not just the one I expected:

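systemctl list-timers --all        # systemd timers, including inactive ones
crontab -l                         # my own crontab
sudo crontab -l                    # root's crontab
cat /etc/crontab                   # system-wide crontab
# per-user crontabs (Debian-style path; the spool is readable only by root)
sudo sh -c 'for u in /var/spool/cron/crontabs/*; do echo "== $u =="; cat "$u"; done'
ls -la /etc/cron.d/                # drop-in cron files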

The systemd timer was real and correct. So was a line in /etc/cron.d/ that nobody had mentioned, dropped there months earlier by a configuration management run that predated the timer migration. Two schedulers, both pointed at the same script, both firing around midnight. Most nights they fired far enough apart that the first run finished before the second started, so the second found the records already processed and had nothing new to export. On the nights the report called "duplicate rows again," they overlapped, and both runs picked up the same unprocessed records.
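Enumerating schedulers finds the entries; searching for the script itself finds who points at it. A rough sketch of that second approach follows. The function name find_triggers is hypothetical, and the directories you pass it are whatever your distribution uses; none of this comes from the original investigation.

```shell
# List every file under the given paths that mentions the script path.
# Usage: find_triggers /path/to/script dir-or-file...
# find_triggers is an illustrative helper name, not an existing tool.
find_triggers() {
  script="$1"
  shift
  # -r: recurse, -l: print file names only, -F: treat the path as a literal string
  grep -rlF -- "$script" "$@" 2>/dev/null | sort
}
```

Run as root so the cron spool is readable, something like find_triggers /path/to/export.sh /etc/crontab /etc/cron.d /var/spool/cron /etc/systemd/system would be a reasonable starting set. Note that a systemd timer names a service unit rather than the script, so the match shows up in the .service file that carries the ExecStart line. More than one hit is the signal to look closer.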

Why it was silent

The reason this ran for months without a single alert is the part worth dwelling on. Both schedulers succeeded. Every single run exited zero. There was no error to catch, no failed unit, no non-zero status anywhere in any log. From every monitoring system's point of view, the job was perfectly healthy: it ran on schedule, it finished cleanly, it returned success. Twice. Correctness and success are not the same thing, and almost all of our monitoring only measures success.
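One way to make monitoring measure correctness as well as success is to have the job verify its own output and fail if the output is wrong. The sketch below assumes, purely for illustration, that the export produces a simple CSV file whose first column is a unique record key and that no field contains a quoted comma; the function name check_no_duplicates is hypothetical.

```shell
# Return non-zero if any key in column 1 of a simple CSV file appears
# more than once. The file layout is an assumption, not taken from the
# real export.
check_no_duplicates() {
  dups=$(cut -d, -f1 "$1" | sort | uniq -d)
  if [ -n "$dups" ]; then
    echo "duplicate keys in $1:" >&2
    echo "$dups" >&2
    return 1
  fi
}
```

Called as the last step of the job, with its status passed on as the script's exit code, a check like this turns a duplicated export into a non-zero exit, which is the signal the existing monitoring already understands.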

The fix, and the better fix

The immediate fix was to delete the orphaned /etc/cron.d/ entry and leave the systemd timer as the single source of truth. Five seconds, one commit, done.

The fix I actually cared about was making it impossible for this class of bug to be silent again. The job now takes a lock before it does anything, so a second concurrent invocation refuses to start rather than quietly duplicating work:

exec 9>/run/lock/export.lock
if ! flock -n 9; then
  echo "$(date -Is) export already running, refusing to start" >&2
  exit 1
fi

flock -n does not wait: if it cannot get the lock it fails immediately, so the overlapping run hits the exit 1 and exits non-zero. And a non-zero exit is the one thing every monitoring system we own actually notices. The bug went from invisible to loud, which is most of the battle.
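The non-blocking behaviour is easy to check in isolation. This is a small demonstration, assuming the util-linux flock command is installed; it uses a throwaway lock file from mktemp rather than the real /run/lock/export.lock, and the command form of flock rather than the file-descriptor form used in the job.

```shell
# Demonstrate that a second flock -n attempt is refused while the lock is held.
lock=$(mktemp)

# First holder: take the lock and keep it for two seconds in the background.
flock -n "$lock" sleep 2 &
sleep 1   # give the background holder time to acquire the lock

# Second attempt: flock -n fails at once instead of waiting for the holder.
if flock -n "$lock" true; then
  second=started
else
  second=refused
fi

wait            # let the background holder finish
rm -f "$lock"
echo "$second"  # prints "refused"
```

The lock is released automatically when the holding process exits, so a crashed run does not leave a stale lock behind the way a hand-rolled PID file can.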

What I took from it

Three things stuck. First, when something is intermittent and you cannot reproduce it, instrument the symptom before you theorise about the cause. The two start lines told me more than a week of staring at the export logic would have. Second, do not trust your own mental model of what triggers a job. Enumerate every scheduler on the box, because the one you forgot about is exactly the one causing the problem. Third, and this is the one I keep relearning: a job that exits zero is not a job that did the right thing. We monitor for failure because failure is easy to detect, but the genuinely nasty production bugs are the ones that succeed at doing the wrong thing. A lock that turns "ran twice" into a hard error is worth more than any amount of careful logic downstream, because it converts a silent correctness bug into the kind of noisy failure our tooling was built to catch.