Someone in finance got the same summary email twice, eleven minutes apart, on a Tuesday. Just once, just them, just that day. Everyone else got one copy. That is the worst possible bug report because there is nothing systematic to grab hold of: it isn't every run, it isn't every recipient, and the job exits cleanly every single time. No errors, no stack trace, just one human noticing a duplicate in their inbox.
The job was a daily report mailer, scheduled by cron a few minutes after midnight. My first instinct was the usual suspect: a duplicate crontab on a second host. I've been bitten by that before. But there was only one host and one crontab line, and everything in the syslog came from that single cron daemon. So that theory died early, which was annoying, because it's normally the answer.
What actually happened was the clock moved.
Mar 20 00:05:00 reporter CRON[20194]: (root) CMD (/usr/local/bin/daily-report.sh)
Mar 20 00:16:00 reporter CRON[20677]: (root) CMD (/usr/local/bin/daily-report.sh)
Two starts, eleven minutes apart, from one cron. That's not how cron behaves unless time itself does something strange. And it had. This box had drifted: ntpd had been off for a while during some maintenance, and by the time it came back the clock was running ahead. When ntpd corrected the offset, it didn't slew gently. It stepped the clock backwards, past the minute boundary that the job's schedule sat on. Cron, being a creature of wall-clock minutes, saw that minute tick over a second time and dutifully fired the job again.
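To make the mechanism concrete, here is a toy model of a minute-driven scheduler. It is not how any particular cron is implemented, and real cron implementations differ in how they treat clock jumps. The function name `simulate` and all the numbers are illustrative assumptions. The model wakes once per real minute, reads a simulated wall clock, and fires whenever that clock shows the scheduled minute.

```sh
#!/bin/sh
# Toy model only: NOT cron's real algorithm. All names and numbers are
# illustrative. Minutes are counted as minutes past midnight.

# simulate SCHEDULED START STEP_AT STEP_BACK TICKS
#   SCHEDULED  wall-clock minute the job is scheduled for
#   START      wall-clock minute when the simulation begins
#   STEP_AT    wall-clock minute at which one backward step is applied
#   STEP_BACK  size of the backward step, in minutes
#   TICKS      number of real minutes to simulate
simulate() {
    scheduled=$1; wall=$2; step_at=$3; step_back=$4; ticks=$5
    stepped=0
    i=0
    while [ "$i" -lt "$ticks" ]; do
        # The scheduler only ever looks at the wall clock.
        if [ "$wall" -eq "$scheduled" ]; then
            printf 'real minute %d: wall clock reads 00:%02d, job fired\n' "$i" "$wall"
        fi
        wall=$((wall + 1))              # one real minute passes
        # Apply a single backward step when the wall clock reaches step_at.
        if [ "$stepped" -eq 0 ] && [ "$wall" -eq "$step_at" ]; then
            wall=$((wall - step_back))
            stepped=1
        fi
        i=$((i + 1))
    done
}

# Job scheduled for 00:05; the clock is stepped back 11 minutes at 00:12.
simulate 5 3 12 11 25
```

With these inputs the model fires twice, at real minutes 2 and 13: the wall clock passes 00:05, gets pulled back by eleven minutes, and passes 00:05 again eleven real minutes later. With the step moved out of range (for example `simulate 5 3 99 11 25`) it fires once.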
The eleven minutes was the size of the drift correction. The "only finance" detail was a red herring caused by how the mail merge batched recipients; both runs happened to include that person near a batch boundary, and the rest of the duplicates got swallowed by a downstream dedupe I'd forgotten existed. Of course they did.
So the root cause was a clock step, but I want to be careful here, because "fix NTP" is the trap. Yes, I configured ntpd to slew rather than step under normal conditions, and on later boxes I made sure corrections were applied gradually. That stops this particular trigger. But it does not make the job safe. A clock can step for a dozen reasons: a VM pausing and resuming, a hypervisor migration, a manual date set by someone in a hurry, a leap second handled badly. Any of those can re-fire a cron job, and chasing each cause individually is a losing game.
The real fix is to make the job not care how many times it runs. The mailer needed to know what it had already sent. A few lines did it: before sending, check for a marker keyed on the report's logical date and refuse to run if it already exists; after sending, write the marker.
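# Marker file named after the report's logical date (YYYY-MM-DD).
marker="/var/lib/daily-report/sent-$(date +%F)"
# If the marker exists, an earlier run already sent today's report.
if [ -e "$marker" ]; then
    logger "daily-report: already sent for today, skipping"
    exit 0
fi
# ... generate and send the report ...
# Mark the day as done only after the send step has run.
touch "$marker"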
That's idempotency in the cheapest possible form: a file named after the day. If the job runs twice within the same logical day, the second run finds the marker and bows out. It doesn't matter whether the second run came from a clock step, a duplicate crontab, a nervous operator running it by hand to "check it worked", or all three at once. The job has decided, once, what "today's report" means, and it will only do it once.
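One gap the check-then-touch version leaves open is overlap: if a second run starts before the first has finished sending, both pass the test. The following is a minimal sketch of one way to close it, assuming a local filesystem where `mkdir` either creates the directory or fails in a single step. `MARKER_DIR`, `send_report` and `run_once_per_day` are hypothetical names introduced for illustration, and `send_report` is only a stand-in for the real generate-and-send step.

```sh
#!/bin/sh
# Sketch only. MARKER_DIR, send_report and run_once_per_day are
# hypothetical names, not part of the original script.
MARKER_DIR="${MARKER_DIR:-/var/lib/daily-report}"

# Stand-in for the real generate-and-send step.
send_report() {
    echo "sending report for $1"
}

# run_once_per_day DAY   (DAY is the logical report date, YYYY-MM-DD)
run_once_per_day() {
    day=$1
    claim="$MARKER_DIR/sent-$day"
    mkdir -p "$MARKER_DIR" || return 1
    # Plain mkdir fails if the directory already exists, so testing for
    # the marker and creating it happen as one step; only one run can win.
    if ! mkdir "$claim" 2>/dev/null; then
        if [ -d "$claim" ]; then
            echo "daily-report: already claimed for $day, skipping"
            return 0
        fi
        return 1            # mkdir failed for some other reason
    fi
    if ! send_report "$day"; then
        rmdir "$claim"      # release the claim so a later run can retry
        return 1
    fi
}
```

The cron entry would then call something like `run_once_per_day "$(date +%F)"`. The trade-off against the original version is the order: here the marker is claimed before sending and released on failure, so a run that is killed mid-send leaves the claim in place and needs a human to clear it.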
The lesson is one I keep relearning in different costumes. The schedule is not a guarantee of how many times your code runs; it's a hint. Cron will fire your job when wall-clock time says so, and wall-clock time is not a monotonic, trustworthy thing on a real machine. If running twice does something a human would notice, like sending an email or charging a card, the job itself has to be the thing that enforces "once", because nothing underneath it will.
I left the NTP fix in, and I left the marker file in. The NTP fix made this specific Tuesday less likely. The marker file made it not matter.