The symptom was a number that was sometimes wrong. A nightly aggregation job rolled up the day's events into a summary table, and once or twice a week the totals came out roughly double. Not always, not predictably, and never on the days I was watching. The job logged "started" and "finished" cleanly each time, exit code 0, nothing in the mail. As far as the box was concerned, everything was fine.
The first thing I did was stop trusting the logs. They told me the job ran successfully. They did not tell me how many times it ran successfully on the same night, because each invocation logged to its own line and nothing tied them together. So I added a run ID and a hostname to every log line and waited.
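A minimal sketch of that kind of tagging, assuming a bash wrapper around the job. The `log_line` helper and the run ID format (timestamp plus PID) are illustrative choices, not the exact code from the job.

```shell
#!/usr/bin/env bash
# Sketch: tag every log line with a per-invocation run ID and the hostname.
# RUN_ID, HOST and log_line are hypothetical names chosen for this example.

RUN_ID="$(date +%Y%m%dT%H%M%S)-$$"   # start time plus PID: distinct per invocation on one host
HOST="$(uname -n)"                   # node name, same value `hostname` would report

log_line() {
    # One line per event: UTC timestamp, host, run ID, then the message.
    printf '%s host=%s run=%s %s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$HOST" "$RUN_ID" "$*"
}

log_line "started"
# run_aggregation would go here
log_line "finished"
```

With every line carrying a `run=` field, counting the distinct run IDs logged on a given night shows how many invocations there really were.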
Two nights later it happened. Two run IDs, same minute, same host, both completing. The job had genuinely been invoked twice. Now the question was why.
Two things can be true
It turned out two separate causes were conspiring, which is why it had been so hard to pin down.
The first was the cron setup itself. We'd recently moved the host into a config-managed setup, and the deployment had, at some point, ended up with the job defined in both /etc/crontab and a fragment under /etc/cron.d/. The two entries were identical, cron reads both locations, and so both fired. On most nights one of them lost a race for a lock file I'd added years ago and quietly exited, which is exactly why it only showed up "sometimes". The lock was masking a duplicate I didn't know existed.
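A quick way to check for this kind of duplicate is to search every system cron location for the job name at once. This is a rough sketch: `find_cron_entries` is a made-up helper, the search term `aggregate` is an assumption about how the job is named, and per-user crontabs are not covered.

```shell
#!/usr/bin/env bash
# Sketch: list every non-comment cron line that mentions a job, with its file.
# find_cron_entries is a hypothetical helper, not part of any cron package.

# find_cron_entries PATTERN FILE...
find_cron_entries() {
    local pattern="$1"
    shift
    # -H prefixes each match with its file name, even for a single file.
    # The second grep drops lines whose content starts with '#'.
    grep -H -- "$pattern" "$@" 2>/dev/null | grep -v '^[^:]*:[[:space:]]*#'
}

# On a real host this might be run as:
#   find_cron_entries aggregate /etc/crontab /etc/cron.d/*
```

More than one line of output for the same command means the job is scheduled in more than one place.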
The second was the lock itself. It was a naive flock on a file in /tmp, and /tmp was being cleaned by a systemd tmpfiles rule. flock holds its lock on the open file, not on the path name. If the cleanup deleted the file in the window between the two invocations, the second one created a fresh file at the same path, got a fresh lock on it, and both jobs ran to completion. Rare, but not rare enough.
# what I had, roughly
exec 9>/tmp/aggregate.lock   # open (or create) the lock file on fd 9
flock -n 9 || exit 0         # lock held elsewhere: exit silently
run_aggregation
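The failure mode is easy to reproduce in isolation. The sketch below assumes Linux with the util-linux `flock` command. `demo_lock` is a hypothetical helper, a `mktemp` file stands in for /tmp/aggregate.lock, and a plain `rm` stands in for the tmpfiles cleanup.

```shell
#!/usr/bin/env bash
# Sketch: show that a flock lock follows the open file, not the path name.
# demo_lock is a hypothetical helper written for this illustration.

# demo_lock yes|no   ("yes" deletes the lock file between the two attempts)
demo_lock() {
    local cleanup="$1" lock result
    lock="$(mktemp)"              # stand-in for /tmp/aggregate.lock
    exec 9>"$lock"
    flock -n 9 || return 1        # first invocation takes the lock
    if [ "$cleanup" = "yes" ]; then
        rm -f "$lock"             # stand-in for the tmpfiles cleanup
    fi
    exec 8>"$lock"                # second invocation opens, or recreates, the path
    if flock -n 8; then
        result="second run proceeds"
    else
        result="second run skipped"
    fi
    exec 8>&- 9>&-                # close both descriptors, releasing the locks
    rm -f "$lock"
    echo "$result"
}

demo_lock no
demo_lock yes
```

Without the cleanup, the second attempt opens the same file and is refused. With it, the path points at a new file, so the second attempt succeeds even though the first lock is still held.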
The fix was boring in the best way. One crontab entry, defined in exactly one place, enforced by the config management so it couldn't drift back. The lock moved off /tmp to a path nothing else touches, and I now check flock's exit status so that a failed lock acquisition logs a line rather than silently exiting, because "I couldn't get the lock" is information I actually want.
exec 9>/var/lib/aggregate/lock
if ! flock -n 9; then
    logger -t aggregate "another instance holds the lock, skipping"
    exit 0
fi
run_aggregation
The deeper lesson, and the one I keep relearning, is that "it logged success" is not the same as "it ran once". Idempotency would have saved me here too: if the aggregation had been written to produce the same answer whether it ran once or five times, the double-fire would have been harmless and I'd never have noticed. I've since made the job tear down the day's summary rows before rebuilding them, so a second run overwrites rather than adds. The duplicate crontab is gone, but I no longer rely on it being gone.
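The tear-down-then-rebuild pattern can be sketched in a few lines. To keep the example self-contained it uses a flat file as a stand-in for the summary table, which is an assumption made purely for illustration. `rebuild_summary` and the `DATE AMOUNT` line format are hypothetical. Against a real database the equivalent would be deleting the day's rows and inserting the fresh totals inside one transaction.

```shell
#!/usr/bin/env bash
# Sketch: an idempotent daily rollup. Running it twice gives the same summary.
# rebuild_summary and the file formats are hypothetical stand-ins for the real table.

# rebuild_summary DAY EVENTS_FILE SUMMARY_FILE
#   EVENTS_FILE lines:  "YYYY-MM-DD amount"
#   SUMMARY_FILE lines: "YYYY-MM-DD total"
rebuild_summary() {
    local day="$1" events="$2" summary="$3" tmp
    tmp="$(mktemp "${summary}.XXXXXX")" || return 1
    {
        # Tear down: keep every existing summary row except the day being rebuilt.
        [ -f "$summary" ] && awk -v d="$day" '$1 != d' "$summary"
        # Rebuild: recompute the day's total from the raw events.
        awk -v d="$day" '$1 == d { t += $2 } END { print d, t + 0 }' "$events"
    } > "$tmp"
    # Rename within the same directory so readers never see a half-built file.
    mv "$tmp" "$summary"
}
```

Because the day's total is always recomputed from the raw events rather than added to what is already there, a second run replaces the row instead of doubling it.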