The symptom was the worst kind: nothing was on fire, but every so often a nightly backup archive came out truncated or just plain corrupt. No error in the logs. The job reported success. The exit code was zero. And yet roughly one restore test in five failed with an unexpected end of archive.
I assumed a bad disk first, because you always do. SMART was clean. Then I assumed the archive tool and swapped it: no change. What finally gave it away was a pair of timestamps: two rsync processes, both spawned by cron, both writing to the same staging directory, overlapping by a few minutes.
what was actually happening
The backup had grown. When I wrote the cron entry the job took twenty minutes, so a nightly run at 02:00 was long done before the next one. Two years and a lot more data later, the job sometimes ran past 24 hours when the source was busy. Cron, being cron, doesn't care. It fires at 02:00 regardless of whether last night's run is still going. So on the bad nights, run N+1 started while run N was still writing, both pointed at the same staging directory, and they trampled each other's output. The "success" was the second process finishing cleanly on a directory the first had half-rewritten.
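If the "both succeed, output is still wrong" part sounds implausible, here's a toy reproduction. It's only a sketch: the run function, the labels, and the archive.part file name are made up for illustration, and the one-second sleeps stand in for hours of copying. Two runs append to the same staging file, the second starting while the first is mid-write.

```shell
#!/bin/sh
# Toy reproduction: two overlapping "runs" share one staging file.
# Everything here (names, timings) is illustrative, not the real backup job.
staging=$(mktemp -d)

run() {
    # $1 labels the run; each run writes three chunks, one per second.
    for chunk in 1 2 3; do
        echo "$1 chunk $chunk" >> "$staging/archive.part"
        sleep 1
    done
}

run first &            # run N, still writing...
first_pid=$!
sleep 1                # ...when cron fires run N+1
run second &
second_pid=$!

wait "$first_pid";  first_status=$?
wait "$second_pid"; second_status=$?

# Both runs report success, yet the file holds chunks from both of them.
echo "exit codes: $first_status $second_status"
cat "$staging/archive.part"
# (remove "$staging" by hand when you're done looking)
```

Neither run sees an error, because from its own point of view nothing went wrong. The damage only shows up in what's left in the shared directory.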
Cron has no concept of "don't start if the last one is still running". That's not a bug in cron, it's just not its job. The fix is a lock the job takes itself.
the fix
flock is built for exactly this, and since it ships with util-linux it's already on practically every Linux box:
# refuse to start if a previous run still holds the lock
0 2 * * * backup flock -n /var/lock/nightly-backup.lock /opt/backup/run.sh
The -n means non-blocking: if the lock is held, flock exits immediately rather than queueing up a second run to start the moment the first finishes. I'd rather skip a night and get alerted than silently stack runs. The lock is released automatically when the process dies, so a crashed job doesn't wedge the schedule forever.
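You can check the skip behaviour without waiting for a bad night. This is a minimal sketch, assuming util-linux flock: the lock is a throwaway temp file rather than the real one, and sleep 3 stands in for run.sh.

```shell
#!/bin/sh
# Sketch: flock -n refuses a second run while the first still holds the lock.
# The lock file is a temp file for the demo; `sleep 3` stands in for the backup.
lock=$(mktemp)

# First "run": takes the lock and holds it for three seconds.
flock -n "$lock" sleep 3 &
sleep 1    # crude, but gives the first run time to acquire the lock

# Second "run": the lock is held, so flock exits non-zero right away
# and never executes the command.
if flock -n "$lock" echo "second run started"; then
    second=ran
else
    second=skipped
fi
echo "second run: $second"

wait       # first run finishes, which releases the lock

# Third "run": the lock is free again, so the command executes.
if flock -n "$lock" true; then
    third=ran
else
    third=skipped
fi
echo "third run: $third"
rm -f "$lock"
```

The non-zero exit from the skipped run is what you hang the alert on. By default flock uses status 1 for "couldn't get the lock", and -E lets you pick a different code if you need to tell a skip apart from a failure of the job itself.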
I also added the obvious thing I should have had from the start: a log line with the run duration, and an alert if it ever exceeds half the interval. The real bug wasn't the overlap, it was that the job had quietly grown from twenty minutes to more than a day over two years and nothing was watching that number. The overlap was just the symptom that finally got loud enough, by way of corrupt archives, for me to notice.
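The duration check is only a few lines. What follows is a sketch of the idea rather than my actual wrapper: the INTERVAL and LIMIT variables, the over_limit helper, and the message format are all invented for the example, sleep 1 is a placeholder for the real job, and date +%s is assumed to be available (it is on GNU and BSD date).

```shell
#!/bin/sh
# Sketch: time a job, log the duration, and warn past half the schedule interval.
# Names and messages are illustrative; `sleep 1` stands in for the real job.
INTERVAL=86400                   # nightly schedule, in seconds
LIMIT=$((INTERVAL / 2))          # warn once a run takes more than 12 hours

# Succeeds when a duration in seconds is over the limit.
over_limit() {
    [ "$1" -gt "$LIMIT" ]
}

start=$(date +%s)
sleep 1                          # placeholder for /opt/backup/run.sh
status=$?
elapsed=$(( $(date +%s) - start ))

# One log line per run, so the trend is visible long before it hurts.
echo "nightly-backup: exit=$status duration=${elapsed}s"
if over_limit "$elapsed"; then
    echo "nightly-backup: ALERT duration ${elapsed}s exceeds ${LIMIT}s" >&2
fi
```

Half the interval is a deliberately early tripwire: at that point there's still plenty of slack to fix the job before a run actually collides with the next one.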
The general rule I took away: any job that can possibly run longer than its own interval needs a lock, and you should assume every long-lived periodic job will eventually grow into that situation whether you planned for it or not.