The backups were fine right up until they weren't. Every night a cron job tarred up a data directory, gzipped it, and shipped it offsite. Restores worked, the file sizes looked sensible, nobody thought about it. Then one morning a restore test failed on a truncated archive, and when I checked, three of the last ten nightlies were short. Not zero bytes, which would have been an obvious failure, but short and broken in a way that only showed up when you actually tried to read them back. The worst kind of backup: the one that exists, looks plausible, and is useless.
The job itself was simple and, I thought, correct. It ran at 02:00, it had run for over a year, and as far as I remembered the crontab had one line. My first suspicion, a duplicate entry on another host, was the same wrong guess I always make, and it was wrong again.
What had changed was the data. The directory had grown. The backup used to finish in twenty minutes; now, on a busy night, it was taking well over an hour. And the job didn't only run at 02:00.
0 2 * * * /usr/local/bin/nightly-backup.sh
0 3 * * * /usr/local/bin/nightly-backup.sh
Someone, possibly past me, had added a second schedule at 03:00 as a "catch-up" in case the first failed. When the backup was quick that was harmless: the 02:00 run finished long before 03:00 started. But once the job started running over an hour, the 02:00 run was still writing its archive when the 03:00 run woke up, opened the same output file, and started writing its own. Two processes, one file, both convinced they owned it. Opening the file a second time truncated whatever the first run had written, and from then on each process wrote at its own offset, over the top of the other. The result depended on how far each run had got and which one finished last. Sometimes one won cleanly. Sometimes you got a Frankenstein archive that gzip would happily produce and never read back.
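The collision is easy to reproduce in miniature. This is a sketch, not the real job: two file descriptors in one shell stand in for the two runs, the function name clobber_demo is made up, and it assumes the archive was opened the ordinary way, with truncation and without append mode.

```bash
#!/bin/bash
# Hypothetical demo: fd 3 plays the 02:00 run, fd 4 plays the 03:00 run.
clobber_demo() {
    local f="$1"
    exec 3>"$f"                # first run opens the file, truncating it
    printf 'AAAAAAAAAA' >&3    # first run writes 10 bytes at offsets 0-9
    exec 4>"$f"                # second run opens the same path: truncates again
    printf 'bbb' >&4           # second run writes at its own offset, 0
    printf 'AAAA' >&3          # first run carries on at its old offset, 10
    exec 3>&- 4>&-             # both runs close the file and "succeed"
}
```

Run against a scratch path, the file ends up 14 bytes long: bbb, seven zero bytes where the first run's data used to be, then AAAA. Neither writer sees an error at any point.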
The "only sometimes" was just timing. On a quiet night the first run finished in time and there was no overlap. On a busy night it didn't, and the two collided. Nothing in the logs complained, because nothing was checking. Both runs exited zero. Both runs thought they'd done their job.
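Something could have been checking. A minimal sketch of a read-back test, assuming GNU tar and gzip; the helper name verify_archive is mine and was not part of the original job:

```bash
#!/bin/bash
# Hypothetical helper: succeeds only if the archive reads back end to end.
verify_archive() {
    local archive="$1"
    # gzip -t decompresses to nowhere and checks the CRC and length trailer,
    # so a truncated or overwritten file fails here.
    gzip -t "$archive" 2>/dev/null || return 1
    # tar -tzf then walks every member header, so a malformed tar stream
    # inside otherwise valid gzip fails too.
    tar -tzf "$archive" >/dev/null 2>&1 || return 1
}
```

A job could call this on a fresh archive before shipping it offsite and exit non-zero if the check fails, so that at least something complains.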
The proper fix has two parts. First, the daft second schedule went away; a catch-up that runs unconditionally an hour later isn't a safety net, it's a second loaded gun. Second, and more importantly, the job now refuses to run if another copy of itself is already running. That takes a couple of lines with flock:
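#!/bin/bash
# Hold the lock file open on fd 200 for the life of the script.
exec 200>/var/lock/nightly-backup.lock
# -n means don't wait: if another run holds the lock, log it and leave.
flock -n 200 || { logger "nightly-backup: already running, skipping"; exit 0; }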
# ... do the backup, writing to a temp file then renaming atomically ...
out="/backups/data-$(date +%F).tar.gz"
tar czf "$out.tmp" /srv/data && mv "$out.tmp" "$out"
Two things are doing the work here. flock -n tries to take an exclusive lock on the file open on descriptor 200, without blocking; if a previous run still holds it, the new run bails immediately rather than piling in. The lock is released when the script exits and the descriptor closes, so a crashed run doesn't leave a stale lock behind. And writing to .tmp then renaming means the final filename only ever appears once the archive is complete, because rename is atomic on the same filesystem. A half-written backup never gets the real name, so a downstream consumer can never pick one up mid-flight.
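The same locking pattern can be pulled out into a wrapper to watch it behave. A sketch, assuming the flock utility from util-linux is installed; the name run_exclusive and the choice of exit code 75 are mine, where the real script simply exits 0 on a skip:

```bash
#!/bin/bash
# Hypothetical helper: run a command only if nobody else holds the lock.
# Usage: run_exclusive LOCKFILE COMMAND [ARGS...]
run_exclusive() {
    local lockfile="$1"; shift
    (
        # Open the lock file on fd 9, for this subshell only.
        exec 9>"$lockfile"
        # -n: fail straight away instead of queueing behind the holder.
        # 75 is EX_TEMPFAIL in sysexits.h, used here so a caller can tell
        # a skipped run from a successful one.
        flock -n 9 || { echo "skipped: lock held" >&2; exit 75; }
        "$@"
    )   # fd 9 closes when the subshell exits, which releases the lock
}
```

Start run_exclusive on a scratch lock file with a slow command in the background, call it again on the same lock file while the first is still going, and the second call returns 75 without running anything. Once the first finishes, the next call goes through.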
The thing I keep relearning is that a cron schedule says when a job starts, never how long it runs or whether the last one has finished. The moment a job can take longer than the gap between its invocations, overlap stops being theoretical. If two copies running at once would step on each other, and almost anything writing to a fixed path will, the job has to lock itself. Cron won't do it for you. It'll just start the next one, cheerfully, on time, straight into the back of the one already running.