For years my backups passed the only test I ever gave them: they ran without erroring. The job finished, the log said "done", and I felt protected. Then I actually tried to restore something, found half of it wasn't there, and realised that an untested backup is just a hopeful collection of bytes.
So I rebuilt the whole thing around one rule: a backup I have not restored from is not a backup. Everything below follows from that.
The setup
I use restic, pointed at two destinations: a local repo on the NAS and an offsite one in object storage. restic's deduplication and encryption mean I stopped caring about the storage cost of keeping a lot of snapshots, and the snapshot model maps neatly onto how I actually think about recovery.
The backup itself is unremarkable:
# Take the nightly snapshot
restic backup /srv /home/johnm/vault \
    --tag nightly \
    --exclude-file /etc/restic/excludes.txt
# Thin out old snapshots and reclaim the space they used
restic forget --prune \
    --keep-daily 7 --keep-weekly 4 --keep-monthly 6
Nothing clever. The cleverness, such as it is, lives in the verification.
The bit that matters
Once a week, a separate job picks a random snapshot, restores a known subset of files into a scratch directory, and compares them against checksums I recorded when the data was healthy. If the hashes match, it writes a success metric. If they don't, it shouts.
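# "latest" is shown for brevity; pass a snapshot ID to test an older one
restic restore latest \
    --target /tmp/restore-check \
    --include /srv/canary
# Check the restored copies, not the live files. This assumes canary.sha256
# lists paths relative to the restore target (srv/canary/...).
(cd /tmp/restore-check && sha256sum -c /etc/restic/canary.sha256)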
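A minimal sketch of how the weekly job could be wired together, not the exact script. It assumes jq and GNU shuf are installed, that the repository and password come from the usual RESTIC_REPOSITORY and RESTIC_PASSWORD_FILE environment variables, and that canary.sha256 lists paths relative to the restore target. The function names, the metric name and the metric file path are hypothetical; swap in whatever your monitoring actually reads.

```shell
#!/bin/sh
# Sketch of a weekly restore check. Paths and the metric name are hypothetical.

# Pick one snapshot ID at random from the repository (needs jq and GNU shuf).
pick_snapshot() {
    restic snapshots --json | jq -r '.[].id' | shuf -n 1
}

# Verify restored files against the recorded checksums.
# $1 = restore target, $2 = absolute path to a checksum file whose
# entries are relative to $1.
verify_restore() {
    (cd "$1" && sha256sum -c "$2")
}

# Record the time of the last good check as a one-line "name value" metric.
# $1 = metric file (assumed to be picked up by your monitoring agent).
write_success_metric() {
    printf 'restore_check_last_success_timestamp_seconds %s\n' "$(date +%s)" > "$1"
}

# Full run: restore the canary files from a random snapshot, then verify.
# On failure the scratch directory is left in place for inspection.
restore_check() {
    target=$(mktemp -d) || return 1
    snapshot=$(pick_snapshot)
    [ -n "$snapshot" ] || { echo "no snapshot found" >&2; return 1; }
    restic restore "$snapshot" --target "$target" --include /srv/canary || return 1
    verify_restore "$target" /etc/restic/canary.sha256 || return 1
    write_success_metric /var/lib/node_exporter/restore_check.prom
    rm -rf "$target"
}
```

Calling restore_check from a weekly cron job or systemd timer would do the whole round trip. A non-zero exit status is the "shout": the metric simply stops being refreshed, which is what the alerting keys on.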
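I also run restic check --read-data-subset 5% regularly so a slice of the actual data blocks gets read back and verified, not just the metadata. The percentage form picks a random 5% of the pack files on each run, so coverage builds up probabilistically over the weeks rather than being guaranteed, but it does so without hammering the offsite bandwidth all at once. If you want a guaranteed full pass, the n/t form (1/20, then 2/20, and so on) walks through fixed slices in turn.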
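If guaranteed coverage matters more than randomness, restic's n/t form of --read-data-subset can be rotated on a schedule. The sketch below derives the slice from the ISO week number; the function names and the choice of 20 slices are illustrative assumptions, not part of the setup described here.

```shell
#!/bin/sh
# Sketch: rotate through fixed n/t slices so every pack file is read once per cycle.

# Print the --read-data-subset argument for a given week.
# $1 = ISO week number as printed by `date +%V` (e.g. "08"), $2 = number of slices.
subset_for_week() {
    week=${1#0}                      # drop a leading zero so "08" is not read as octal
    echo "$(( week % $2 + 1 ))/$2"   # slices are numbered 1..t
}

# Read one slice of the repository's data; 20 slices is an arbitrary choice.
check_this_weeks_slice() {
    restic check --read-data-subset "$(subset_for_week "$(date +%V)" 20)"
}
```

With 20 slices, any 20 consecutive weeks within a year touch every slice exactly once. The sequence jumps at the year boundary, so a slice can occasionally wait a little longer than one cycle.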
Making it impossible to ignore
The failure mode of any verification system is that you stop looking at it. So the restore check publishes a single metric to my monitoring, and if it hasn't reported success in 48 hours, I get paged. Not an email I'll archive unread. An actual page.
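The 48-hour rule boils down to a staleness check on the last success timestamp. Here is a rough sketch, assuming the restore job records its last success as a single "name value" line in a file. That file path, the function names and the paging hook are all hypothetical; in practice this logic would more likely live in the monitoring system's own alert rules.

```shell
#!/bin/sh
# Sketch of a staleness check: alert when the last success is too old or missing.

# Return 0 (stale) if the last success is older than the allowed age.
# $1 = last success in epoch seconds (empty if never), $2 = now, $3 = max age in seconds.
is_stale() {
    [ -z "$1" ] && return 0          # no recorded success counts as stale
    [ $(( $2 - $1 )) -gt "$3" ]
}

# Print the timestamp from a one-line metric file of the form "name value".
last_success() {
    [ -r "$1" ] || return 1
    read -r _name value < "$1"
    printf '%s\n' "$value"
}

# Complain if the restore check has not succeeded within 48 hours.
check_restore_freshness() {
    if is_stale "$(last_success /var/lib/node_exporter/restore_check.prom)" \
                "$(date +%s)" $(( 48 * 3600 )); then
        echo "restore check has not succeeded in 48 hours" >&2
        return 1                     # hook the actual paging call in here
    fi
}
```

Treating a missing or unreadable file as stale is deliberate: a check that silently never ran should page just as loudly as one that ran and failed.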
There's also a quarterly reminder in my calendar to do a manual restore: pull a real file I care about, from the offsite repo, onto a machine that isn't the one that made the backup. The automation tests the mechanism. The manual drill tests me, and whether I still remember the passphrase and the repo URL when I'm half-asleep and stressed.
The whole point is that the first time you restore from a backup should never be the time you actually need it. Mine have failed their tests twice since I set this up, both times for boring reasons (a changed exclude rule, an expired credential), and both times I'd rather have found out on a Tuesday afternoon than during a genuine disaster.