Ramblings of an aging IT geek
homelab

Backups I Actually Test Now

Moving from backups I hoped would work to backups I verify on a schedule, with restic, automated restore checks, and a calendar reminder I can't ignore.

For years my backups passed the only test I ever gave them: they ran without erroring. The job finished, the log said "done", and I felt protected. Then I actually tried to restore something, found half of it wasn't there, and realised that an untested backup is just a hopeful collection of bytes.

So I rebuilt the whole thing around one rule: a backup I have not restored from is not a backup. Everything below follows from that.

The setup

I use restic, pointed at two destinations: a local repo on the NAS and an offsite one in object storage. restic's deduplication and encryption mean I stopped caring about the storage cost of keeping a lot of snapshots, and the snapshot model maps neatly onto how I actually think about recovery.
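
A minimal sketch of how two destinations could be driven from one script. The repository locations and the for_each_repo helper are hypothetical, and it assumes RESTIC_PASSWORD_FILE and any object-storage credentials are already exported in the environment. RESTIC_REPOSITORY is the environment variable restic reads when no repository is given on the command line.

```shell
#!/bin/sh
# Hypothetical repository locations; substitute your own.
LOCAL_REPO="/mnt/nas/restic-repo"
OFFSITE_REPO="s3:s3.amazonaws.com/example-backup-bucket"

# for_each_repo SUBCOMMAND [ARGS...]
# Run the same restic subcommand against each repository in turn.
# Every repository is attempted even if an earlier one fails; the
# function returns non-zero if any of them failed.
for_each_repo() {
    rc=0
    for repo in "$LOCAL_REPO" "$OFFSITE_REPO"; do
        RESTIC_REPOSITORY="$repo" restic "$@" || rc=1
    done
    return "$rc"
}

# Example:
#   for_each_repo backup /srv /home/johnm/vault --tag nightly
```

Trying both repositories before reporting failure means an outage at one destination does not silently skip the other.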

The backup itself is unremarkable:

restic backup /srv /home/johnm/vault \
  --tag nightly \
  --exclude-file /etc/restic/excludes.txt

restic forget --prune \
  --keep-daily 7 --keep-weekly 4 --keep-monthly 6

Nothing clever. The cleverness, such as it is, lives in the verification.

The bit that matters

Once a week, a separate job picks a random snapshot, restores a known subset of files into a scratch directory, and compares them against checksums I recorded when the data was healthy. (The command below uses latest to keep the example short; picking at random also exercises older snapshots.) If the hashes match, it writes a success metric. If they don't, it shouts.

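rm -rf /tmp/restore-check   # no stale files from the previous run
restic restore latest \
  --target /tmp/restore-check \
  --include /srv/canary

# Assumes canary.sha256 lists paths relative to / (srv/canary/...),
# so verify from the restore root, not against the live files.
cd /tmp/restore-check && sha256sum -c /etc/restic/canary.sha256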
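
The comparison only means something if the checksums are verified against the restored copy and not the live files, which comes down to how paths are stored in the checksum file. Here is a sketch of recording and verifying with relative paths. The record_canary and verify_restore helpers are hypothetical names, GNU coreutils sha256sum is assumed, and the manifest path must be absolute because both helpers change directory.

```shell
#!/bin/sh
# record_canary ROOT SUBDIR MANIFEST
# Hash every file under ROOT/SUBDIR, storing paths relative to ROOT
# so the manifest can later be checked from a different root.
record_canary() {
    ( cd "$1" && find "$2" -type f -exec sha256sum {} + ) > "$3"
}

# verify_restore RESTORE_ROOT MANIFEST
# Check the restored copy against the manifest. Returns non-zero on
# any mismatch or missing file; --quiet prints only the failures.
verify_restore() {
    ( cd "$1" && sha256sum --quiet -c "$2" )
}

# Example, run once while the data is known to be healthy:
#   record_canary / srv/canary /etc/restic/canary.sha256
# and after each test restore:
#   verify_restore /tmp/restore-check /etc/restic/canary.sha256
```

Because a missing file counts as a failure, a canary that quietly dropped out of the backup fails the check just as a corrupted one would.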

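I also run restic check --read-data-subset 5% regularly so a slice of the actual data blocks gets read back and verified, not just the metadata. The percentage form reads a random 5% of the pack files on each run, so coverage builds up over the weeks without hammering the offsite bandwidth all at once, but it is never guaranteed to reach every pack. For guaranteed full coverage, restic's n/t form (1/20 on one run, 2/20 on the next, and so on) walks through fixed groups instead.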
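
For a rotation that is guaranteed to cover the whole repository, restic check also accepts --read-data-subset=n/t, which reads group n out of t fixed groups of pack files. A sketch of deriving n from the date, assuming one run per day and 20 groups; the helper names are hypothetical.

```shell
#!/bin/sh
# subset_for_day DAY_NUMBER TOTAL
# Map a running day count onto restic's n/t subset syntax, cycling
# through groups 1..TOTAL so each group is read once per cycle.
subset_for_day() {
    echo "$(( $1 % $2 + 1 ))/$2"
}

# check_todays_slice TOTAL
# Read and verify today's group of pack files.
check_todays_slice() {
    # Whole days since the Unix epoch; increases by one each day.
    day=$(( $(date +%s) / 86400 ))
    restic check --read-data-subset="$(subset_for_day "$day" "$1")"
}

# Example: with 20 groups each run reads roughly 5% of the pack
# files, and 20 consecutive daily runs cover all of them.
#   check_todays_slice 20
```

Using the day count avoids keeping state between runs; a skipped day just means that group waits until the next cycle.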

Making it impossible to ignore

The failure mode of any verification system is that you stop looking at it. So the restore check publishes a single metric to my monitoring, and if a scheduled check hasn't reported success within 48 hours of when it was due, I get paged. Not an email I'll archive unread. An actual page.
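
The useful property here is alerting on the absence of success, not on the presence of a failure, so a job that never ran at all still triggers the page. A sketch of that idea, assuming a Prometheus node_exporter textfile collector purely for illustration. The post does not say which monitoring system is in use, and the metric name, directory, and helper names are all hypothetical.

```shell
#!/bin/sh
# Hypothetical metric name and textfile directory; adjust to your setup.
METRIC="restic_restore_check_last_success_timestamp_seconds"
TEXTFILE_DIR="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile}"

# record_success
# Record "the check succeeded just now". Write to a temp file and
# rename it so a reader never sees a half-written file.
record_success() {
    tmp="$TEXTFILE_DIR/restore_check.prom.$$"
    printf '%s %s\n' "$METRIC" "$(date +%s)" > "$tmp" &&
        mv "$tmp" "$TEXTFILE_DIR/restore_check.prom"
}

# is_stale FILE MAX_AGE_SECONDS [NOW]
# Succeeds (exit 0) when the last recorded success is older than
# MAX_AGE_SECONDS, or when no success was ever recorded.
is_stale() {
    [ -r "$1" ] || return 0
    last=$(awk '{ print $2; exit }' "$1")
    [ -n "$last" ] || return 0
    now=${3:-$(date +%s)}
    [ $(( now - last )) -gt "$2" ]
}

# Example: call record_success only after the checksum comparison
# passes, then alert whenever is_stale reports true.
```

In a real deployment the staleness test would normally live in the monitoring system's alert rule; is_stale only shows the comparison being made.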

There's also a quarterly reminder in my calendar to do a manual restore: pull a real file I care about, from the offsite repo, onto a machine that isn't the one that made the backup. The automation tests the mechanism. The manual drill tests me, and whether I still remember the passphrase and the repo URL when I'm half-asleep and stressed.

The whole point is that the first time you restore from a backup should never be the time you actually need it. Mine have failed their tests twice since I set this up, both times for boring reasons (a changed exclude rule, an expired credential), and both times I'd rather have found out on a Tuesday afternoon than during a genuine disaster.