the backups I finally bothered to test

For years my backups were a lie I told myself. There was a cron job. It ran. It reported success. I never once restored from it. A backup you have never restored is not a backup, it is a hopeful gesture, and I knew that, and I ignored it because restoring is tedious and the job said "OK" every night.

The thing that changed my mind was nearly losing a database. Not to a disk failure, but to my own clumsiness. I ran a migration against the wrong host, realised mid-command, and had that cold moment of "right, can I actually get this back?" The answer turned out to be yes, but only after an hour of fumbling that should have taken five minutes. The data was fine. My confidence was not.

what I changed

I tore out the old shell scripts and rebuilt everything around restic. I had resisted it for ages out of pure inertia, and I was wrong to. It does deduplication, encryption, and snapshots, and it treats restoring as a first-class operation rather than an afterthought.

The setup is unremarkable, which is the point. A repository on a local NAS for fast restores, and a second copy pushed to object storage offsite. Same tool, two destinations.

restic -r /mnt/nas/restic backup /srv /etc /home --tag nightly
restic -r s3:s3.example.net/backups backup /srv /etc /home --tag nightly

Retention is handled by forget with a policy I can actually reason about:

restic -r /mnt/nas/restic forget \
  --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune

Seven dailies, four weeklies, six monthlies. That covers "I deleted it this morning" and "this file quietly rotted three months ago and nobody noticed" without keeping everything forever.

the bit that actually matters

None of the above is new or clever. The change that made the difference is that I now test restores on a schedule. On the first weekend of the month I pick a snapshot at random and restore part of it to a scratch directory, then diff it against the live data.

restic -r /mnt/nas/restic restore latest \
  --target /tmp/restore-test --include /etc
diff -r /etc /tmp/restore-test/etc
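
The command above restores latest, which is the easy case. To make the pick genuinely random, one option is to choose from the snapshot list first. A rough sketch, assuming GNU shuf is available and that the output of restic snapshots --json carries a short_id field for each snapshot; pick_random_snapshot and restore_drill are names made up for illustration, not part of restic:

```shell
#!/bin/sh
# Read `restic snapshots --json` output on stdin and print one short
# snapshot ID chosen at random. Prints nothing if no IDs are found.
pick_random_snapshot() {
    grep -oE '"short_id"[[:space:]]*:[[:space:]]*"[^"]+"' |
        cut -d'"' -f4 |
        shuf -n 1
}

# Hypothetical drill: restore one path from a random snapshot in the
# given repository into a scratch directory.
restore_drill() {
    repo=$1
    target=$2
    path=$3
    snap=$(restic -r "$repo" snapshots --json | pick_random_snapshot)
    if [ -z "$snap" ]; then
        echo "no snapshots found in $repo" >&2
        return 1
    fi
    echo "restoring $path from snapshot $snap" >&2
    restic -r "$repo" restore "$snap" --target "$target" --include "$path"
}

# Example call, using the paths from the post:
#   restore_drill /mnt/nas/restic /tmp/restore-test /etc
```

An older snapshot will legitimately differ from the live /etc, so the diff afterwards is something to read rather than a strict pass or fail. If jq is installed, it would be a sturdier way to pull the IDs out of the JSON than grep.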

The first time I ran this drill properly I found two problems. One path was being excluded by a stale rule I had forgotten about, so a config directory was not in the backups at all. And the offsite repository had not completed a successful run in eleven days because the S3 credentials had expired and the error was buried in a log nobody read. The nightly job had been cheerfully reporting success for the local copy while the offsite copy quietly stopped.

That is the whole argument for testing restores. The backup job tells you it wrote something. It does not tell you that what it wrote is complete, or that you can get it back, or that the second copy you are relying on for the house-fire scenario is even running. Only a restore tells you that.
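
The restore drill is what found the dead offsite copy, but that particular failure can also be made loud at the source. A minimal sketch of a nightly wrapper that attempts every destination and exits non-zero if any of them fails, so cron (or whatever runs it) has something to complain about. The function names and the PATHS variable are hypothetical; the restic commands are the ones from earlier in the post:

```shell
#!/bin/sh
# Paths to back up, as in the commands earlier in the post.
PATHS="/srv /etc /home"

# Back up to one repository and report the result.
backup_to() {
    repo=$1
    # $PATHS is deliberately unquoted so it splits into separate arguments.
    if restic -r "$repo" backup $PATHS --tag nightly; then
        echo "OK: $repo"
    else
        echo "FAILED: $repo (restic exit code $?)" >&2
        return 1
    fi
}

# Try every repository even if an earlier one fails, then return
# non-zero if any of them did.
nightly_backup() {
    failed=0
    for repo in "$@"; do
        backup_to "$repo" || failed=1
    done
    return "$failed"
}

# Example call, using the two repositories from the post:
#   nightly_backup /mnt/nas/restic s3:s3.example.net/backups || exit 1
```

This only reports failures of runs that actually happen. It does not replace the restore drill, which is still the only thing that shows the data can be brought back.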

verification, automated

I also turned on restic check with a subset read, which actually pulls a sample of the data back and verifies it against the stored hashes rather than just checking the metadata:

restic -r /mnt/nas/restic check --read-data-subset=10%

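With a percentage, restic picks a random ten percent of the pack files on each run, so the load is spread out and does not hammer the disks. Ten runs are not guaranteed to touch everything, though: independent random samples overlap, and on average ten of them reach only about two thirds of the packs. For a guarantee, the n/t form (--read-data-subset=1/10 through 10/10) splits the packs into ten groups, and working through all ten covers the whole repository. Either way, if a pack has gone bad on the underlying storage, this is what catches it before I need that data in anger.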
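
One way to get deterministic coverage out of the subset read is the n/t form of --read-data-subset, which reads the n-th of t groups of pack files. A minimal sketch that rotates through the groups, assuming the check runs once a day; subset_for_run is a hypothetical helper, not part of restic:

```shell
#!/bin/sh
# Print the --read-data-subset value for a given run counter, cycling
# through 1/total, 2/total, ... total/total and then starting over.
subset_for_run() {
    run=$1
    total=$2
    echo "$(( run % total + 1 ))/$total"
}

# Hypothetical nightly use: days since the Unix epoch as the run counter,
# so ten consecutive nights work through all ten groups.
#   day=$(( $(date +%s) / 86400 ))
#   restic -r /mnt/nas/restic check --read-data-subset="$(subset_for_run "$day" 10)"
```

Using the day count rather than a stored counter keeps the script stateless, at the cost of skipping a group whenever a night is missed.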

the routine now

So the shape of it is: nightly backups to two destinations, monthly random restore drill, rolling integrity check on the repository. It takes me about ten minutes a month of actual attention, and in exchange I have moved from "I think I have backups" to "I restored from this one on Saturday and it was fine".

The migration-against-the-wrong-host moment will happen again, because I am the kind of person it happens to. The difference is that next time I will not have to wonder whether the safety net is real. I checked it last weekend. It holds.