Ramblings of an aging IT geek
homelab

Backups I actually test now

After a year of running backups I never verified, I rebuilt the home lab backup setup around restores that get tested automatically instead of hoped for.

A server rack with a single drive activity light glowing

For most of last year I had backups in the way that most people have backups: a cron job that ran something, a remote directory that filled up, and a comfortable, entirely unearned belief that if the worst happened I'd be fine. I had never once restored from them. A backup you have never restored is not a backup, it's a hope with a timestamp.

The reckoning came when I needed to roll a service back, reached for the archive, and found that the job had been failing silently for weeks because a credential had rotated and nothing told me. The data I wanted was simply not there. Nothing was lost that mattered this time, but it was a clear enough warning that I spent the holiday rebuilding the whole thing properly.

What I changed

The core shift was moving from "make copies" to "prove I can get the data back". I switched the home lab backups to restic against a couple of destinations: one local on a separate ZFS pool, one off-box. restic gives me content-addressed deduplication and a check command that actually validates the repository integrity, which on its own is more verification than my old setup ever did.

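# back up service data and system config to the local repository
restic backup /srv /etc --repo /mnt/backups/restic
# check repository structure, then read back a random 5% of the pack data
restic check --read-data-subset=5% --repo /mnt/backups/restic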

The --read-data-subset flag on check matters. Rather than trusting the index alone, it reads back a sample of the actual pack data and verifies it against its checksums. I run the full structural check nightly and the sampled data check weekly.
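For illustration, the nightly and weekly checks could share one small wrapper that a scheduler calls with a mode. This is a rough sketch, not my exact script: the mode names (structural, sampled) are made up for the example, and it assumes restic finds its password through the usual environment variables such as RESTIC_PASSWORD_FILE.

```shell
#!/bin/sh
# Sketch of a wrapper for the two scheduled checks. The repository path and
# the 5% subset come from the post; the mode names are assumptions.
REPO="${REPO:-/mnt/backups/restic}"

# run_check MODE: "structural" checks repository structure only,
# "sampled" additionally reads back 5% of the pack data.
run_check() {
    case "$1" in
        structural) restic check --repo "$REPO" ;;
        sampled)    restic check --read-data-subset=5% --repo "$REPO" ;;
        *)          echo "unknown mode: $1" >&2; return 2 ;;
    esac
}

# Only act when a mode is passed on the command line, and fail loudly:
# a non-zero exit plus a message on stderr, so cron or a monitor can pick it up.
if [ "$#" -gt 0 ]; then
    run_check "$1" || { echo "restic check ($1) FAILED for $REPO" >&2; exit 1; }
fi
```

A nightly cron entry would pass structural and a weekly one sampled. Keeping the 5% inside the script also sidesteps cron's special treatment of an unescaped % in the command field.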

A diagram of backup sources flowing to local and remote repositories

The part I'd skipped: actually restoring

Verifying the repository is good, but it still doesn't answer the only question that counts, which is "can I stand the service back up from this". So the new setup does a real restore on a schedule, into a throwaway container, with nobody watching.

A small script pulls the latest snapshot of a service into a scratch directory, brings it up in an isolated container, and checks that the application actually starts and answers. If the restore fails, the repository is corrupt, or the restored service won't run, I get told the same way I get told about anything else that's broken. Crucially, this fails loudly. The whole disaster last year was a thing failing quietly.

restic restore latest --repo /mnt/backups/restic \
  --target /scratch/restore-test --include /srv/myapp
# then: spin up a container against /scratch/restore-test and curl its health endpoint
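To make that last comment concrete, here is a minimal sketch of what the whole restore test could look like. It assumes Docker as the container runtime, and the image name, the port mapping and the /health path are all hypothetical placeholders. Only the repository, the target directory and /srv/myapp come from the commands above. restic restores files under their original path, so the data ends up in /scratch/restore-test/srv/myapp.

```shell
#!/bin/sh
# Sketch of a scheduled restore test. REPO, TARGET and /srv/myapp come from the
# post; Docker, IMAGE, the port mapping and the /health path are hypothetical.
REPO="${REPO:-/mnt/backups/restic}"
TARGET="${TARGET:-/scratch/restore-test}"
IMAGE="${IMAGE:-myapp:latest}"                               # hypothetical image
HEALTH_URL="${HEALTH_URL:-http://127.0.0.1:18080/health}"    # hypothetical endpoint
NAME="restore-test"

# wait_healthy N: poll the health endpoint up to N times, 2 seconds apart.
wait_healthy() {
    attempts="$1"
    while [ "$attempts" -gt 0 ]; do
        if curl --fail --silent --max-time 5 "$HEALTH_URL" >/dev/null; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 2
    done
    return 1
}

# restore_test: restore into a clean scratch directory, start the service
# from the restored data, and report whether it answers.
restore_test() {
    rm -rf "$TARGET" && mkdir -p "$TARGET" || return 1
    restic restore latest --repo "$REPO" \
        --target "$TARGET" --include /srv/myapp || return 1
    docker run -d --rm --name "$NAME" \
        -v "$TARGET/srv/myapp:/srv/myapp" \
        -p 127.0.0.1:18080:8080 "$IMAGE" >/dev/null || return 1
    wait_healthy 15
    status=$?
    docker stop "$NAME" >/dev/null 2>&1   # always clean up the throwaway container
    return "$status"
}

# Run only when invoked with "run"; any failure exits non-zero with a message.
if [ "${1:-}" = "run" ]; then
    restore_test || { echo "RESTORE TEST FAILED for $REPO" >&2; exit 1; }
fi
```

Every step returns non-zero on failure, so a failed restore, a container that won't start and a service that never answers all end in the same loud exit for the scheduler or monitoring to report.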

What it bought me

Three things, really. I now know my backups restore, because a machine demonstrates it every week without my involvement. I find out within a day when something breaks, instead of on the morning I desperately need the data. And I sleep better, which sounds soft but is the actual point of all of this.

The old setup wasn't lazy, it was optimistic, and optimism is a wonderful trait everywhere except disaster recovery. The only backup you can trust is one you have watched come back to life. So now I make mine do it on a timer, whether I'm paying attention or not.