Ramblings of an aging IT geek
homelab

the backup you never restore is just hope on a disk

How I moved from a homelab backup job I assumed worked to one I restore from monthly, with restic, healthchecks pings, and a calendar reminder I can't ignore.

For about three years I had backups that I was fairly sure worked. That phrase, "fairly sure", is the whole problem. I had a restic repo, a cron job, a green tick in a log somewhere. What I did not have was a single instance of actually pulling data back out and checking it was the thing I put in. Schrödinger's backup: simultaneously fine and a smoking crater until you open the box.

The catalyst was boring. I went to grab an old docker-compose.yml for a service I'd torn down, found it in the backup listing, restored it, and it was empty. Not corrupt, not missing, just zero bytes. The cron job had been backing up a bind mount that hadn't been mounted at the time it ran, so it dutifully archived nothing, every night, with a cheerful exit code 0.

So I rebuilt the whole thing around one rule: a backup I haven't restored from doesn't count. Here's roughly what runs now.

#!/usr/bin/env bash
set -euo pipefail

export RESTIC_REPOSITORY="b2:i0-homelab-backups"
export RESTIC_PASSWORD_FILE=/etc/restic/pass

# fail loudly (on stderr, so cron mails it) if the source isn't actually there
test -d /srv/data/vault || { echo "vault not mounted, aborting" >&2; exit 1; }

restic backup /srv/data --tag nightly
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune

# only ping success if we got this far
curl -fsS https://hc.i0.pm/ping/nightly > /dev/null

Two changes did most of the work. The test -d guard means the job now fails when the source isn't there instead of quietly archiving a void. And the healthchecks ping is the last line, so a green check actually means the whole thing ran, not just that cron managed to fork a shell.
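
Both changes can be pushed a little further. What follows is a minimal sketch, not the script that actually runs here. It assumes two things the post doesn't state: that the source directory may itself be a mount point (an unmounted mount point is usually an empty directory, which test -d accepts), and that the ping endpoint follows the Healthchecks convention of a /fail suffix. The helper names source_has_files, hc_ping and run_with_ping are made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: same ping URL as the nightly script; override via the environment.
HC_URL="${HC_URL:-https://hc.i0.pm/ping/nightly}"

# Hypothetical helper: succeed only if the directory exists AND has at least
# one entry, so an empty (possibly unmounted) directory is rejected.
source_has_files() {
    local dir="$1"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Hypothetical helper: send a ping with an optional suffix such as "/fail".
# Assumes a Healthchecks-style endpoint. A monitoring hiccup should not
# change the job's own result, hence the `|| true`.
hc_ping() {
    curl -fsS -m 10 --retry 3 "${HC_URL}${1:-}" > /dev/null || true
}

# Run a command; ping success if it exits 0, otherwise ping /fail and
# pass the original exit status through to the caller.
run_with_ping() {
    local status=0
    "$@" || status=$?
    if [ "$status" -eq 0 ]; then
        hc_ping ""
    else
        hc_ping "/fail"
    fi
    return "$status"
}

# Example wiring, commented out so the sketch has no side effects:
# nightly() {
#     source_has_files /srv/data/vault &&
#         restic backup /srv/data --tag nightly
# }
# run_with_ping nightly
```

The steps inside nightly are chained with && on purpose: bash suspends set -e for anything run on the left-hand side of ||, so a wrapped function can't rely on errexit to stop at the first failure. On Linux, mountpoint -q from util-linux is another way to ask directly whether a path is mounted.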

The part I'd resisted for years is restore testing, because it felt like ceremony. It isn't. Once a month, on a reminder I deliberately made annoying, I restore the latest snapshot into a scratch directory and diff a handful of known files against the live copies:

# start from an empty scratch directory so stale files can't fake a pass
rm -rf /tmp/restore-check
restic restore latest --target /tmp/restore-check --include /srv/data/vault/notes
diff -r /srv/data/vault/notes /tmp/restore-check/srv/data/vault/notes && echo OK

If that prints OK, I believe the backup. If it doesn't, I find out on a Tuesday afternoon when I have time, rather than at 2am when the array has died. That's the entire value proposition. Restore testing isn't about the times it passes, it's about moving the discovery of failure to a moment of your choosing.
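
Since the failure that started all this was a zero-byte file, the restored tree can also be checked for exactly that before trusting a diff. This is a minimal sketch under stated assumptions: check_restored_tree is a hypothetical helper, not part of restic, and it treats any zero-byte file as suspicious even though some (lock files, placeholders) are legitimately empty, so a failure is a prompt to look rather than proof of a bad backup.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: sanity-check a restored tree.
# Fails if the directory is missing, holds no regular files at all,
# or holds any zero-byte regular file.
check_restored_tree() {
    local dir="$1"
    local total empty

    if [ ! -d "$dir" ]; then
        echo "FAIL: $dir does not exist"
        return 1
    fi

    # Count regular files, and separately those of size zero.
    # tr strips the padding some wc implementations add.
    total="$(find "$dir" -type f | wc -l | tr -d ' ')"
    empty="$(find "$dir" -type f -size 0 | wc -l | tr -d ' ')"

    if [ "$total" -eq 0 ]; then
        echo "FAIL: no files restored under $dir"
        return 1
    fi
    if [ "$empty" -gt 0 ]; then
        echo "FAIL: $empty zero-byte file(s) under $dir"
        return 1
    fi
    echo "OK: $total file(s), none empty"
}

# Example, commented out so the sketch has no side effects
# (the path assumes the scratch directory used in the restore above):
# check_restored_tree /tmp/restore-check/srv/data/vault/notes
```

Run before the diff, this catches the case where the live copy and the restored copy agree only because both are empty.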

I'm not running anything clever here. No bare-metal recovery drills, no offsite tape rotation. It's restic to Backblaze B2, a guard clause, a ping, and a calendar nag. But for the first time I can say the backups work without crossing my fingers, because last Sunday I watched one come back. That's the difference between a backup and a directory full of optimism.