Ramblings of an aging IT geek

A backup you've never restored is just a hope

After a restore that nearly didn't work, the changes I made to my homelab backups so I trust them when it counts.

For years my backups were a comfort blanket I never washed. Restic ran nightly, the job went green, the email said "snapshot complete", and I felt responsible. Then one evening I needed to actually pull a file back, a config I'd deleted and regretted, and the restore took four anxious minutes of me discovering I had no idea how to do it. The data was fine. My confidence was not. A backup you've never restored isn't a backup, it's a hope with a cron entry.

So I rebuilt the whole thing around one principle: the restore is the feature. The backup is just the boring prerequisite.

What I was doing wrong

Three things, in hindsight.

I was monitoring the backup job and not the backup. A successful restic backup exit code tells you the snapshot was written. It tells you nothing about whether the snapshot is readable, complete, or restorable on a fresh machine. Those are different claims and I'd been conflating them.

I had no off-site copy I trusted. Everything lived on the NAS. A backup in the same room as the thing it's backing up survives a disk failure and nothing else. Fire, theft, a fat-fingered rm against the wrong mount, all of those take the original and the backup together.

And I had never, not once, done a clean-room restore. Restoring a file you can see the path of is easy. Restoring when you've lost the index, the original host, and your composure is the test that matters, and I'd never sat it.

The 3-2-1 bit, done properly

The old rule still holds: three copies, two media, one off-site. My version now:

  • The live data on the host.
  • A nightly restic repo on the NAS, local and fast for the common case of "I deleted a thing".
  • An off-site restic repo in object storage, synced after the local one, so a single building can burn down without taking my data with it.

The off-site copy is encrypted before it leaves the house, which restic does for you, so I don't have to trust the storage provider with plaintext. That removed the last excuse I had for not having one.

Testing, automatically

This is the part that changed everything. Once a week a small systemd timer does a real restore into a scratch directory and checks the result:

#!/usr/bin/env bash
set -euo pipefail

# repository location is required; fail with a usage hint if it is missing
REPO="${1:?usage: $0 <repo>}"
SCRATCH="$(mktemp -d)"
trap 'rm -rf "$SCRATCH"' EXIT

# verify repository integrity, reading a sample of actual data
restic -r "$REPO" check --read-data-subset=5%

# restore one known canary file and compare it
restic -r "$REPO" restore latest \
  --target "$SCRATCH" \
  --include /etc/canary.txt

if ! diff -q /etc/canary.txt "$SCRATCH/etc/canary.txt" >/dev/null; then
  echo "RESTORE MISMATCH" >&2
  exit 1
fi

echo "restore verified: $(date -Iseconds)"

The --read-data-subset=5% is the important flag. Plain restic check only validates the repository's structure and metadata. Adding --read-data-subset makes it actually read and verify a slice of the real backup data, which is the difference between "the catalogue looks right" and "the data is genuinely there and not corrupt". With a percentage, restic picks a random 5% of the pack files on each run, so it samples rather than strictly rotates: successive weeks can overlap, but over the months the runs touch a growing share of the repo, which gives me real confidence without the cost of reading everything every week.
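If a guaranteed rotation matters more than a random sample, restic check also accepts the n/t form of --read-data-subset, which splits the pack files into t groups and reads group n. A minimal sketch, assuming five groups and the ISO week number as the counter; both choices and the subset_for_week helper are illustrative, not part of the script above:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: map an ISO week number (01-53) to the "n/t"
# argument for restic check --read-data-subset, cycling through t groups.
subset_for_week() {
  local week="$1" groups="${2:-5}"
  # 10# forces base 10 so "08" and "09" are not rejected as invalid octal
  echo "$(( (10#$week - 1) % groups + 1 ))/$groups"
}

# Usage in the weekly script (assumes GNU date for +%V):
#   restic -r "$REPO" check --read-data-subset="$(subset_for_week "$(date +%V)")"
```

With weekly runs and five groups, every pack file gets read once every five weeks. The cycle slips slightly at the year boundary, which seems an acceptable trade for not having to store any state.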

The canary file is a fixed thing I know the contents of, restored end to end and diffed. If that comparison ever fails, the timer fails, and a failed timer is the one thing my monitoring does shout about.

A note on how the alerting hangs together, because a test that fails silently is no better than no test. The systemd timer runs the script as a service unit, and systemd's OnFailure= hook fires a second unit that pushes a notification when the verification exits non-zero. So a corrupt repo, a missing canary, or restic itself falling over all converge on the same outcome: my phone buzzes. The success path is deliberately quiet. I do not want a weekly "backup verified" message, because a weekly message that's always the same is a message you stop reading, and the day it doesn't arrive is the day you won't notice. Absence of failure is the signal, not presence of success.
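A minimal sketch of how units like these can be laid out, written as a small installer function. The unit names, script path, repo path, password file location and the notify-phone command are all placeholders I've made up for illustration; only OnFailure=, the timer settings and RESTIC_PASSWORD_FILE are standard systemd and restic features.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write the three units into a directory (default: /etc/systemd/system).
# Every name and path inside the units is hypothetical.
write_units() {
  local dir="${1:-/etc/systemd/system}"

  cat >"$dir/restore-verify.service" <<'EOF'
[Unit]
Description=Weekly restic restore verification
# Start the notifier whenever this unit ends up in the failed state
OnFailure=notify-failure@%n.service

[Service]
Type=oneshot
Environment=RESTIC_PASSWORD_FILE=/etc/restic/password
ExecStart=/usr/local/bin/restore-verify.sh /mnt/nas/restic
EOF

  cat >"$dir/restore-verify.timer" <<'EOF'
[Unit]
Description=Run restore verification once a week

[Timer]
OnCalendar=weekly
# Catch up after boot if the machine was off at the scheduled time
Persistent=true

[Install]
WantedBy=timers.target
EOF

  cat >"$dir/notify-failure@.service" <<'EOF'
[Unit]
Description=Push a notification that %i failed

[Service]
Type=oneshot
# %i is the name of the failed unit, passed in as the template instance
ExecStart=/usr/local/bin/notify-phone "%i failed"
EOF
}

# Usage (as root):
#   write_units && systemctl daemon-reload && systemctl enable --now restore-verify.timer
```

There is no success hook anywhere in those units, which is the point: the only path that produces a message is the failure path.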

The schedule itself is worth thinking about rather than just picking a number. Backups run nightly because a day's loss is the most I'm willing to accept for this data. Verification runs weekly because even a sampled read-and-diff has a cost, and the failure modes it catches, bit rot and repository corruption, develop over weeks, not hours. The off-site sync runs after the local backup completes, chained rather than scheduled independently, so I can never end up shipping a half-written snapshot off-site. Getting the ordering wrong there is a subtle way to back up corruption faithfully to two locations.
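The chaining can be sketched in a few lines. This version assumes the off-site step uses restic copy with --from-repo (restic 0.14 or newer), which is one way to do the sync and not necessarily the only sensible one. It also assumes the off-site repo is already initialised and that credentials for both repos come from the environment. The run and nightly helpers and the example locations are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each command; skip execution when DRY_RUN is set.
run() {
  echo "+ $*"
  [ -n "${DRY_RUN:-}" ] || "$@"
}

# Back up to the local repo, then copy snapshots off-site.
# The && makes the ordering explicit: no copy unless the backup succeeded.
nightly() {
  local local_repo="$1" offsite_repo="$2"
  shift 2
  run restic -r "$local_repo" backup "$@" &&
    run restic -r "$offsite_repo" copy --from-repo "$local_repo"
}

# Usage (hypothetical repo locations and paths):
#   nightly /mnt/nas/restic s3:s3.example.com/backups /etc /srv
```

Running it once with DRY_RUN=1 prints the two commands without touching either repo, which is a cheap way to check the ordering before trusting it to a timer.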

The full-restore drill

Automated checks are necessary but not sufficient, because they don't test the human. So twice a year I do the real thing: spin up a fresh VM with nothing on it, install restic, and restore a whole service from off-site as if the house had burned down. No notes from last time, deliberately, because if I need notes I haven't kept them current.

The first time I did this it took an hour and exposed two gaps. I'd never recorded the repository password anywhere I could reach without the machine that was gone, which is a delightful catch-22. And one service stored state outside the path I was backing up, so it restored into a broken state. Both were invisible until I tried. Now the password lives in a password manager that itself has an off-site export, and the backup paths get reviewed whenever I add a service.
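For the drill itself, the restic side is short; the hard parts are the ones described above. A rough sketch of the two commands, assuming the restore goes into a staging directory first so it can be inspected before anything goes live. The drill_restore helper, the repo location and the paths are hypothetical. restic will prompt for the repository password unless RESTIC_PASSWORD_FILE or similar is set, which is exactly the moment you find out whether you can still reach it.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each command; skip execution when DRY_RUN is set.
run() {
  echo "+ $*"
  [ -n "${DRY_RUN:-}" ] || "$@"
}

# Show the newest snapshot, then restore only the given paths from it
# into a staging directory.
drill_restore() {
  local repo="$1" target="$2"
  shift 2
  local cmd=(restic -r "$repo" restore latest --target "$target")
  local path
  for path in "$@"; do
    cmd+=(--include "$path")
  done

  run restic -r "$repo" snapshots --latest 1 &&
    run "${cmd[@]}"
}

# Usage (hypothetical repo and service paths):
#   drill_restore s3:s3.example.com/backups /srv/restore-staging /srv/myservice /etc/myservice
```

Listing every path a service needs as an explicit argument doubles as the review step: if the service comes up broken from staging, a path is missing from the backup.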

What this actually bought me

Honestly, mostly sleep. The mechanical reliability is roughly what it always was; restic was never the weak link. What changed is that I now know, with evidence rather than faith, that I can get my data back. The weekly timer proves the data is readable. The off-site copy proves a single disaster won't take everything. The twice-yearly drill proves that I, the slowest and least reliable component, can drive the whole process under pressure.

A green backup job felt like safety. A successful restore, repeated on a schedule, is safety. The gap between those two is exactly the gap that ruins people's weeks, and I'm glad I closed it before I had to find out the hard way.