backups I actually test now

A server rack with status lights

For years my backups were a job that ran, a green tick, and a quiet assumption that the tick meant something. Then a friend lost a database to a backup that had been faithfully copying a zero-byte file every night for months, and the green tick had been there the whole time. An untested backup isn't a backup. It's a hope with a cron entry.

So I changed the question I was asking. Not "did the backup run?" but "can I restore it?" Those are very different things, and only the second one is the one that matters at 3am when something's gone. The first tells you a process exited zero. The second tells you the thing you actually care about still exists in a usable form.

The mechanism is dull, which is the point. Once a week, a script spins up a throwaway container, pulls the latest restic snapshot into it, and tries to actually use the result. For the databases that means restoring the dump and running a couple of sanity queries: does the table exist, does it have roughly the right number of rows, is the most recent timestamp recent. For the file stores it checksums a handful of known files against values I recorded when I knew they were good.

A homelab setup with cabling

restic check got added too, because verifying the repository's own integrity is cheap and catches a different class of problem: bit-rot and a corrupted repo, rather than a backup that ran against the wrong thing. The two checks answer separate questions and I want both answered.

The part that actually changed my behaviour wasn't the testing, it was the alerting. The restore test pushes a heartbeat to a dead-man's-switch service when it passes. If the heartbeat doesn't arrive, I get poked. This inverts the usual failure mode beautifully: I no longer rely on a failing job to shout, because failing jobs have a nasty habit of failing silently, including failing to send their own failure alert. Silence is now the alarm.

There's a subtlety worth naming here. The restore test runs against a throwaway container precisely so that it can't touch anything real. The first version of this script restored into a path that overlapped with live data, which is a fantastic way to turn your backup test into your worst incident. A restore test that can write over the thing it's protecting isn't a safety net, it's a loaded gun pointed at your own foot. So the test now runs fully isolated, reads from the snapshot, proves the data, and throws the whole container away when it's done.

None of this is sophisticated and that's deliberate. The whole thing is maybe sixty lines of shell and a free heartbeat URL. But I've now watched a real restore complete, with my own eyes, more than once, and the difference in how I sleep is genuine. The first time the weekly test caught a problem, a credential rotation had quietly broken the restore path while the backups themselves carried on looking perfectly healthy, I was so pleased I nearly framed the alert.