Ramblings of an aging IT geek
homelab

backups you haven't restored are just wishes

After a restore that didn't work when I needed it, I added a monthly automated restore test to the homelab so I find out my backups are broken on my schedule, not the universe's.

I had backups for years before I had a working restore, and there's a meaningful difference between the two that you only discover at the worst possible moment. Mine arrived when a disk in the homelab gave up and I went to pull back a Postgres dump that had been "running fine" nightly for months. The dump file was there. It was the right size, roughly. It was also truncated, because the backup script piped pg_dump into gzip without set -o pipefail, so when pg_dump died halfway through one night with a connection error, the script cheerfully reported success and wrote half a database into a perfectly valid gzip stream.
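
A minimal sketch of that failure mode, for anyone who wants to see it happen. The `failing_dump` function is a hypothetical stand-in for a pg_dump that writes part of its output and then dies; nothing here touches a real database.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a pg_dump that emits some SQL and then fails.
failing_dump() {
  echo "CREATE TABLE users (id int);"
  return 1
}

# Without pipefail, the pipeline's status is that of the last command (gzip),
# so the upstream failure is hidden.
set +o pipefail
failing_dump | gzip > /dev/null
without_pipefail=$?

# With pipefail, a failure in any stage makes the whole pipeline non-zero.
set -o pipefail
failing_dump | gzip > /dev/null
with_pipefail=$?

echo "without pipefail: $without_pipefail, with pipefail: $with_pipefail"
# prints "without pipefail: 0, with pipefail: 1"
```

In both runs gzip receives a clean end-of-input and writes a complete, valid stream. Only the exit status differs.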

The job had been green every morning. Green meant "the script exited zero", which it did, every single time, including the times it backed up nothing useful. I'd been collecting confident-looking failures.

the change in habit

So the rule now is simple and slightly tedious: a backup that has never been restored does not count as a backup. It counts as an intention. The only way I trust one is to restore it somewhere and check the result, and the only way I actually do that consistently is to make a machine do it for me, because I will not do it by hand more than twice before I get bored and stop.

The monthly job spins up a throwaway container, restores the latest dump into it, and runs a couple of assertions: does the schema look right, and is the row count in a handful of key tables within shouting distance of production. Not a full diff, just enough to catch "the file is empty" and "the restore errored halfway".

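#!/usr/bin/env bash
set -euo pipefail

# Newest dump by modification time.
LATEST=$(ls -1t /backups/db/*.sql.gz | head -1)
echo "testing restore of: $LATEST"

docker run --rm -d --name restore-test -e POSTGRES_PASSWORD=test \
  postgres:16 >/dev/null
# Always stop (and, via --rm, remove) the container, even if a step below fails.
trap 'docker stop restore-test >/dev/null 2>&1 || true' EXIT

# Wait for the real server instead of sleeping a fixed time. The check goes
# over TCP because the temporary server the image runs during init only
# listens on the Unix socket.
for _ in $(seq 1 30); do
  docker exec restore-test pg_isready -q -h 127.0.0.1 -U postgres && break
  sleep 1
done

# ON_ERROR_STOP makes psql exit non-zero on the first failed statement;
# without it psql carries on after errors and still exits 0.
gunzip -c "$LATEST" | docker exec -i restore-test \
  psql -U postgres -v ON_ERROR_STOP=1 >/dev/null

ROWS=$(docker exec restore-test psql -U postgres -tAc \
  "SELECT count(*) FROM users;")

if [ "$ROWS" -lt 1000 ]; then
  echo "FAIL: users table only has $ROWS rows" >&2
  exit 1
fi
echo "OK: restored cleanly, users=$ROWS"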

It's crude. It restores into a stock Postgres image, checks one table, and throws the container away. But it does the one thing the nightly backup never did, which is actually read the file back and complain when it's wrong. The first time I ran it, it failed, because of course it did. That's the entire reason it exists.
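
The "within shouting distance of production" check could be sketched along these lines. This is an illustration, not the script I run: `within_tolerance` and `check_table` are hypothetical helper names, the 10% tolerance is an arbitrary choice, and the counts in the usage lines are made-up numbers. In practice the restored count would come from the same `psql -tAc` query as above, and the expected count from wherever you record production's numbers.

```shell
#!/usr/bin/env bash
set -euo pipefail

# within_tolerance RESTORED EXPECTED PCT
# Succeeds when RESTORED is within PCT percent of EXPECTED (integer maths).
within_tolerance() {
  local restored=$1 expected=$2 pct=$3
  local diff=$(( restored > expected ? restored - expected : expected - restored ))
  (( diff * 100 <= expected * pct ))
}

# check_table TABLE RESTORED EXPECTED
# Prints OK or FAIL for one table; returns non-zero on FAIL.
check_table() {
  local table=$1 restored=$2 expected=$3
  if within_tolerance "$restored" "$expected" 10; then
    echo "OK: $table=$restored"
  else
    echo "FAIL: $table=$restored, expected about $expected" >&2
    return 1
  fi
}

# Made-up example counts: users is close enough, orders has lost half its rows.
failed=0
check_table users 1187 1200 || failed=1
check_table orders 26000 54000 || failed=1
```

A relative tolerance keeps working as tables grow, whereas a fixed floor like 1000 slowly stops meaning anything.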

The result of the failure goes to the same place a real outage alert would, so a broken restore wakes me up gently on a Tuesday rather than violently during an actual disaster. I'd much rather find out my backups are useless on my own schedule than on the universe's. The universe has appalling timing.
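
One way to wire a check into alerting is a small wrapper that runs the test and calls a notifier only on failure. This is a sketch under assumptions: `run_check` and `notify_stderr` are hypothetical names, and the notifier here only writes to stderr. You would swap in whatever already delivers your outage alerts.

```shell
#!/usr/bin/env bash
set -uo pipefail

# run_check NAME NOTIFIER CMD [ARGS...]
# Runs CMD; on a non-zero exit, hands NOTIFIER a one-line summary.
run_check() {
  local name=$1 notifier=$2
  shift 2
  local output status=0
  output=$("$@" 2>&1) || status=$?
  if [ "$status" -ne 0 ]; then
    # Pass along only the last line of output, usually the actual complaint.
    "$notifier" "$name failed (exit $status): ${output##*$'\n'}"
  fi
  return "$status"
}

# Hypothetical notifier: replace with the real alerting hook.
notify_stderr() {
  echo "ALERT: $1" >&2
}
```

A scheduled job would then call something like `run_check restore-test notify_stderr ./restore-test.sh`, where the script path is a placeholder for wherever the restore test lives.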

I added set -o pipefail to the original backup script the same afternoon, obviously. But the lesson wasn't really about pipefail. It was that I'd been trusting an exit code to tell me something it couldn't possibly know. Only the restore knows whether the backup worked, so the restore is the only test worth running.