a btrfs snapshot turned a wrecked upgrade into a five-second rollback

I had plans for Saturday. None of them involved a machine that wouldn't boot, but that is what an enthusiastic system upgrade handed me at about nine in the morning. A pile of packages went in, a kernel module went sideways, and the next reboot dropped me into a recovery shell with that particular flavour of dread you get when you realise the broken thing is also the thing you'd use to fix it.

In the old days this is where the weekend went. You boot a live USB, mount things, chroot in, work out which package broke what, and pick at it for hours. I'd done it enough times to know roughly how long it takes, which is "longer than you have".

This time it took about five seconds, because the upgrade had already snapshotted the filesystem before it touched anything.

the setup that did the saving

The whole trick is that the root filesystem is btrfs, laid out in subvolumes, and the package manager takes a snapshot of the root subvolume immediately before each transaction. Snapshots on btrfs are cheap because they're copy-on-write: the snapshot shares every block with the live filesystem and only starts consuming space as things diverge. So "snapshot before every upgrade" costs almost nothing and you forget it's even happening, right up until the moment it saves you.
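For a sense of what "snapshot before every transaction" boils down to, here is a rough sketch of doing it by hand. It is not what my package manager literally runs: the function name, the DRY_RUN switch, and the /.snapshots directory are all assumptions made up for illustration. The only real command in it is `btrfs subvolume snapshot`.

```shell
# Sketch: snapshot a subvolume before a risky change.
# pre_upgrade_snapshot, DRY_RUN and the naming scheme are hypothetical.
pre_upgrade_snapshot() {
    src="$1"        # subvolume to snapshot, e.g. /
    dest_dir="$2"   # existing directory on the same btrfs filesystem (assumed)
    stamp="${3:-$(date +%Y%m%d-%H%M%S)}"   # timestamp, overridable for testing
    target="$dest_dir/pre-upgrade-$stamp"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only print the command that would be run.
        echo "btrfs subvolume snapshot $src $target"
    else
        # Without -r the snapshot is writable; adding -r would make it read-only.
        btrfs subvolume snapshot "$src" "$target"
    fi
}

# Example (needs root and a btrfs root filesystem):
#   pre_upgrade_snapshot / /.snapshots
```

Because the snapshot starts out sharing every block with the live subvolume, running this before each upgrade is quick and takes almost no extra space up front.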

When the boot failed, I dropped to a shell from the boot menu, listed the snapshots, and rolled back to the one taken seconds before the upgrade:

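# list every subvolume and snapshot on the root filesystem, with its numeric ID
btrfs subvolume list /
# note the ID of the pre-upgrade snapshot, then make it the default subvolume to mount
btrfs subvolume set-default <id> /
reboot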

The machine came back up exactly as it had been the moment before I'd broken it. No live USB, no chroot, no detective work. The broken upgrade simply hadn't happened, as far as the running system was concerned.
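If you'd rather not squint at the listing for the right ID in a recovery shell, the lookup is easy to script. The sketch below assumes the usual `btrfs subvolume list` line format ("ID <n> gen <n> top level <n> path <path>"). The helper name and the sample paths are hypothetical, so check them against your own layout before trusting it.

```shell
# Sketch: read `btrfs subvolume list` output on stdin and print the ID of the
# subvolume whose path matches $1 exactly. snapshot_id_for_path is a made-up name.
snapshot_id_for_path() {
    awk -v want="$1" '
        $1 == "ID" {
            i = index($0, " path ")          # start of the " path " marker
            if (i == 0) next                 # skip lines without a path field
            p = substr($0, i + 6)            # everything after " path "
            if (p == want) { print $2; exit }
        }
    '
}

# Example (needs root; the path shown is illustrative only):
#   id="$(btrfs subvolume list / | snapshot_id_for_path .snapshots/42/snapshot)"
#   [ -n "$id" ] && btrfs subvolume set-default "$id" /
```

Matching on the exact path rather than grepping for a fragment avoids picking up a similarly named snapshot by accident, which is not a mistake you want to make while the system is already broken.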

what i actually learned

The valuable part wasn't the rollback itself, satisfying as that was. It was realising I'd been treating snapshots as a backup story when they're really an "undo" story, and the two are different.

A backup protects you from losing data: the disk dies, the building floods, ransomware gets in. Snapshots on the same disk do none of that, and if you confuse the two you'll have a bad day eventually. I still have proper off-machine backups and I'm not giving them up.

But for the specific, common, self-inflicted case of "I changed something and now the system is worse", a local snapshot is the right tool and a backup is the wrong one. Restoring from an off-site backup to undo a botched upgrade is using a sledgehammer to crack a nut, and it's slow besides. The snapshot gave me a clean, instant rollback for exactly the failure mode I hit most often, which is me.

So the lesson is to keep both, and to be clear in my head about which one each is for. Backups for disasters that come from outside. Snapshots for disasters that come from me, which, if I'm honest, is most of them. I got my Saturday back, and I didn't even have to find the live USB.