The point first, because it is the only bit that matters: a bad upgrade left my home server unbootable on a Saturday morning, and I had it back exactly as it was before in about forty seconds. No reinstall, no rummaging through backups, no swearing at initramfs. I rolled back to a snapshot that had been taken automatically, half a second before the transaction that broke everything, and got on with my weekend.
I have been running btrfs on the root filesystem of that box for a couple of years now. For a long time it felt like a thing I had configured and then forgotten about, which is the best kind of thing. This was the weekend it paid for itself.
what actually broke
Nothing dramatic. I ran a routine system update, the kind I run without thinking, and somewhere in the middle a kernel and a chunk of the userspace got out of step. The machine came up to a black screen and a kernel panic about not finding init. I am sure I could have debugged it. I am also sure I did not want to, at half eight on a Saturday, with coffee going cold.
The thing about a panic like that is the temptation to start poking. Boot a rescue USB, chroot in, reinstall the kernel, regenerate the initramfs, cross your fingers. I have done that dance before and it eats hours. The whole appeal of snapshots is that you do not have to understand the failure to recover from it. You just go back to a moment when it worked.
the setup that made it boring
The magic, such as it is, comes from two things working together: btrfs subvolumes laid out so that root is its own subvolume, and a tool that snapshots that subvolume before every package transaction.
On this box I use snapper, hooked into the package manager so that every install or upgrade creates a paired pre/post snapshot. You end up with a timeline you can list:
snapper list
 # | Type   | Pre # | Date                     | Cleanup | Description
---+--------+-------+--------------------------+---------+-------------
 0 | single |       |                          |         | current
42 | pre    |       | Sat Dec 30 08:31:14 2023 | number  | pacman -Syu
43 | post   |    42 | Sat Dec 30 08:33:02 2023 | number  | pacman -Syu
Snapshot 42 is the filesystem as it was the instant before the upgrade began. That is the one I wanted.
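For the curious, the pre/post pairing can be approximated by hand. What follows is a minimal sketch, not the hook that runs on this box: run_with_snapshots is a hypothetical helper name, and it assumes a snapper config called root, which is the conventional name but may not be yours.

```shell
# Hypothetical helper: run a command between a pre and a post snapshot.
# Assumes a snapper config named "root"; change -c if yours differs.
run_with_snapshots() {
    desc="$1"
    shift

    # --print-number makes snapper print the new snapshot's number,
    # which the post snapshot needs in order to pair up with it.
    pre="$(snapper -c root create --type pre --print-number \
        --cleanup-algorithm number --description "$desc")" || return 1

    # Run the wrapped command and remember its exit status.
    rc=0
    "$@" || rc=$?

    # Take the post snapshot even if the command failed, so the pair is complete.
    snapper -c root create --type post --pre-number "$pre" \
        --cleanup-algorithm number --description "$desc" || return 1

    return "$rc"
}
```

Something like run_with_snapshots "pacman -Syu" pacman -Syu would then leave a pair in the list much like 42 and 43 above. The point of the pair is that the pre snapshot exists before the command has touched anything.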
The layout matters more than the tooling. If / lives on a subvolume (commonly @) and /home on another (@home), you can roll the system back without touching your home directory. That distinction is the difference between "undo the bad upgrade" and "lose everything I did this week". Get the subvolume layout right once, at install time, and everything afterwards is easy.
You can see what you have with:
btrfs subvolume list /
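Listing subvolumes tells you what exists, not what is mounted where. As a rough check that / and /home really sit on different subvolumes, here is a small sketch that reads /proc/mounts-style lines. subvol_of is a hypothetical helper name, and it assumes both paths are separate btrfs mounts carrying a subvol= option.

```shell
# Hypothetical helper: print the btrfs subvolume backing a mount point.
# Reads /proc/mounts-style lines on stdin. Exits non-zero if the mount point
# is not a btrfs mount with a subvol= option.
subvol_of() {
    awk -v mp="$1" '
        $2 == mp && $3 == "btrfs" {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^subvol=/)
                    sv = substr(opts[i], 8)   # drop the "subvol=" prefix
        }
        END { if (sv == "") exit 1; print sv }
    '
}
```

Run it as subvol_of / < /proc/mounts and again with /home. If the two print different paths, a rollback of root leaves your home directory alone. If the second call fails, /home is not its own mount and is worth a closer look.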
the actual recovery
I booted the install medium, mounted the btrfs top level, and looked at what was there. The recovery is conceptually simple: make the good snapshot the new default subvolume, and reboot into it.
mount -o subvolid=5 /dev/nvme0n1p2 /mnt   # subvolid=5 is the btrfs top level
btrfs subvolume list /mnt
# find the snapshot id you want, then:
btrfs subvolume set-default <id> /mnt
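The "find the snapshot id" step is easy to fumble by eye on a long listing, so here is a small sketch that pulls the id out for a given path. subvol_id is a hypothetical helper name. Snapper keeps each snapshot under .snapshots/<number>/snapshot, but how that path appears in the listing depends on your layout, so the path you pass in is an assumption to check against your own output.

```shell
# Hypothetical helper: look up a subvolume id by its path.
# Reads `btrfs subvolume list` output on stdin, whose lines look like
#   ID 256 gen 1201 top level 5 path @
# Assumes the path contains no spaces.
subvol_id() {
    awk -v want="$1" '
        $NF == want { print $2; found = 1; exit }
        END { exit !found }
    '
}
```

Used as btrfs subvolume list /mnt | subvol_id <path>, it prints the number to hand to set-default, and exits non-zero if the path is not in the listing, which is a useful tripwire before you change anything.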
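If you are on an openSUSE-style layout, snapper does this for you with a single snapper rollback, then a reboot. Either way the mechanism is the same: point / at a copy-on-write snapshot of itself from before the damage. Two things to check if you do it by hand. First, set-default only takes effect if the boot configuration does not pin root to a named subvolume: a rootflags=subvol=... on the kernel command line, which many installs set, takes precedence over the default, so check there and in /etc/fstab. Second, snapper's snapshots are read-only by default, so what you boot should be a writable snapshot taken from the good one, which is what snapper rollback creates for you. Because it is copy-on-write, this costs no real disk space and no real time. The blocks are already on disk; you are just changing which set of them the system calls root.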
Forty seconds later the machine was up, on the old kernel, exactly as it had been on Friday night. I redid the upgrade a few days later once the package mirror had caught up, and it went through without complaint.
what I learned, mostly about myself
The genuine lesson is not "btrfs good". It is that the value of a snapshot is entirely in how cheap it is to take and how boring it is to restore. If taking one is a chore, you will not take them. If restoring one means a flowchart and a prayer, you will not trust them and you will reach for the rescue USB anyway. Snapper-before-every-transaction wins because both halves are free. You forget it is there until the day you need it, and then it is just there.
A few honest caveats, because this is not a silver bullet.
- Snapshots are not backups. They live on the same disk. If that disk dies, your snapshots die with it. I still run actual offsite backups, and you should too. A rolled-back filesystem and a restored backup solve different problems.
- The subvolume layout is the whole game. If your snapshot includes /home, a rollback throws away your recent work along with the bad upgrade. Separate them.
- Cleanup policy matters. Left unmanaged, snapshots accumulate and your disk fills with old states. Snapper's number/timeline cleanup handles this, but you have to let it.
- A rollback is crash-consistent, not application-consistent. A database caught mid-transaction comes back mid-transaction, and anything it wrote after the snapshot is gone. This is great for system state, not a substitute for application-level recovery.
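On the cleanup point, the knobs live in the snapper config for the subvolume, typically /etc/snapper/configs/root. The key names below are snapper's own; the values are illustrative assumptions, not the ones from this box, so tune them to your disk.

```shell
# Illustrative values only; assumed, not copied from a real config.

# Number cleanup: prune the pre/post pairs created around package transactions.
NUMBER_CLEANUP="yes"
NUMBER_MIN_AGE="1800"          # seconds a snapshot must exist before it can be pruned
NUMBER_LIMIT="10"              # how many numbered snapshots to keep
NUMBER_LIMIT_IMPORTANT="5"     # how many snapshots marked important to keep

# Timeline cleanup: prune the periodic snapshots, if you take them.
TIMELINE_CLEANUP="yes"
TIMELINE_LIMIT_HOURLY="5"
TIMELINE_LIMIT_DAILY="7"
TIMELINE_LIMIT_WEEKLY="0"
TIMELINE_LIMIT_MONTHLY="0"
TIMELINE_LIMIT_YEARLY="0"
```

Settings alone do nothing unless the cleanup job actually runs. On systemd machines that is usually snapper-cleanup.timer, so it is worth confirming that it is enabled.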
None of that diminishes the morning. I have spent entire days recovering from less than this. To watch a panic turn into a non-event in under a minute, then close the laptop and go and do something with the rest of the weekend, is a quietly excellent feeling. The best infrastructure is the kind you never have to think about, right up until the one moment it saves you, and then it earns every bit of the setup.
If you run Linux on btrfs and you have not wired up automatic pre-transaction snapshots, do that this week. It is twenty minutes of work for the kind of insurance you only appreciate in hindsight.