Here is the short version, because the long version is mostly me being smug: a system upgrade broke a box badly enough that it wouldn't boot, and I had it back, byte-for-byte, in about five minutes. No reinstall, no reaching for backups, no rebuilding config from memory. A snapshot taken automatically a few minutes before the upgrade did the whole job.
I had been meaning to write this up for a while. The weekend it actually mattered seemed like the right prompt.
what went wrong
The box is a small home server: a few services, some containers, nothing dramatic. I ran a routine pacman -Syu on the Saturday morning before coffee had fully landed, which should have been my first warning. A kernel bump, a glibc bump, and a handful of library updates. The upgrade itself completed cleanly. The reboot did not.
It came up far enough to mount root and then sat there. Something in the initramfs and the new kernel disagreed, and rather than debug a half-booted system over a flaky serial console on a Saturday, I just rolled it back.
the setup that saved me
The thing doing the saving is snapper, wired to take a snapshot before and after every pacman transaction. That's the part people skip. A snapshot you took last Tuesday is a backup. A snapshot taken automatically thirty seconds before the change that broke everything is a time machine.
The relevant config, roughly:
# /etc/snapper/configs/root
TIMELINE_CREATE="yes"
NUMBER_CLEANUP="yes"
NUMBER_LIMIT="50"
NUMBER_LIMIT_IMPORTANT="10"
And the pacman hook, via snap-pac, which is the bit that ties snapshots to transactions:
# /usr/share/libalpm/hooks/... (provided by snap-pac)
[Trigger]
Operation = Upgrade
Operation = Install
Operation = Remove
Type = Package
Target = *
Every transaction now gets a "pre" and "post" pair. The layout matters too. Root lives on a subvolume (@) with snapshots in @snapshots, mounted at /.snapshots, which is the convention the openSUSE folk landed on and Arch borrowed. The important property: snapshots are read-only by default and cost almost nothing to take because btrfs is copy-on-write. Taking one doesn't duplicate data, it just pins the current state of the blocks. You only pay for what changes afterwards.
the actual rollback
Because the box wouldn't boot, I needed to roll back from outside it. I booted an Arch ISO from a USB stick, which I keep around precisely for these mornings, and did the unglamorous thing.
# find the device, mount the top-level
mount -o subvolid=5 /dev/nvme0n1p2 /mnt
ls /mnt/@snapshots/
snapper list would tell me the numbers on a running system, but from the rescue environment I just read the info.xml files and matched timestamps. The "pre" snapshot from the morning's upgrade was number 412. I wanted the system to boot from that state instead of the broken @.
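Matching timestamps by hand gets tedious with fifty snapshots, so here is a minimal sketch of how the info.xml files could be summarised from a rescue shell. It assumes snapper's usual info.xml layout, with the num, type, date and description elements each on their own line. The function names xml_field and list_snapshots are made up for illustration.

```shell
#!/bin/sh
# Summarise snapper snapshots without snapper, by reading each info.xml.
# Assumption: every snapshot directory holds an info.xml whose <num>, <type>,
# <date> and <description> elements each sit on a single line.

# xml_field FILE TAG: print the text of the first <TAG>...</TAG> in FILE.
xml_field() {
    sed -n "s:.*<$2>\(.*\)</$2>.*:\1:p" "$1" | head -n 1
}

# list_snapshots DIR: one tab-separated line per snapshot
# (number, type, date, description), sorted by snapshot number.
list_snapshots() {
    for info in "$1"/*/info.xml; do
        [ -f "$info" ] || continue   # skip when the glob matched nothing
        printf '%s\t%s\t%s\t%s\n' \
            "$(xml_field "$info" num)" \
            "$(xml_field "$info" type)" \
            "$(xml_field "$info" date)" \
            "$(xml_field "$info" description)"
    done | sort -n
}

# Usage from the rescue shell, with the top-level mounted as above:
#   list_snapshots /mnt/@snapshots
```

The "pre" line with the right date is the one to roll back to. Its number is the directory name under @snapshots.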
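The clean way is to swap the subvolume that gets mounted as root. This is not the btrfs "default subvolume" setting; it works because root is mounted by name (subvol=@) in fstab and on the kernel command line, so whatever is called @ is what boots. If yours is mounted by subvolid, the new @ gets a new ID and that needs updating too. Move the broken @ aside and promote a writable copy of the good snapshot into its place: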
cd /mnt
mv @ @broken
btrfs subvolume snapshot @snapshots/412/snapshot @
The last command takes the read-only snapshot and makes a fresh writable subvolume from it called @. Unmount, reboot, pull the USB stick. It came straight up, exactly as it had been before I touched anything, services and all.
umount /mnt
reboot
Total time, including faffing about with the USB stick: five minutes, maybe six. I kept @broken around for a day so I could poke at what actually went wrong (the initramfs, as it turned out, an easy fix in hindsight), then deleted it with btrfs subvolume delete.
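For practising, the rollback steps above can be wrapped in something that only prints what it would do. This is a sketch rather than a finished tool: rollback_plan is a hypothetical name, and it assumes the same layout as this box (@ for root, @snapshots/NUM/snapshot for the snapshots, top-level subvolume mounted at the path you pass in).

```shell
#!/bin/sh
# rollback_plan TOP NUM: print the commands that would replace @ with a
# writable copy of snapshot NUM. TOP is where the top-level subvolume
# (subvolid=5) is mounted. Nothing is moved or created; it only prints.
rollback_plan() {
    top=$1
    num=$2
    src="$top/@snapshots/$num/snapshot"

    # Refuse if the snapshot is missing.
    if [ ! -d "$src" ]; then
        echo "no snapshot at $src" >&2
        return 1
    fi

    # Refuse if an earlier @broken is still lying around.
    if [ -e "$top/@broken" ]; then
        echo "$top/@broken already exists, deal with it first" >&2
        return 1
    fi

    echo "mv $top/@ $top/@broken"
    echo "btrfs subvolume snapshot $src $top/@"
    echo "umount $top"
}

# Usage from the rescue shell:
#   rollback_plan /mnt 412
```

Printing instead of executing is deliberate: read the three lines, check the snapshot number, then run them by hand.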
what I'd tell past me
A few things, none of them clever.
- The snapshots are worthless if you can't find them from a rescue shell. Practise the cold-boot rollback once when nothing is broken, so you're not learning the subvolume layout under pressure.
- Snapshots are not backups. They live on the same disk. The disk dies, they die with it. I still run proper off-box backups; the snapshots are for the "I broke it five minutes ago" case, which is most cases.
- NUMBER_LIMIT matters. Fifty pre/post snapshots of an active system can accumulate real space if your transactions churn a lot of data. Watch btrfs filesystem df.
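On the NUMBER_LIMIT point, a rough way to see how many snapshots are actually being held against the configured limit, as a sketch. The helper names config_value and count_snapshots are made up, and it assumes the config uses plain KEY="value" lines like the excerpt earlier.

```shell
#!/bin/sh
# config_value FILE KEY: print the value from a KEY="value" line in FILE.
# The match is anchored on KEY=, so NUMBER_LIMIT does not also pick up
# NUMBER_LIMIT_IMPORTANT.
config_value() {
    sed -n "s/^$2=\"\(.*\)\"\$/\1/p" "$1" | head -n 1
}

# count_snapshots DIR: count subdirectories of DIR whose names are plain
# numbers, which is how snapper names its snapshot directories.
count_snapshots() {
    n=0
    for d in "$1"/*/; do
        case "$(basename "$d")" in
            ''|*[!0-9]*) ;;          # not a plain number, skip it
            *) n=$((n + 1)) ;;
        esac
    done
    echo "$n"
}

# Usage on the running system (reading /.snapshots may need root):
#   limit=$(config_value /etc/snapper/configs/root NUMBER_LIMIT)
#   have=$(count_snapshots /.snapshots)
#   echo "$have snapshots held, NUMBER_LIMIT is $limit"
```

This only counts directories. It says nothing about how much space they pin, which is what the btrfs tooling is for.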
The honest takeaway is that this isn't exotic. btrfs subvolumes plus snapper plus snap-pac is a well-trodden path; it just sits in the "I'll set that up properly someday" pile for years until the one Saturday it earns its keep. Mine earned it. Set it up before yours has to.