A botched upgrade, and the btrfs snapshot that undid it in seconds

I broke my home server on Friday evening and had it back exactly as it was in about thirty seconds. The thing that saved me was not a backup in the usual sense. It was a btrfs snapshot taken automatically before the upgrade, which I had promptly forgotten about. That is the best kind of safety net: one you do not have to remember.

The setup is unglamorous. The server runs its root filesystem on btrfs, with the system in a subvolume rather than directly on the top level. Before any package upgrade, a small hook takes a read-only snapshot of that subvolume. Plenty of people use snapper for this and it is excellent, but mine is a twenty-line script because I am stubborn and like to understand the moving parts.
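For a sense of scale, a hook of this kind might look roughly like the following. It is a sketch under assumptions, not the script described above: the mount point /mnt/btrfs-top, the variable names and the DRY_RUN switch are all made up for illustration. By default it only prints the command it would run, and how it gets called before an upgrade depends on your package manager.

```shell
#!/bin/sh
# Sketch of a pre-upgrade snapshot hook. All paths here are assumptions.
set -eu

TOP="${TOP:-/mnt/btrfs-top}"        # hypothetical mount point of the top-level subvolume
SOURCE="${SOURCE:-$TOP/@}"          # the subvolume that holds the system
DEST_DIR="${DEST_DIR:-$TOP/@snapshots}"

# Build a name like pre-upgrade-20180803; defaults to today's date.
snapshot_name() {
    printf 'pre-upgrade-%s\n' "${1:-$(date +%Y%m%d)}"
}

# Take a read-only (-r) snapshot of SOURCE.
# With DRY_RUN=1 (the default) the command is only printed.
take_snapshot() {
    target="$DEST_DIR/$(snapshot_name "${1:-}")"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "btrfs subvolume snapshot -r $SOURCE $target"
    else
        btrfs subvolume snapshot -r "$SOURCE" "$target"
    fi
}

take_snapshot
```

Run it as it stands to see the command, then with DRY_RUN=0 as root once the paths match your layout. A second upgrade on the same day would collide with the existing name, so a real hook probably wants a time component as well.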

What went wrong was mundane. A routine update pulled in a new kernel and a graphics stack change that did not agree with my ageing GPU, and on reboot I got a black screen and a flashing cursor. No console, no network, nothing useful. The sort of failure where, historically, you sigh and reach for the install media.

Instead I dropped to the bootloader, booted an older entry that still had a working initramfs, and looked at what I had:

# btrfs subvolume list /
ID 257 gen 4412 top level 5 path @
ID 312 gen 4410 top level 5 path @snapshots/pre-upgrade-20180803
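With only one snapshot the choice is obvious. Once a few have piled up, something like the helper below could pick out the newest one from that listing. It is a sketch that assumes the pre-upgrade-YYYYMMDD naming shown above and paths without spaces; the function name and the extra sample line are invented for the demonstration.

```shell
#!/bin/sh
set -eu

# Read `btrfs subvolume list` output on stdin and print the path of the
# newest pre-upgrade snapshot. Date-stamped names sort chronologically.
latest_pre_upgrade() {
    awk '$NF ~ /^@snapshots\/pre-upgrade-[0-9]+$/ { print $NF }' | sort | tail -n 1
}

# Demo on canned output (the 20180720 line is made-up sample data).
# On a real system: btrfs subvolume list / | latest_pre_upgrade
latest_pre_upgrade <<'EOF'
ID 257 gen 4412 top level 5 path @
ID 301 gen 4300 top level 5 path @snapshots/pre-upgrade-20180720
ID 312 gen 4410 top level 5 path @snapshots/pre-upgrade-20180803
EOF
```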

The rollback is conceptually simple. The broken subvolume is not the truth; it is just the one currently mounted as root. You move it aside and promote the snapshot in its place:

# mount -o subvolid=5 /dev/sda2 /mnt   # the top-level subvolume, whatever the default is
# mv /mnt/@ /mnt/@broken
# btrfs subvolume snapshot /mnt/@snapshots/pre-upgrade-20180803 /mnt/@
# reboot

That is it. Because a btrfs snapshot shares all its blocks with the original through copy-on-write, creating one is effectively free and promoting one is just a metadata operation. No copying gigabytes around, no waiting. The snapshot the hook took is read-only, so rather than moving it into place I took a fresh, writable snapshot of it as the new @, which also leaves the original untouched in case I need it again. The system came back exactly as it had been before the upgrade, kernel and all, and I kept @broken around for a day so I could pick through the logs and work out what had actually upset it.
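If you would rather not type those commands from memory at a rescue prompt, they can be wrapped with a couple of sanity checks. The following is a sketch, not a tested recovery tool: the function names and the DRY_RUN switch are assumptions, and it expects the top-level subvolume to be mounted already. By default it only prints what it would do.

```shell
#!/bin/sh
set -eu

# Echo the command in dry-run mode (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# rollback TOP SNAPSHOT
# Move TOP/@ aside as TOP/@broken, then recreate TOP/@ as a writable
# snapshot of TOP/SNAPSHOT. Refuses to run if the snapshot is missing
# or if an earlier @broken is still lying around.
rollback() {
    top=$1
    snap=$2
    [ -d "$top/$snap" ] || { echo "no such snapshot: $top/$snap" >&2; return 1; }
    [ ! -e "$top/@broken" ] || { echo "$top/@broken already exists" >&2; return 1; }
    run mv "$top/@" "$top/@broken"
    run btrfs subvolume snapshot "$top/$snap" "$top/@"
}

# Example call, with the top level mounted at /mnt:
#   rollback /mnt @snapshots/pre-upgrade-20180803
```

Once you have finished with the logs, btrfs subvolume delete /mnt/@broken reclaims the space.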

A few things I would underline if you are tempted to set this up:

- The snapshot has to be taken before the change, automatically, every time. A safety net you take manually when you remember is a safety net you do not have.
- Snapshots are not backups. They live on the same disk. If that disk dies, your snapshots die with it. I still run a separate offsite copy for anything I would actually cry about losing.
- Keep the layout sane. Root in a subvolume, snapshots in a dedicated location, and know how your bootloader references them before you need to.
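On that last point, one quick way to see how root is currently referenced is to look at its mount options. The helper below is a small sketch with a made-up function name, and the option string in the demo is only an example of what btrfs typically reports. The rename trick above works when root is selected by name (subvol=@, often passed as rootflags= on the kernel command line) and not by a fixed subvolid.

```shell
#!/bin/sh
set -eu

# Extract the subvol= value from a comma-separated mount option string.
# subvolid= entries are deliberately not matched.
subvol_of() {
    printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^subvol=//p'
}

# On a live system (findmnt ships with util-linux):
#   subvol_of "$(findmnt -no OPTIONS /)"
# Demo with an example option string:
subvol_of "rw,relatime,space_cache,subvolid=257,subvol=/@"
```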

The weekend I would otherwise have spent on a careful reinstall evaporated, in the good way. I went to the pub instead. That is the real value of this stuff: not the cleverness of copy-on-write, but the hour on a Friday night that you get to keep.