switching root onto zfs, carefully

I have run ZFS on data pools for years and trusted it completely. What I had not done until last week was put root itself on ZFS, on a machine I actually care about. The appeal is obvious: boot environments, snapshots before every upgrade, and the ability to roll back a botched apt dist-upgrade in seconds rather than reaching for a rescue USB. The reservation is equally obvious: if you get the boot path wrong, the machine does not come up and you are debugging an initramfs at half eleven on a Sunday.

So this is the careful version. Ubuntu 16.04 with ZFS 0.6.5 from the archive, a machine with two spare SSDs, and a willingness to start again if it went sideways.

why bother at all

The honest answer is the snapshots. On a data pool I take them on a schedule and forget about them. On root, the killer feature is the boot environment: take a snapshot, clone it, and you can boot the old state if the new one is broken. That turns "I'm scared to upgrade this box" into "let's upgrade it and see". For a homelab that runs more services than it strictly should, that change in posture is worth a lot.

Compression is a quiet bonus. lz4 is effectively free on modern hardware and shrinks a root filesystem more than you would expect; most of the saving comes from logs and package caches.

the boot pool problem

GRUB's ZFS support is real but it is fussy. It does not understand every feature flag that a modern pool enables, and if you create your root pool with all the defaults, GRUB may refuse to read it. The accepted answer in 2018 is a small separate pool, conventionally bpool, created with a deliberately conservative set of features, holding only /boot. The big pool, rpool, holds everything else and can use whatever features you like.

# boot pool: conservative features so GRUB can read it
zpool create -o ashift=12 -d \
  -o feature@async_destroy=enabled \
  -o feature@bookmarks=enabled \
  -o feature@embedded_data=enabled \
  -o feature@empty_bpobj=enabled \
  -o feature@enabled_txg=enabled \
  -o feature@extensible_dataset=enabled \
  -o feature@filesystem_limits=enabled \
  -o feature@hole_birth=enabled \
  -o feature@large_blocks=enabled \
  -o feature@lz4_compress=enabled \
  -o feature@spacemap_histogram=enabled \
  -O acltype=posixacl -O canmount=off -O compression=lz4 \
  -O devices=off -O normalization=formD -O relatime=on -O xattr=sa \
  -O mountpoint=/boot -R /mnt \
  bpool /dev/disk/by-id/SOME-ID-part3

# root pool: enable what you like
# (dnodesize=auto is left out: it needs the large_dnode feature, which ZFS 0.6.5 does not have)
zpool create -o ashift=12 \
  -O acltype=posixacl -O canmount=off -O compression=lz4 \
  -O normalization=formD -O relatime=on \
  -O xattr=sa -O mountpoint=/ -R /mnt \
  rpool /dev/disk/by-id/SOME-ID-part4

Use by-id paths, not /dev/sda. The first time the kernel renames your disks on reboot you will understand why.
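If you only know a disk by its current kernel name, something like the following can list the persistent names that point at it. This is a minimal sketch: by_id_for is a hypothetical helper name, it assumes a readlink that supports -f, and the second argument exists only so the function can be tried against a scratch directory.

```shell
# List the /dev/disk/by-id symlinks that resolve to the given device node.
# $1: device node, e.g. /dev/sda
# $2: directory of symlinks to search (defaults to /dev/disk/by-id)
by_id_for() {
    dev=$(readlink -f "$1") || return 1
    by_id_dir=${2:-/dev/disk/by-id}
    for link in "$by_id_dir"/*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$dev" ]; then
            printf '%s\n' "$link"
        fi
    done
}
```

Calling by_id_for /dev/sda prints one line per matching symlink; a disk usually has several, and any of them is more stable than the sdX name.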

datasets, not one big blob

The point of doing this properly is the dataset layout. Separate datasets mean separate snapshot policies, separate properties, and the ability to exclude things from your root boot environment that have no business being rolled back.

zfs create -o canmount=off -o mountpoint=none rpool/ROOT
zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/ubuntu
zfs mount rpool/ROOT/ubuntu

zfs create                 -o mountpoint=/home   rpool/home
zfs create -o canmount=off -o mountpoint=/var    rpool/var
zfs create                 -o mountpoint=/var/log rpool/var/log
zfs create                 -o mountpoint=/var/lib/docker rpool/var/docker

The rpool/ROOT/ubuntu dataset is the thing a boot environment snapshots. Keeping /var/log and Docker's storage in their own datasets means a rollback of root does not also throw away a week of logs or every container image you just pulled. That distinction matters more than it sounds.
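To see at a glance which datasets travel with a boot environment and which do not, a small report along these lines can help. It is only a sketch: be_report and the two labels are names I made up, and it assumes the layout above, where everything under <pool>/ROOT belongs to a boot environment.

```shell
# Print every dataset in a pool with its mountpoint, labelled by whether it
# sits under <pool>/ROOT (rolled back with root) or outside it (left alone).
# $1: pool name (defaults to rpool)
be_report() {
    pool=${1:-rpool}
    # -H gives tab-separated output without a header line.
    zfs list -H -o name,mountpoint -r "$pool" |
    while IFS="$(printf '\t')" read -r name mountpoint; do
        case $name in
            "$pool"/ROOT/*) scope=boot-env ;;
            *)              scope=separate ;;
        esac
        printf '%-10s %-28s %s\n' "$scope" "$name" "$mountpoint"
    done
}
```

With the datasets created above, rpool/ROOT/ubuntu should come out as boot-env, while rpool/home, rpool/var/log and rpool/var/docker come out as separate.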

the bits that bite

A few things caught me out, and they are the reason I am writing this down rather than just linking the upstream HOWTO.

The initramfs needs the ZFS modules and the right hooks, and you must regenerate it after every meaningful change with update-initramfs -u -k all. Forget this and you boot to an initramfs prompt instead of your system.
zfs set mountpoint is not the same as an /etc/fstab line. Datasets mount themselves from properties stored in the pool. Leaving stale fstab entries pointing at the old root is a good way to confuse yourself.
GRUB needs to be told the root is ZFS. On a successful install update-grub works it out, but check the generated linux line actually says root=ZFS=rpool/ROOT/ubuntu before you reboot, not after.
Swap on a zvol is fine, but set it up deliberately: keep it out of the boot environment datasets, and check that suspend still works afterwards.
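Two of the checks above are easy to script before a reboot. The sketch below is an illustration, not the upstream procedure: grub_root_ok and fstab_root_entries are hypothetical helper names, the grub.cfg path is the usual Ubuntu one, and the match on the root= argument is a plain substring test.

```shell
# Succeed only if every "linux" line in a grub.cfg names the expected ZFS root.
# $1: path to grub.cfg   $2: dataset, e.g. rpool/ROOT/ubuntu
grub_root_ok() {
    cfg=$1
    want="root=ZFS=$2"
    # Collect the kernel command lines; no such lines at all counts as a failure.
    lines=$(grep -E '^[[:space:]]*linux[[:space:]]' "$cfg") || return 1
    # Any kernel line that lacks the expected root= argument is a failure.
    if printf '%s\n' "$lines" | grep -v -F -q -- "$want"; then
        return 1
    fi
    return 0
}

# Print non-comment fstab entries that mount something at /.
# With root on ZFS there should normally be none.
# $1: path to fstab (defaults to /etc/fstab)
fstab_root_entries() {
    awk '$1 !~ /^#/ && $2 == "/" { print }' "${1:-/etc/fstab}"
}

# Typical use before rebooting:
#   grub_root_ok /boot/grub/grub.cfg rpool/ROOT/ubuntu || echo "check grub.cfg"
#   fstab_root_entries
#   lsinitramfs "/boot/initrd.img-$(uname -r)" | grep zfs   # modules present?
```

If grub_root_ok fails or fstab_root_entries prints anything, fix that first and regenerate with update-grub and update-initramfs -u -k all.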

did it work

It did, on the second attempt. The first attempt failed because I created bpool with the default feature set and GRUB sat there blinking at me. The rebuild with the conservative feature list booted first time.
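A check like this would have caught my first attempt before the reboot. It is a rough sketch: unexpected_features is a hypothetical helper, and the allowlist is simply the feature list used for bpool earlier in this post, not an authoritative statement of what any given GRUB version can read.

```shell
# Features deliberately enabled on bpool in this post.
GRUB_SAFE="async_destroy bookmarks embedded_data empty_bpobj enabled_txg extensible_dataset filesystem_limits hole_birth large_blocks lz4_compress spacemap_histogram"

# Print any pool feature that is enabled or active but not on the allowlist.
# $1: pool name (defaults to bpool)
unexpected_features() {
    # Default "zpool get all" columns are: NAME PROPERTY VALUE SOURCE
    zpool get all "${1:-bpool}" |
    awk -v safe="$GRUB_SAFE" '
        BEGIN {
            n = split(safe, s, " ")
            for (i = 1; i <= n; i++) ok["feature@" s[i]] = 1
        }
        $2 ~ /^feature@/ && $3 != "disabled" && !($2 in ok) { print $2 }
    '
}
```

Empty output means bpool carries only the features it was created with; any name printed is one to look into before trusting GRUB to read the pool.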

What I have now is a root I am no longer nervous about. Before the next kernel upgrade I run zfs snapshot rpool/ROOT/ubuntu@pre-upgrade, do the upgrade, and if it misbehaves I roll back and reboot. The whole thing takes seconds and I sleep better for it.
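The routine fits in two tiny wrappers. This is a sketch with hypothetical function names; the defaults are the dataset and snapshot name used in this post.

```shell
# Snapshot the root dataset before an upgrade.
# $1: dataset (defaults to rpool/ROOT/ubuntu)   $2: snapshot name (defaults to pre-upgrade)
pre_upgrade_snapshot() {
    zfs snapshot "${1:-rpool/ROOT/ubuntu}@${2:-pre-upgrade}"
}

# Roll the root dataset back to that snapshot. Plain "zfs rollback" only
# accepts the most recent snapshot; rolling back further needs -r, which
# destroys the snapshots taken in between.
rollback_root() {
    zfs rollback "${1:-rpool/ROOT/ubuntu}@${2:-pre-upgrade}"
}
```

Because /boot lives on bpool in this layout, a kernel upgrade also writes outside rpool/ROOT/ubuntu, so snapshotting whichever bpool dataset holds /boot at the same moment is a sensible precaution.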

If you only run ZFS on data, this is the logical next step, but do it on a machine you can afford to reinstall first. Get the boot pool right and the rest is just careful plumbing.