TrueNAS and the Great Disk Shuffle

[Image: a server rack with drive bays]

My TrueNAS box was full. Not "getting full", full, the kind where the weekly backup job had started failing and I'd been quietly deleting things to buy a few more days. The pool was a six-drive RAIDZ2 of 4TB disks, and the only sensible way to grow it without rebuilding from scratch was the disk shuffle: replace each drive with a larger one, one at a time, let ZFS resilver after each swap, and once the last 4TB disk is gone the pool quietly grows to fill the new space.

It works. It is also, for about four days, mildly terrifying.

The principle is simple. RAIDZ2 tolerates two failed disks, so pulling one to replace it leaves you with one disk of redundancy still in hand. ZFS rebuilds the new disk from parity, you check it's healthy, and only then do you touch the next one. The pool stays online throughout, which is the whole appeal. The downside is that every resilver runs the surviving disks hard for hours, and a resilver is exactly the workload most likely to find the second failure you didn't know you had.
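The same principle explains why the extra space only arrives with the last swap: a RAIDZ vdev treats every member as if it were the size of its smallest disk. A rough sketch of the arithmetic, ignoring metadata and padding overhead, is below. The function name raidz_usable_tb is made up for illustration, and 8TB is a stand-in size for the new drives, not a figure from this build.

```shell
# Rough usable capacity of a RAIDZ vdev in TB, ignoring metadata and padding
# overhead: (number of disks - parity disks) x smallest disk.
# Usage: raidz_usable_tb PARITY SIZE_TB [SIZE_TB ...]
raidz_usable_tb() {
    parity=$1
    shift
    smallest=$1
    # Find the smallest member, since that is what limits the whole vdev.
    for size in "$@"; do
        [ "$size" -lt "$smallest" ] && smallest=$size
    done
    echo $(( ($# - parity) * smallest ))
}

raidz_usable_tb 2 4 4 4 4 4 4   # six 4TB disks: prints 16
raidz_usable_tb 2 8 8 8 8 8 4   # five swaps done (8TB is hypothetical): prints 16
raidz_usable_tb 2 8 8 8 8 8 8   # last swap done: prints 32
```

One 4TB straggler is enough to hold the whole vdev at its old size, which is why nothing visibly changes until the final resilver completes.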

So I did the boring, correct things first. A full scrub before starting, to flush out any latent errors while I still had full redundancy. A current backup, verified as actually restorable and not just "the job says it ran". And SMART long tests on the new drives before they went anywhere near the pool, because the worst outcome is replacing a good disk with a bad one and finding out mid-resilver.
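The scrub and the SMART tests can be sketched as one small shell function. The pool name tank matches the command used in this post; the function name preflight and the device paths are placeholders, and smartctl (from smartmontools) is assumed to be installed. Verifying the backup is deliberately left out, because that depends entirely on what you back up with.

```shell
# Hypothetical helper: run before the first swap.
# Usage: preflight tank /dev/sdX /dev/sdY   (device names are placeholders)
preflight() {
    pool=$1
    shift

    # Start a full scrub while the pool still has both disks of redundancy.
    zpool scrub "$pool" || return 1

    # Queue a SMART long self-test on each new drive. The test runs on the
    # drive itself in the background and can take many hours on large disks.
    for dev in "$@"; do
        smartctl -t long "$dev" || return 1
    done

    echo "scrub started on $pool; long tests queued on $# drive(s)"
}
```

Both jobs only get started here. Afterwards, zpool status tank reports the scrub result and smartctl -a on each drive includes the self-test log, and neither should be skimmed.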

The swap itself is one command per disk:

zpool replace tank /dev/disk/by-id/old-serial /dev/disk/by-id/new-serial
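If you want a guard rail around that command, one option is a wrapper that refuses to start a swap unless the pool reports ONLINE. This is a sketch, not part of the original procedure: safe_replace is a made-up name, and it assumes the old disk is still attached when the replace starts. If you have already pulled the old disk, the pool will read DEGRADED and the check would need relaxing.

```shell
# Hypothetical wrapper: only start a replace when the pool reports ONLINE.
# Usage: safe_replace tank /dev/disk/by-id/old-serial /dev/disk/by-id/new-serial
safe_replace() {
    pool=$1 old=$2 new=$3

    # -H drops the header, -o health prints just the health column.
    health=$(zpool list -H -o health "$pool") || return 1

    if [ "$health" != "ONLINE" ]; then
        echo "refusing: $pool is $health, not ONLINE" >&2
        return 1
    fi

    zpool replace "$pool" "$old" "$new"
}
```

The point of the check is to stop you starting disk five while disk four is still resilvering or while something else has quietly gone wrong.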

Then you wait, and you watch zpool status, and you try not to refresh it every thirty seconds. Each 4TB resilver took the better part of a working day. Six disks, one at a time, with a fresh scrub-of-the-nerves between each, turned into the better part of a week of the array running warm and me checking on it like a worried parent.
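For anyone who would rather not refresh by hand, the waiting can be sketched as a polling loop. It relies on zpool status printing "resilver in progress" on its scan line while a resilver is running. The function names and the ten-minute default interval are arbitrary choices for illustration.

```shell
# Reads `zpool status` text on stdin; succeeds while a resilver is running.
resilver_running() {
    grep -q "resilver in progress"
}

# Hypothetical helper: poll instead of refreshing by hand.
# Usage: wait_for_resilver tank [seconds-between-checks]
wait_for_resilver() {
    pool=$1
    interval=${2:-600}

    while zpool status "$pool" | resilver_running; do
        sleep "$interval"
    done

    # Print the final scan line, which reports how much was resilvered
    # and how many errors were found.
    zpool status "$pool" | grep "scan:"
}
```

Recent OpenZFS releases also provide zpool wait -t resilver tank, which blocks until the resilver finishes. The loop above is only a fallback that works from the status text.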

[Image: a homelab shelf with drives and cabling]

One real scare. On the fourth resilver, zpool status started showing a handful of checksum errors on one of the old disks, and not the one I was replacing at the time. Not enough to fault it, but enough to make my stomach drop, because a second disk misbehaving during a rebuild is precisely the scenario RAIDZ2's second level of parity exists to survive, and I was now leaning on it. It held. ZFS corrected the errors from parity, the resilver finished clean, and a follow-up scrub came back spotless. But it was a pointed reminder that the disk shuffle isn't free: you are deliberately spending redundancy, and old disks under sustained load are where surprises live.
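Spotting that kind of trouble early can be sketched as a filter over zpool status output that prints any device whose CKSUM column is non-zero. It assumes the usual NAME STATE READ WRITE CKSUM layout; the function name cksum_errors is made up, and large counts that zpool abbreviates (1.2K, for example) are not caught by this simple numeric check.

```shell
# Reads `zpool status` text on stdin and prints "device count" for every
# row whose CKSUM column (field 5) is a plain number greater than zero.
# Requiring field 3 (READ) to be numeric as well skips the header row
# and the free-text lines around the table.
cksum_errors() {
    awk '$3 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ && $5 > 0 { print $1, $5 }'
}
```

Run it as zpool status tank | cksum_errors. No output means no device is currently reporting checksum errors. Once the cause has been investigated and a scrub has come back clean, zpool clear tank resets the counters.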

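The last swap is the satisfying one. The moment the final 4TB disk is gone and the pool is all larger drives, the new capacity just appears, provided autoexpand is on. TrueNAS may well have enabled it on a pool it created, but in stock OpenZFS the property defaults to off, so check it with zpool get autoexpand tank before you start. If it turns out to have been off, zpool set autoexpand=on tank followed by zpool online -e tank on each new disk will claim the space afterwards.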
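That check is easy to script before the first swap. The sketch below reads the property and turns it on if needed; ensure_autoexpand is a made-up name, and tank is the pool name used throughout this post.

```shell
# Hypothetical helper: make sure autoexpand is on before the first swap.
# Usage: ensure_autoexpand tank
ensure_autoexpand() {
    pool=$1

    # -H drops the header, -o value prints only the property value (on/off).
    current=$(zpool get -H -o value autoexpand "$pool") || return 1

    if [ "$current" = "on" ]; then
        echo "autoexpand already on for $pool"
    else
        zpool set autoexpand=on "$pool" || return 1
        echo "autoexpand was $current; now set to on for $pool"
    fi
}
```

If the final resilver has already finished with the property off, setting it alone may not be enough. Running zpool online -e against each new disk asks ZFS to expand onto the extra space.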

Would I do it this way again? Yes, but with one change. I'd resilver overnight where I could, so the heavy load happened while nothing else was hitting the array and while I was asleep rather than hovering. The shuffle is the right approach for an in-place grow, and ZFS handled the whole thing with exactly the calm competence I've come to rely on. The only thing that needed managing, in the end, was me.