rebuilding my nas one disk at a time without losing my nerve

[Image: a server rack with a row of drive bays, one tray pulled half out]

My home NAS had been full for months in the way that a NAS is always full: not actually out of space, just close enough that every new thing felt like an argument. The box runs TrueNAS, the pool was a six-drive RAIDZ2 of 4TB disks, and the disks were old enough that I trusted them slightly less every time I thought about them. So the plan was to do two things at once: grow the pool to 8TB drives, and retire the oldest disks before they retired themselves. The trick with ZFS is that you can do this live, one disk at a time, without ever taking the array down, as long as you have the patience to wait for each resilver and the nerve not to touch anything while it runs.

the one-disk-at-a-time dance

The principle is simple. In a RAIDZ2 vdev you can replace a member disk with a larger one, and ZFS will resilver the new disk to match. The vdev keeps reporting its old, smaller capacity, because a vdev is only as big as its smallest member. But once every disk in the vdev has been swapped for a bigger one, the spare capacity becomes available, either automatically or after a nudge, and the pool grows. No backup-restore, no recreating datasets, no downtime beyond the slight performance hit of a resilver in progress.
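A rough sketch of that arithmetic, for anyone who wants to see why the pool does not budge until the last swap. The function name `raidz_usable_tb` is made up for illustration, and it assumes whole-TB disk sizes and ignores metadata, padding and reserved space, so the figures ZFS actually reports will differ.

```shell
# Rough usable capacity of one RAIDZ vdev, in whole TB.
# usage: raidz_usable_tb PARITY SIZE_TB [SIZE_TB ...]
# Hypothetical helper: (disk count - parity disks) x smallest member.
raidz_usable_tb() {
    parity="$1"; shift
    count="$#"
    smallest="$1"
    for size in "$@"; do
        if [ "$size" -lt "$smallest" ]; then
            smallest="$size"
        fi
    done
    echo $(( (count - parity) * smallest ))
}

raidz_usable_tb 2 4 4 4 4 4 4   # six 4TB disks: prints 16
raidz_usable_tb 2 8 8 8 4 4 4   # halfway through the swap: still prints 16
raidz_usable_tb 2 8 8 8 8 8 8   # all six replaced: prints 32
```

The middle line is the whole story: three 8TB disks in, and the smallest member is still 4TB, so nothing changes yet.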

Before any of that, I made a point of doing the one thing everyone skips and then regrets: I checked that my backups were real. A disk migration is the single most likely moment to lose a pool, because you are deliberately removing redundancy one drive at a time, and a RAIDZ2 with one disk offline and its replacement still resilvering is running on a thinner safety margin than it looks. If one of the surviving disks had failed mid-resilver, RAIDZ2 would have carried it, but I did not want to find out experimentally whether my luck stretched to another one after that. So I confirmed the offsite copy of the things I actually could not lose was current before I pulled a single tray. The pool is convenience. The backup is the safety net. Treating the pool as the safety net is how people post very sad forum threads.

So the loop, repeated six times, was:

# offline the old disk
zpool offline tank gptid/<old>

# physically swap the tray, then replace in the pool
zpool replace tank gptid/<old> gptid/<new>

# watch the resilver, then go and do literally anything else
zpool status tank

Each resilver on this hardware took the better part of a day for a reasonably full 4TB member. Six disks, six resilvers, the better part of a week of calendar time, although almost none of it was my time. That is the bit people get wrong when they hear "a week." The array was online and serving the whole time. I was just waiting.
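If watching `zpool status` by hand gets old, the waiting can be scripted. This is a minimal sketch, not what I ran: `resilver_running` and `wait_for_resilver` are hypothetical helper names, and the check assumes the `scan:` line of `zpool status` contains the phrase "resilver in progress" while one is running.

```shell
# Succeeds if the given `zpool status` text reports a resilver still running.
resilver_running() {
    printf '%s\n' "$1" | grep -q 'resilver in progress'
}

# Hypothetical helper: block until the resilver on a pool has finished.
# usage: wait_for_resilver POOL [SECONDS_BETWEEN_CHECKS]
wait_for_resilver() {
    pool="$1"
    interval="${2:-600}"   # default: check every ten minutes
    while resilver_running "$(zpool status "$pool")"; do
        sleep "$interval"
    done
}

# Example, using the pool name from this post:
#   wait_for_resilver tank && echo "resilver done, next disk"
```

Newer OpenZFS releases also ship a `zpool wait` subcommand that can block on a resilver directly, which is tidier if your version has it.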

The one setting that matters at the end is autoexpand, which is a pool property rather than a command flag. It is off by default, so the pool will not grow into the new space even after every disk is replaced, because growing a pool is the sort of thing you want to be deliberate about. Set it on, and the pool expands to the new minimum-disk size once the last resilver completes.

zpool set autoexpand=on tank

[Image: a close-up of zpool status part way through a resilver, the percentage crawling upward]

I had read enough horror stories to set this before the final replace rather than scrambling for it after, and to double-check it had actually taken effect rather than assuming. Assuming is how you end up with a pool full of 8TB disks reporting 4TB of capacity and a forum thread full of people who did exactly that.
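Checking is a one-liner. The sketch below wraps it in a hypothetical `autoexpand_is_on` function so it can sit in a script; the pool name `tank` is the one from this post, and the device name in the last comment is the same placeholder used in the loop above.

```shell
# Succeeds only if autoexpand is on for the named pool.
# -H drops the header row and -o value prints just the setting itself.
autoexpand_is_on() {
    [ "$(zpool get -H -o value autoexpand "$1")" = "on" ]
}

# Example:
#   autoexpand_is_on tank || echo "autoexpand is still off"
#
# If the last disk went in before the property was set, the manual nudge
# is to ask ZFS to expand each replaced device in turn:
#   zpool online -e tank gptid/<new>
```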

the disk i was glad to lose

The part of this I did not plan for was the most useful. ZFS resilvering is, in effect, a read of every allocated block on the surviving disks to reconstruct the replacement. On a pool this full, that is close to a full read of the array, repeated once per disk swap. Partway through the third resilver, one of the remaining old disks started returning bad data. Not enough to fault out, but enough that ZFS flagged checksum corrections and the SMART counters started climbing.

$ zpool status tank
  scan: resilver in progress
config:
        NAME            STATE     READ WRITE CKSUM
        tank            DEGRADED     0     0     0
          raidz2-0      DEGRADED     0     0     0
            gptid/...   ONLINE       0     0     0
            gptid/...   ONLINE       0     0    14
            replacing   ONLINE       0     0     0

Fourteen checksum corrections on a disk I had been planning to keep. That disk moved straight to the top of the replacement queue. The point is that I found a dying disk while I still had parity in hand and was already in the middle of a controlled migration, rather than three months later during a real failure with no warning and no plan. The act of churning the whole array is itself a stress test, and stress tests find the weak member. This one found it on my schedule instead of its own.
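Spotting that by eye works on six disks, but it is easy to filter for. A small sketch, assuming the usual NAME / STATE / READ / WRITE / CKSUM column layout and plain integer counters; `disks_with_errors` is a made-up name, and large counts that `zpool status` abbreviates (it has a `-p` option for exact values) would slip past the integer match.

```shell
# Read `zpool status` text on stdin and print "name total" for every row
# whose READ + WRITE + CKSUM counters add up to more than zero.
# Pool and vdev rows are printed too if their own counters are non-zero.
disks_with_errors() {
    awk '
        $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ {
            total = $3 + $4 + $5
            if (total > 0) print $1, total
        }
    '
}

# Example, using the pool name from this post:
#   zpool status -p tank | disks_with_errors
```

An empty result is the boring, good outcome. Anything it prints is a candidate for the next swap.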

One detail worth flagging for anyone planning the same thing: replacement order matters less than disk health. I had assumed I would replace the oldest disks first and work towards the newest, on the theory that the oldest were likeliest to die. In practice the array told me its own preferred order through the checksum counters, and I followed that instead. The disk throwing corrections jumped the queue regardless of its age. Let the data decide which drive is most at risk, rather than your assumptions about which one is oldest, because the two are not always the same disk.

I ran a full scrub once the dust settled, which came back clean, and the pool is now six 8TB disks with the original data intact and a comfortable amount of room to be wasteful again. It cost a week of patience and a careful eye on zpool status. It cost no downtime and, more to the point, no data. The lesson, if there is one beyond "ZFS is very good at this," is that the safest time to find a failing disk is when you are looking at the whole array on purpose, with redundancy to spare, and nothing on fire. So if you are going to grow a pool anyway, treat the resilvers as a free health check and watch the checksum columns on the disks you are not replacing. They will tell you which one to do next.