Ramblings of an aging IT geek
linux

a kernel panic that did me the courtesy of being repeatable

A homelab host that panicked on a specific workload, and how a reliably reproducible crash turned a nightmare into an afternoon.

Most crashes are cruel because they're random. They happen at 4am, leave nothing useful behind, and vanish the moment you start watching. So when one of my hosts started panicking and I realised I could trigger it on demand, I was almost cheerful about it.

The pattern was specific: kick off a large rsync onto the ZFS pool and within a minute or two the box would lock hard, then panic. Every single time. That repeatability is gold. A bug you can summon is a bug you can corner.

I set up a serial console so I could actually see the trace instead of guessing, and turned on persistent logging so the previous boot's kernel messages would still be there after the reset:

$ journalctl -k -b -1 | tail -40
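The setup itself isn't shown above. A minimal sketch of the two pieces, assuming a systemd host that boots with GRUB and has a serial port at ttyS0 running at 115200 baud (all three are assumptions for illustration, not details from this box):

```
# /etc/default/grub  (assumed bootloader; regenerate the GRUB config afterwards)
# Kernel messages go to every console listed, so the panic shows up on both.
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"

# /etc/systemd/journald.conf
# Keep the journal on disk so "journalctl -b -1" has a previous boot to read.
[Journal]
Storage=persistent
```

The serial console matters because a hard panic frequently never reaches the on-disk journal; the journal mostly tells you what happened in the run-up.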

The trace pointed squarely at the storage stack under memory pressure. The box had 8GB of RAM and I'd never set a cap on the ZFS ARC, so a heavy write would squeeze the kernel into a corner and something would give.
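If you want to watch that squeeze happen rather than infer it from a trace, OpenZFS on Linux exposes its ARC counters in /proc/spl/kstat/zfs/arcstats. A rough sketch, where `arc_summary` is a hypothetical helper name of mine and not a ZFS tool:

```shell
#!/bin/sh
# Hypothetical helper: print the ARC's current size and its ceiling in MiB.
# arcstats rows are "name type data"; the value is the third column.
# Takes an optional file path so it can be pointed at a saved copy.
arc_summary() {
    awk '$1 == "size" || $1 == "c_max" { printf "%s %d MiB\n", $1, $3 / 1048576 }' \
        "${1:-/proc/spl/kstat/zfs/arcstats}"
}

# Only read the live counters when the ZFS module is actually loaded.
if [ -r /proc/spl/kstat/zfs/arcstats ]; then
    arc_summary
    free -m
fi
```

Run it in a loop during the rsync and compare `size` against what `free -m` reports as available.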

The fix was unglamorous: cap the ARC.

# /etc/modprobe.d/zfs.conf
# zfs_arc_max is in bytes: 3 GiB here, on an 8GB box
options zfs zfs_arc_max=3221225472
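zfs_arc_max takes a byte count, so for a different cap you have to do the multiplication yourself. A small sketch, with `arc_max_bytes` being a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: convert a whole number of GiB into bytes for zfs_arc_max.
arc_max_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

arc_max_bytes 3   # prints 3221225472, the value in the config above
```

The same parameter is also exposed at /sys/module/zfs/parameters/zfs_arc_max, so as root you can write a value there to try a cap without rebooting before committing it to modprobe.d. Depending on the distro, the modprobe.d change may also need the initramfs rebuilt before it applies at boot.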

Reboot, run the same rsync that had killed it ten times in a row, and it just... worked. No panic, no lock-up, just a slightly slower copy and a host that stayed up.

The whole thing took an afternoon, almost all of which was the diagnosis. The lesson I keep relearning: a reproducer is most of the battle. If you can make it break on command, you're not debugging any more, you're just narrowing down.