a kernel panic i could actually reproduce

A terminal showing a Linux kernel oops trace

Most kernel panics I've met have been ghosts. The box is fine for three weeks, then it's a brick at 4am, and by the time you're looking there's nothing left to look at but an empty console and a vague memory of the lights being off. You reboot, you mutter something about cosmic rays, you move on. You never really know.

This one was different, and that's why I'm writing it down. This one I could reproduce on demand, and a reproducible kernel bug is a gift.

The box is a small Xeon server I use for builds and the occasional VM. It started panicking under heavy I/O, specifically when I ran a large rsync onto its ZFS pool while a build was hammering the CPU. Annoying, but the magic word there is "when". I had a recipe.

A 1U server pulled out on rails, lid off, drives visible

First job: catch the actual trace. A panic that scrolls off a console you weren't watching is no use, so I set up netconsole to fire the kernel messages at another machine over UDP.

modprobe netconsole netconsole=<src-port>@<src-ip>/eth0,6666@<listener-ip>/

Then on the listener, a plain nc -u -l 6666 capturing to a file. Ran the rsync, kicked off the build, waited. About ninety seconds in, the screen went, and over on the listener I had the whole thing: a NULL pointer dereference deep in the ZFS ARC reclaim path, with a stack trace I could actually read.
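For anyone wiring this up from scratch, the netconsole parameter is one string of the form [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]. Below is a minimal sketch of assembling it. The helper name netconsole_param and the 192.0.2.x addresses (from the documentation range) are assumptions for illustration, not the values used on this box. Only eth0 and port 6666 come from the setup above.

```shell
#!/bin/sh
# Build a netconsole= parameter string from its parts.
# Layout: [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
# An empty target MAC leaves the kernel to fall back to broadcast.
netconsole_param() {
    src_port=$1 src_ip=$2 dev=$3 tgt_port=$4 tgt_ip=$5 tgt_mac=$6
    printf '%s@%s/%s,%s@%s/%s\n' \
        "$src_port" "$src_ip" "$dev" "$tgt_port" "$tgt_ip" "$tgt_mac"
}

# Hypothetical addresses; substitute the panicking box and the listener.
param=$(netconsole_param 6665 192.0.2.10 eth0 6666 192.0.2.20 "")

# Print the command rather than running it, since loading the module needs root.
echo "modprobe netconsole netconsole=$param"

# On the listener, something like this keeps a copy on disk while you watch.
# netcat flavours differ; some want `nc -u -l -p 6666` instead.
#   nc -u -l 6666 | tee panic.log
```

Printing the command first makes it easy to eyeball the string before loading the module, which matters because a malformed parameter tends to fail quietly.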

That changed the problem entirely. It wasn't random. It was the ARC trying to give memory back under pressure and tripping over itself doing it. The CPU-heavy build was the accomplice, not the culprit: it was squeezing free memory hard enough that ZFS had to start reclaiming aggressively, and the reclaim path was where the bug lived.
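If you want to watch that squeeze happen rather than infer it from a trace, ZFS on Linux exposes ARC counters in /proc/spl/kstat/zfs/arcstats, and /proc/meminfo reports MemAvailable. This is a rough sketch of sampling both. The function names arc_stat and pressure_snapshot are made up for illustration, and it assumes the usual three-column "name type data" layout of arcstats.

```shell
#!/bin/sh
# Print one counter from an arcstats-style file (columns: name type data).
# $1 = counter name, $2 = file (defaults to the live kstat path).
arc_stat() {
    awk -v key="$1" '$1 == key { print $3 }' "${2:-/proc/spl/kstat/zfs/arcstats}"
}

# Print current ARC size, the ARC ceiling and available memory, all in MiB.
# Both file arguments are optional and default to the live /proc paths.
pressure_snapshot() {
    arc_file=${1:-/proc/spl/kstat/zfs/arcstats}
    mem_file=${2:-/proc/meminfo}
    size=$(arc_stat size "$arc_file")            # bytes
    c_max=$(arc_stat c_max "$arc_file")          # bytes
    avail_kb=$(awk '/^MemAvailable:/ { print $2 }' "$mem_file")  # kB
    echo "arc=$((size / 1048576))MiB arc_max=$((c_max / 1048576))MiB avail=$((avail_kb / 1024))MiB"
}

# To sample once a second while the rsync and the build are running:
#   while sleep 1; do pressure_snapshot; done
```

Watching avail fall while arc stays pinned near arc_max is the pattern you'd expect just before reclaim has to get aggressive.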

The fix, once I knew that, was almost boring. I was on an older ZFS-on-Linux build than I'd thought, a version behind the release in which this exact reclaim path had been reworked. I'd been putting off the upgrade because the box was "stable", a word that was doing a lot of quiet work given it panicked under load. I updated the package, rebuilt the module, rebooted, and ran my recipe again. And again. And a third time, because I didn't quite believe it. No panic.
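One way to avoid being surprised by the version you're actually running is to compare the module the kernel has loaded against the one sitting on disk. A small sketch follows. It assumes /sys/module/zfs/version is present (it is when the zfs module is loaded) and that modinfo can find the installed module. The helper describe_versions and the version strings in its comment are hypothetical.

```shell
#!/bin/sh
# Report whether the loaded zfs module matches the one installed on disk.
# $1 = loaded version, $2 = on-disk version.
describe_versions() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        # e.g. describe_versions 0.8.3-1 2.1.5-1 (made-up version strings)
        echo "mismatch: loaded $1, on disk $2"
    fi
}

# Version of the module the running kernel has loaded.
loaded=$(cat /sys/module/zfs/version 2>/dev/null || echo "not loaded")
# Version of the module a fresh modprobe (or a reboot) would pick up.
installed=$(modinfo -F version zfs 2>/dev/null || echo "not installed")

describe_versions "$loaded" "$installed"
```

A mismatch after a package update usually just means the new module hasn't been loaded yet, which is why the reboot is part of the fix.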

I also capped the ARC while I was in there, because letting it eat right up to the memory ceiling and then fight the build for the scraps was asking for trouble regardless:

options zfs zfs_arc_max=8589934592

Eight gigabytes, leaving plenty of headroom for the builds, which was the actual workload I cared about. The pool doesn't need every spare byte of RAM as cache to do its job, and giving it a hard ceiling means the reclaim path simply gets exercised far less violently.
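That number is just 8 × 1024³ bytes, and it's easier to generate than to type. A minimal sketch is below. The helper name gib_to_bytes is invented here, and /etc/modprobe.d/zfs.conf is the conventional home for the options line, not something confirmed for this box.

```shell
#!/bin/sh
# Convert whole GiB to bytes, the unit zfs_arc_max expects.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

arc_max=$(gib_to_bytes 8)

# The line to place in a modprobe config file (assumed: /etc/modprobe.d/zfs.conf).
echo "options zfs zfs_arc_max=$arc_max"   # prints: options zfs zfs_arc_max=8589934592

# To apply the cap to a running system without a reboot (as root):
#   echo "$arc_max" > /sys/module/zfs/parameters/zfs_arc_max
```

If the zfs module is loaded from an initramfs, the image may need regenerating before the modprobe.d setting takes effect at boot. How that's done depends on the distribution.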

The genuinely satisfying part wasn't the fix. It was the half hour with netconsole and the recipe, turning "the server sometimes dies" into "the ARC reclaim path dereferences a null pointer under memory pressure on this version". Once a bug has a sentence that specific, it's basically already solved. The rest is admin.

If there's a lesson, it's this: the instant a panic is reproducible, stop guessing and go get the trace. netconsole, a serial console, kdump, whatever you've got. The trace is the whole game. Everything I did after reading mine was obvious. Everything before it was superstition.