The rare gift of a kernel panic that reproduces on demand

Most kernel panics in production are ghosts. The box reboots, you find a half-truncated trace in the logs if you are lucky, and you spend a fortnight wondering whether it was the hardware, the firmware, the moon, or you. So when a box started panicking and I could make it happen again on demand, I was almost delighted. A reproducible panic is not a problem; it is a present.

The symptom was a hard lockup on one of our older file servers, running a 5.x kernel, whenever a particular batch job hammered an NFS mount with thousands of small concurrent stats and opens. The job would run fine for an hour, then the server would drop off the network entirely. The first useful step was getting the panic off the screen and into a file, because a trace you cannot read twice is barely a trace at all.

Catching the trace

The machine had no serial console worth the name, so I leaned on kdump. If you run servers and you have not set this up, do it before you need it:

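# reserve memory for the crash kernel: add this to the kernel command line
# in your bootloader config, then reboot so the reservation takes effect
#   crashkernel=256M

# after the reboot, enable the service and confirm it is armed
systemctl enable --now kdump
kdumpctl status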
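
If you want a second opinion beyond kdumpctl, the kernel itself exposes enough to check that the reservation and the capture kernel are both in place. A minimal sketch, assuming a Linux box with /proc/cmdline and /sys/kernel/kexec_crash_loaded available; the function names are mine and not part of any kdump tooling:

```shell
#!/usr/bin/env bash
# Sketch: sanity-check that crash kernel memory is reserved and a capture
# kernel is loaded. Function names are illustrative (hypothetical helpers).

# Print the crashkernel= parameter found in a kernel command line string.
# Returns non-zero if the parameter is absent.
crashkernel_arg() {
  printf '%s\n' "$1" | tr ' ' '\n' | grep -m1 '^crashkernel='
}

# Report whether kdump looks armed on the running system.
kdump_armed() {
  local arg loaded
  arg=$(crashkernel_arg "$(cat /proc/cmdline)") || {
    echo "no crashkernel= on the running kernel's command line"
    return 1
  }
  # This file reads 1 once a capture kernel has been loaded via kexec.
  loaded=$(cat /sys/kernel/kexec_crash_loaded 2>/dev/null || echo 0)
  if [ "$loaded" = "1" ]; then
    echo "armed ($arg)"
  else
    echo "reserved ($arg) but no capture kernel loaded"
    return 1
  fi
}
```

Calling kdump_armed after the reboot should agree with what kdumpctl status says; if the two disagree, that is worth chasing before you rely on either.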

With kdump armed, the next panic dropped a vmcore into /var/crash instead of just freezing. That alone changed everything. I now had the same backtrace to stare at as many times as I liked, instead of squinting at a photo of a monitor taken on my phone.
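
After a few test panics /var/crash fills up with dump directories, so it helps to grab the newest one without thinking. A rough sketch, assuming GNU find and that each dump is a file named vmcore somewhere under the crash directory; latest_vmcore is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch: print the most recently modified vmcore under a crash directory.
# Assumes GNU find (-printf). Paths containing newlines are not handled.
latest_vmcore() {
  local dir=${1:-/var/crash}
  find "$dir" -type f -name vmcore -printf '%T@ %p\n' 2>/dev/null |
    sort -rn |      # newest modification time first
    head -n1 |
    cut -d' ' -f2-  # drop the timestamp, keep the path
}

# Typical use, with the crash utility and a debug-symbol vmlinux installed
# (the vmlinux location depends on the distribution, so it is left open):
#   crash <path-to-debug-vmlinux> "$(latest_vmcore)"
# Inside crash, `bt` prints the backtrace of the panicking task and `log`
# prints the kernel ring buffer as it was at the time of the crash.
```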

Building the reproducer

The batch job was too big to use as a test, so I boiled it down. A short loop that opened, stat-ed and closed a few thousand files across the NFS mount in parallel, run from a handful of background workers, tripped the fault in under two minutes. Something like:

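# 8 parallel workers, each walking every file on the mount
for i in $(seq 1 8); do
  ( for f in /mnt/nfs/spool/*; do
      stat "$f" >/dev/null   # metadata lookup
      : <"$f"                # open and close without reading
    done ) &
done
wait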

Crude, but it panicked the box reliably, which is the only quality a reproducer needs. Once I had that, I could bisect behaviour: it only fired against this one NFS server, only with a specific mount option set, and never against a local filesystem. That narrowed it from "the kernel hates me" to "an interaction between this NFS client code path and that server under concurrency."
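
Narrowing it down that way is easier when the loop takes its target as an argument, so the same load can be pointed at the suspect mount, a different NFS server, or a local directory as a control. A sketch of that idea, not the exact script I ran; run_repro and the example paths in the comments are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: run the stat/open loop against any directory with N parallel workers.
# Usage: run_repro DIR [WORKERS]
run_repro() {
  local dir=$1 workers=${2:-8} i f
  for i in $(seq 1 "$workers"); do
    (
      for f in "$dir"/*; do
        [ -e "$f" ] || continue   # empty directory: the glob matched nothing
        stat "$f" >/dev/null      # metadata lookup
        if [ -f "$f" ]; then
          : <"$f"                 # open and close a regular file without reading
        fi
      done
    ) &
  done
  wait   # block until every worker has finished
}

# Example comparison (paths are placeholders for illustration):
#   run_repro /mnt/nfs/spool        # the mount that triggers the fault
#   run_repro /var/tmp/spool-copy   # local copy of the same files, as a control
```

There is no output to compare here. The result of each run is simply whether the box is still up afterwards, so change one variable at a time between runs.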

What it actually was, and the lesson

The fix, in the end, was a kernel update. The backtrace pointed squarely at the NFS client, the call path matched a bug already fixed upstream in a later stable release, and moving to a patched kernel made the reproducer boring: it ran the loop, finished, and the box stayed up. No clever patch of my own, no heroics.

The lesson was not about NFS. It was that the entire fight is getting from "it sometimes falls over" to "it falls over when I do exactly this." Set up kdump in advance, treat a reproducer as the actual deliverable, and the diagnosis tends to follow. The bug was someone else's to fix. Making it visible was mine.