Ramblings of an aging IT geek
linux

a kernel panic with the decency to be repeatable

A reliably reproducible kernel panic on a homelab box that turned out to be a faulty driver path triggered by a specific NIC operation.


Most kernel panics are ghosts. They happen at 3am, leave a half-corrupted trace nobody captured, and never come back to be questioned. So when a box started panicking on a specific, repeatable action, I was almost pleased. A bug you can summon on demand is a bug you can kill.

The trigger was bringing a particular network interface up and down in quick succession. Once was fine. Do it twice within a second or two and the box would lock, then spit a stack trace ending somewhere deep in the NIC driver. Reliably. Every time. I could make it fall over on cue, which after years of unreproducible weirdness felt like a small gift.
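A reproducer for this kind of trigger can be very small. The sketch below is a minimal illustration, not the exact script used here: the interface name `eth0`, the function names, and the half-second delay are all placeholders. It shells out to the standard `ip link set dev <iface> up|down` command, needs root to do anything real, and should only be pointed at a box you are happy to crash.

```python
import subprocess
import time
from typing import Callable, List, Optional


def link_cmd(iface: str, state: str) -> List[str]:
    """Build the iproute2 command that sets an interface up or down."""
    if state not in ("up", "down"):
        raise ValueError(f"state must be 'up' or 'down', got {state!r}")
    return ["ip", "link", "set", "dev", iface, state]


def flap(
    iface: str,
    cycles: int,
    delay_s: float = 0.5,  # assumed value; short enough to land inside "a second or two"
    run: Optional[Callable[[List[str]], object]] = None,
    sleep: Callable[[float], object] = time.sleep,
) -> List[List[str]]:
    """Bring iface up then down `cycles` times; return the commands issued.

    `run` and `sleep` are injectable so the loop can be dry-run without root.
    """
    if run is None:
        # Default runner: execute for real and raise if `ip` reports an error.
        def run(cmd: List[str]) -> object:
            return subprocess.run(cmd, check=True)

    issued: List[List[str]] = []
    for _ in range(cycles):
        for state in ("up", "down"):
            cmd = link_cmd(iface, state)
            run(cmd)
            issued.append(cmd)
            sleep(delay_s)
    return issued


# Real run (as root, on a sacrificial machine):
#     flap("eth0", cycles=2)
# Dry run that only prints the commands:
#     flap("eth0", cycles=2, run=print, sleep=lambda s: None)
```

Passing the runner in as a parameter is just a convenience: it lets you check the command sequence on a machine you care about before running it for real on one you don't.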

With a serial console capturing the full trace, the call stack pointed at a use-after-free in the driver's teardown path, the kind of race that only shows up when you bring the interface down before it has finished coming up. I confirmed it wasn't my hardware by reproducing it on a second, identical box. Then I checked the changelog for the stable kernel and found that a fix for exactly this driver and exactly this race had already landed upstream. I pinned the box to a newer point release that carried it, rebooted, and the panic was gone. The interface flap loop that used to kill the box in seconds now runs all day.
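Checking the changelog doesn't have to mean reading it top to bottom. As a rough sketch, assuming you have saved a stable-release changelog as text in the usual git-log layout (a `commit <hash>` line, header lines, then an indented message), something like the following narrows it to entries worth reading. The function name, the driver name `examplenic`, and the keyword list are made up for illustration, and the match is a crude substring filter, so expect false positives.

```python
import re
from typing import Dict, List, Sequence


def find_candidate_fixes(
    changelog: str,
    driver: str,
    keywords: Sequence[str] = ("use-after-free", "race"),
) -> List[Dict[str, str]]:
    """Return commits whose text mentions the driver and any keyword.

    Assumes git-log style entries. Matching is a case-insensitive substring
    test, so a keyword like "race" will also match "trace".
    """
    # Split at the start of every "commit <hex>" line, keeping that line
    # with the entry that follows it (zero-width split, Python 3.7+).
    entries = re.split(r"(?m)^(?=commit [0-9a-f]{7,40}\b)", changelog)

    hits: List[Dict[str, str]] = []
    for entry in entries:
        match = re.match(r"commit ([0-9a-f]{7,40})", entry)
        if not match:
            continue  # preamble text before the first commit
        lowered = entry.lower()
        if driver.lower() in lowered and any(k.lower() in lowered for k in keywords):
            # The subject is the first non-empty indented line of the message.
            subject = next(
                (ln.strip() for ln in entry.splitlines()
                 if ln.startswith(" ") and ln.strip()),
                "",
            )
            hits.append({"commit": match.group(1), "subject": subject})
    return hits
```

Anything it returns is only a candidate. You still have to read the commit and check that it describes the same code path as your trace.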

The moral isn't subtle: a reproducer is ninety percent of a fix. Capture the trace properly, prove it on a second machine, and check whether someone has already done the hard part for you upstream. They usually have.