Ramblings of an aging IT geek
linux

a kernel panic i could actually reproduce

A NIC driver was panicking under load, and for once the panic was reproducible, which made it almost a pleasure to chase down.


Most kernel panics are ghosts. The box falls over at 3am, the screen is blank, and all you have is a stack trace you might have caught if kdump had been configured, which of course it wasn't. You stare at it, you shrug, you reboot, and you wait for it to happen again at the worst possible moment.

This one was different, and I am still slightly grateful. A particular box would panic every time we pushed real traffic through a second NIC. Not sometimes. Every time. Within about ninety seconds of the load test starting, reliably, with the same trace pointing into the network driver. A reproducible panic is a gift. It means you can actually do science instead of superstition.

So I set up kdump properly, captured the crash, and walked the trace. It was a null pointer dereference in the driver's receive path, the sort of thing that only shows up when the ring buffer is under genuine pressure. A quick search through the changelog for a newer kernel turned up a fix for exactly that, merged a few releases after the one we were pinned to.

We bumped the kernel, ran the same load test, and it sat there happily doing its job. No panic. The reproducer that had been our enemy for two days became the regression test that proved the fix.

The moral, such as it is: configure kdump before you need it, and treat a reliably reproducible crash as the good day, not the bad one. It is the intermittent ones that age you.
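
For what it's worth, "configure kdump before you need it" is easy to check on a box. What follows is a minimal sketch, not what I ran at the time. It assumes a Linux system that exposes /proc/cmdline and /sys/kernel/kexec_crash_loaded, and the function names are my own invention. It only tells you whether memory was reserved for a crash kernel and whether one is currently loaded. It says nothing about where the dump will be written, which is down to your distro's kdump tooling.

```python
from pathlib import Path
from typing import Optional


def crashkernel_reserved(cmdline: str) -> bool:
    """True if the kernel command line carries a crashkernel= parameter."""
    return any(arg.startswith("crashkernel=") for arg in cmdline.split())


def read_text_or_none(path: Path) -> Optional[str]:
    """Return the stripped file contents, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def kdump_status(
    cmdline_path: Path = Path("/proc/cmdline"),
    loaded_path: Path = Path("/sys/kernel/kexec_crash_loaded"),
) -> dict:
    """Report the two preconditions for capturing a crash dump."""
    cmdline = read_text_or_none(cmdline_path) or ""
    return {
        # Memory must be set aside at boot for the capture kernel.
        "crashkernel_reserved": crashkernel_reserved(cmdline),
        # The kernel reports "1" here once a crash kernel has been loaded.
        "crash_kernel_loaded": read_text_or_none(loaded_path) == "1",
    }


if __name__ == "__main__":
    for check, ok in kdump_status().items():
        print(f"{check}: {'yes' if ok else 'NO'}")
```

If either line comes back NO, the next panic on that box will be one of the ghosts.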