Ramblings of an aging IT geek
linux

a kernel panic I could actually reproduce

Tracking down a reproducible kernel panic on a homelab host, from the first useless stack trace to a kdump that finally pointed at the right driver.

A Linux terminal mid-debug

Most kernel panics I have met arrive once, scroll half a stack trace off the top of the console, and never return to explain themselves. This one was different, and that made it almost a pleasure. It happened every time I put real traffic across a particular 10GbE NIC, roughly ninety seconds in, like clockwork. A bug you can summon on demand is a bug you can kill.

The host is a fairly boring Debian box, one of the workhorses in the rack. It started panicking after I moved some storage replication onto the faster card. The first symptom was just the machine vanishing from the network: no logs, and nothing in journalctl after a reboot, because the box never got the chance to flush anything to disk.

The server in question, lights and all

getting the panic to talk

The first job was capturing the panic at all. A panic that takes the box down before it can write anything is useless to you. So I set up kdump, which reserves a slice of memory at boot and boots a tiny second kernel after the crash, just long enough to dump the failed kernel's memory to disk.

# /etc/default/grub: reserve memory for the crash kernel
GRUB_CMDLINE_LINUX_DEFAULT="quiet crashkernel=256M"

# then, as root
apt install kdump-tools crash
update-grub
reboot
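It is worth confirming the reservation took effect before waiting for the next real panic. A minimal sketch, assuming the standard kernel interfaces /proc/cmdline and /sys/kernel/kexec_crash_loaded; the function name kdump_ready and its two optional arguments are made up for illustration and are not part of kdump-tools:

```shell
#!/bin/sh
# Report whether the crash kernel is reserved and loaded.
# The two optional arguments (hypothetical, for illustration) let the check
# be pointed at copies of the files; by default it reads the live kernel.
kdump_ready() {
    cmdline_file="${1:-/proc/cmdline}"
    loaded_file="${2:-/sys/kernel/kexec_crash_loaded}"

    # crashkernel= must have survived update-grub and the reboot
    if ! grep -q 'crashkernel=' "$cmdline_file" 2>/dev/null; then
        echo "no crashkernel= on the kernel command line"
        return 1
    fi

    # 1 means a capture kernel is loaded and waiting for a panic
    if [ "$(cat "$loaded_file" 2>/dev/null)" != "1" ]; then
        echo "crash kernel reserved but not loaded"
        return 1
    fi

    echo "kdump armed"
}
```

Calling kdump_ready with no arguments checks the running system. On Debian, kdump-config show reports much the same thing. For an end-to-end test on a box you can afford to lose, running echo c > /proc/sysrq-trigger as root forces a crash on purpose.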

After that, the next panic left me a vmcore in /var/crash. Suddenly I had something to read instead of a photograph of a frozen console taken on my phone.

reading the corpse

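# needs the matching linux-image-$(uname -r)-dbg package, and assumes you are still booted into the kernel that crashed
crash /usr/lib/debug/boot/vmlinux-$(uname -r) /var/crash/*/dump.*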
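Once a second panic has left its own dump, that glob matches more than one file. A small sketch of picking the newest dump and running a fixed set of crash commands non-interactively; the helper names latest_vmcore and write_crash_commands and the /tmp/triage.cmds path are hypothetical, while sys, bt and log are ordinary crash commands and -s and -i are crash's silent and input-file options:

```shell
#!/bin/sh
# Print the most recently modified dump file under a crash directory.
# Defaults to /var/crash, where kdump-tools left the vmcore.
latest_vmcore() {
    crash_dir="${1:-/var/crash}"
    # ls -t sorts newest first; the dump names are timestamps, so no spaces
    ls -t "$crash_dir"/*/dump.* 2>/dev/null | head -n 1
}

# Write the crash commands to run, one per line, into the given file:
# system summary, backtrace of the panicking task, kernel log, then exit.
write_crash_commands() {
    cat > "$1" <<'EOF'
sys
bt
log
quit
EOF
}

# Usage (as root, not run here):
#   write_crash_commands /tmp/triage.cmds
#   crash -s -i /tmp/triage.cmds \
#       /usr/lib/debug/boot/vmlinux-$(uname -r) "$(latest_vmcore)" > triage.txt
```

Saving the output to a file also gives you something to paste into a bug report or a search box.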

The backtrace pointed straight into the NIC driver, in the path that handles receive interrupts under load. Not my code, not anything I could obviously fix, but at least an honest answer. The culprit was the combination of the driver version shipped with the distro kernel and the specific firmware on the card. It was a known bad pairing if you went looking, which of course I could only do once I had a panic message to search for.
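To search for a bad driver and firmware pairing you need both version strings, and ethtool -i prints them. A sketch that boils its output down to one searchable line; the function name nic_fingerprint is invented here, and eth0 in the usage comment is a placeholder for whatever the interface is actually called:

```shell
#!/bin/sh
# Read `ethtool -i` output on stdin and print the three fields that matter
# for a driver/firmware mismatch. Each input line is "key: value".
nic_fingerprint() {
    awk -F': ' '
        $1 == "driver"           { driver = $2 }
        $1 == "version"          { version = $2 }
        $1 == "firmware-version" { firmware = $2 }
        END { printf "driver=%s version=%s firmware=%s\n", driver, version, firmware }
    '
}

# Usage (interface name is an assumption, not run here):
#   ethtool -i eth0 | nic_fingerprint
```

Recording that line before and after a kernel change also makes it obvious whether the driver version actually moved.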

The fix was unglamorous. I pinned a newer kernel from backports, which carried a patched version of the driver, and the panic simply stopped. Ninety seconds of traffic, then two minutes, then an hour, then a week. Silence.
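For anyone repeating this, pulling a kernel from backports can look roughly like the sketch below. It is not a record of the exact commands run on this box. It assumes a Debian release whose /etc/os-release carries VERSION_CODENAME and an amd64 machine (hence the linux-image-amd64 metapackage); the function name backports_suite is made up for the example:

```shell
#!/bin/sh
# Print the backports suite for this release, i.e. "<codename>-backports".
# Reads VERSION_CODENAME from os-release; the path argument is optional.
backports_suite() {
    os_release="${1:-/etc/os-release}"
    codename=$(sed -n 's/^VERSION_CODENAME=//p' "$os_release" 2>/dev/null | tr -d '"')
    # Fail rather than print a bare "-backports" if the codename is missing
    [ -n "$codename" ] || return 1
    echo "${codename}-backports"
}

# Usage (as root, not run here; amd64 is an assumption):
#   suite=$(backports_suite)
#   echo "deb http://deb.debian.org/debian $suite main" \
#       > /etc/apt/sources.list.d/backports.list
#   apt update
#   apt install -t "$suite" linux-image-amd64
```

After the reboot, uname -r should show the backports kernel, and the debug symbols package has to be reinstalled for the new version if you want crash to keep working.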

what I took away

The lesson is not really about this NIC. It is that a reproducible fault is a gift, and the time you spend making a fault reproducible and observable is almost always cheaper than the time you spend guessing. Set up kdump before you need it. Future you, staring at a blank console at midnight, will be grateful.