Most kernel panics I have met arrive once, scroll half a stack trace off the top of the console, and never return to explain themselves. This one was different, and that made it almost a pleasure. It happened every time I put real traffic across a particular 10GbE NIC, roughly ninety seconds in, like clockwork. A bug you can summon on demand is a bug you can kill.
The host is a fairly boring Debian box, one of the workhorses in the rack. It started panicking after I moved some storage replication onto the faster card. The first symptom was just the machine vanishing from the network: no logs, and nothing in journalctl after a reboot, because the box never got the chance to flush anything to disk.
getting the panic to talk
The first job was capturing the panic at all. A panic that takes the box down before it can write anything is useless to you. So I set up kdump, which reserves a slice of memory at boot and, when the kernel panics, boots a small second kernel out of that reserved memory, just long enough to dump the crashed kernel's memory to disk.
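# /etc/default/grub: reserve memory for the capture kernel
GRUB_CMDLINE_LINUX_DEFAULT="quiet crashkernel=256M"
# then, as root: install the tools, regenerate the grub config, reboot
apt install kdump-tools crash
update-grub
reboot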
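After the reboot it is worth confirming that the reservation actually took before relying on it. A minimal sketch of that check: `check_kdump` is a hypothetical helper name, and its two optional path arguments exist only so it can be tried against sample files. It looks for `crashkernel=` in /proc/cmdline and for a 1 in /sys/kernel/kexec_crash_loaded, which the kernel sets once a capture kernel has been loaded.

```shell
# check_kdump: hypothetical helper, not part of kdump-tools.
# Optional args (for trying it against sample files):
#   $1 kernel command line file, $2 crash-kernel-loaded flag file
check_kdump() {
    cmdline_file="${1:-/proc/cmdline}"
    loaded_file="${2:-/sys/kernel/kexec_crash_loaded}"

    # The reservation only exists if crashkernel= reached the booted kernel.
    if ! grep -q 'crashkernel=' "$cmdline_file" 2>/dev/null; then
        echo "no crashkernel= on the kernel command line"
        return 1
    fi

    # The kernel exposes 1 here once a capture kernel has been loaded.
    if [ "$(cat "$loaded_file" 2>/dev/null)" != "1" ]; then
        echo "crash kernel not loaded"
        return 1
    fi

    echo "kdump armed"
}
```

On Debian, `kdump-config show` reports the same state in more detail. For an end-to-end test, `echo c > /proc/sysrq-trigger` as root forces a real panic, so only try that on a box you can afford to crash.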
After that, the next panic left me a vmcore in /var/crash. Suddenly I had something to read instead of a photograph of a frozen console taken on my phone.
reading the corpse
# needs the kernel debug symbols (on Debian, the linux-image-$(uname -r)-dbg package)
crash /usr/lib/debug/boot/vmlinux-$(uname -r) /var/crash/*/dump.*
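One caveat with the glob above: after a second panic there is more than one dump under /var/crash, and crash expects a single dump file. A small sketch for picking the newest one, assuming the kdump-tools layout of one timestamp-named directory per crash with a dump.* file inside. `latest_dump` is a hypothetical helper name.

```shell
# latest_dump: hypothetical helper; prints the newest dump file under a crash directory.
# Assumes timestamp-named subdirectories, so sorted order is chronological order.
latest_dump() {
    crash_dir="${1:-/var/crash}"
    newest=""
    for f in "$crash_dir"/*/dump.*; do
        # An unmatched glob stays literal; skip it.
        [ -e "$f" ] || continue
        newest="$f"    # globs expand in sorted order, so the last match wins
    done
    [ -n "$newest" ] || return 1
    printf '%s\n' "$newest"
}

# Usage on the crashed host, once debug symbols are installed:
#   crash /usr/lib/debug/boot/vmlinux-$(uname -r) "$(latest_dump)"
```

This still assumes the running kernel is the one that crashed, since `$(uname -r)` picks the symbol file.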
The backtrace pointed straight into the NIC driver, in the path that handles receive interrupts under load. Not my code, not anything I could obviously fix, but at least an honest answer. The culprit was the combination of the driver version shipped with the distro kernel and the specific firmware on the card: a known bad pairing if you went looking, which of course I only found once I had the panic text to search for.
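Once the backtrace names a driver, the pairing is easy to pin down on the running system, since `ethtool -i` reports the driver name, driver version, and firmware version for an interface. A small sketch that boils that output down to one searchable line. `nic_fingerprint` is a hypothetical helper, and `eth0` in the usage comment is a placeholder interface name.

```shell
# nic_fingerprint: hypothetical helper; reads `ethtool -i` output on stdin
# and prints the driver, driver version, and firmware version on one line.
nic_fingerprint() {
    awk -F': ' '
        $1 == "driver"           { d = $2 }
        $1 == "version"          { v = $2 }
        $1 == "firmware-version" { f = $2 }
        END { printf "driver=%s version=%s firmware=%s\n", d, v, f }
    '
}

# Usage on the live box (interface name is an assumption):
#   ethtool -i eth0 | nic_fingerprint
```

That one line, plus the top frames of the backtrace, is usually enough to search a distro bug tracker with.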
The fix was unglamorous. I pinned a newer kernel from backports, which carried a patched version of the driver, and the panic simply stopped. Ninety seconds of traffic, then two minutes, then an hour, then a week. Silence.
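For anyone wanting to do the same, one way to pull a kernel from backports looks roughly like this. It is a sketch under assumptions: the backports suite is already listed in the apt sources, the machine is amd64, and the `linux-image-amd64` metapackage is the right one for it. `backports_kernel_cmd` is a hypothetical helper that only prints the command, so nothing gets installed by accident.

```shell
# backports_kernel_cmd: hypothetical helper; prints (does not run) an apt
# command that installs the kernel metapackage from this release's backports.
backports_kernel_cmd() {
    os_release="${1:-/etc/os-release}"
    # Read the release codename; the command substitution keeps the
    # os-release variables out of the calling shell.
    codename=$(. "$os_release" && printf '%s' "${VERSION_CODENAME:-}")
    [ -n "$codename" ] || return 1
    printf 'apt install -t %s-backports linux-image-amd64\n' "$codename"
}
```

Review the printed command, run it as root, reboot, and confirm with `uname -r` that the newer kernel is the one actually running before repeating the traffic test.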
what I took away
The lesson is not really about this NIC. It is that a reproducible fault is a gift, and the time you spend making a fault reproducible and observable is almost always cheaper than the time you spend guessing. Set up kdump before you need it. Future you, staring at a blank console at midnight, will be grateful.