the rare gift of a kernel panic you can reproduce

Most kernel panics I have met were ghosts. The machine is up for weeks, then one morning it is wedged with a screen full of hex you can no longer scroll back through, and by the time you have power-cycled it the evidence is gone. You note the date, mutter something about cosmic rays or dodgy RAM, and move on, because you cannot debug what you cannot make happen again.

So when one of my homelab boxes started panicking and I found I could reproduce it on demand, I was almost pleased. A reproducible panic is a solvable panic. You stop being an archaeologist and become an experimenter, and that is a much nicer job.

the symptom

The box is a small server that does a few things, but the relevant one is a weekly backup that pushes a few hundred gigabytes over the network to another machine. The panic happened during the backup, not every time, but often enough that running the backup twice in a row would usually trigger it. That "usually" is the whole game. An intermittent fault you can provoke in five minutes is worth a hundred faults you wait a fortnight for.

The first job was not to fix anything. It was to capture the panic properly, because a photograph of a TV showing the last twenty lines is not a bug report, it is a souvenir.

capturing it properly

The problem with a panic is that the system that would normally write your logs is the system that has just died. The standard answer is to get the dying kernel's last words off the box before it goes. I had two routes and used both.

The first is netconsole, which makes the kernel spit its console output over UDP to another machine. You load the module pointed at a listener and from then on every kernel message, including the panic, lands in a log on a second box that is still alive to receive it.

modprobe netconsole netconsole=<src-port>@<src-ip>/eth0,6666@<listener-ip>/00:11:22:33:44:55

On the other end I just sat there with nc -u -l 6666 writing to a file. The next time I ran the backup, the full panic, including the call trace that scrolls off the screen on real hardware, arrived intact in a text file I could actually read, search, and paste.
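
As a rough sketch of how that parameter fits together: it is the sending port and address, the outgoing interface, then the listener's port, address, and MAC. The helper name netconsole_param is made up for illustration, and the addresses are placeholders from the documentation range rather than anything from this network. The script only prints the two commands; it does not run them.

```shell
#!/bin/sh
# Build the netconsole= parameter from its parts.
# Shape: <src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
# netconsole_param is a hypothetical helper name; every address below is a
# placeholder (192.0.2.0/24 is reserved for documentation), not a real value.
netconsole_param() {
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# On the box that panics (printed, not executed; needs root when run for real):
echo "modprobe netconsole $(netconsole_param 6665 192.0.2.10 eth0 6666 192.0.2.20 00:11:22:33:44:55)"

# On the listener: watch the output live and keep a copy on disk.
# Some netcat variants want "nc -u -l -p 6666" instead.
echo "nc -u -l 6666 | tee -a netconsole.log"
```

If ordinary kernel messages never reach the listener, the console log level may be filtering them out; raising it with dmesg -n 8 on the sending box is the usual first thing to try.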

The second route, which I set up at the same time, was kdump. It reserves a slice of memory at boot, and when the kernel panics it boots a tiny second kernel out of that reserved region whose only job is to dump the memory of the dead one to disk. That gives you a vmcore you can open later with crash, to poke at the actual kernel data structures as they were at the moment of death. It is heavier to set up, but when the call trace alone is not enough it is the difference between a guess and an answer.
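
A quick way to tell whether kdump is actually armed is to look for the memory reservation on the kernel command line and for a loaded crash kernel. This is a minimal sketch that assumes a Linux box exposing /proc/cmdline and /sys/kernel/kexec_crash_loaded. The helper name crashkernel_arg is made up here, and how the reservation gets configured in the first place depends on the distribution.

```shell
#!/bin/sh
# Pull the crashkernel= argument out of a kernel command line string.
# crashkernel_arg is a hypothetical helper name for this sketch.
crashkernel_arg() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep '^crashkernel=' | head -n 1
}

# Was memory reserved for the crash kernel at boot?
if [ -r /proc/cmdline ]; then
    arg=$(crashkernel_arg "$(cat /proc/cmdline)")
    echo "reservation: ${arg:-none found on the kernel command line}"
fi

# 1 means a crash kernel is loaded and will take over on panic, 0 means not.
if [ -r /sys/kernel/kexec_crash_loaded ]; then
    echo "crash kernel loaded: $(cat /sys/kernel/kexec_crash_loaded)"
fi
```

Both checks need to pass before the next panic, because a reservation with no crash kernel loaded produces no vmcore.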

reading the trace

With the panic captured as text, the top of the call trace pointed firmly into the network driver and its receive path, not into anything I had written, and not into the backup software. The oops itself was a null pointer dereference deep in the driver's interrupt handling, the sort of thing that only surfaces under sustained high throughput, which is exactly what a few hundred gigabytes of backup provides and ordinary day-to-day traffic never does.
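
Once the panic is a text file, the faulting function can be pulled out mechanically. The sketch below assumes the x86-64 oops layout, where the instruction pointer line reads RIP: 0010:function+0xoffset/0xsize, followed by the module name in square brackets when the code lives in a module. The helper names oops_symbol and oops_module are made up for illustration, and other architectures print this line differently.

```shell
#!/bin/sh
# Print the first faulting symbol (function+offset/size) in a captured log.
# Assumes the x86-64 oops format, where the line looks like
#   RIP: 0010:<function>+0x<offset>/0x<size> [<module>]
# oops_symbol and oops_module are hypothetical helper names for this sketch.
oops_symbol() {
    sed -n 's/.*RIP: [0-9a-f]\{4\}:\([^ ]*\).*/\1/p' "$1" | head -n 1
}

# Print the module named in brackets on that same line, if there is one.
oops_module() {
    sed -n 's/.*RIP: [0-9a-f]\{4\}:[^ ]* \[\([^]]*\)].*/\1/p' "$1" | head -n 1
}

# Usage: sh oops.sh netconsole.log
if [ -n "$1" ]; then
    echo "symbol: $(oops_symbol "$1")"
    echo "module: $(oops_module "$1")"
fi
```

Only the first match is used, since a log can contain later RIP lines that do not describe the original fault.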

That reframed the whole thing. This was not my configuration, not a userspace bug, and not failing hardware in the dramatic sense. It was a driver bug for this particular network chipset that you only hit when you push the card hard for a sustained period. The backup was not the cause, it was simply the only workload on the machine that ever generated enough traffic to find the bug.

confirming the cause

The honest move at this point is to confirm, not assume. A call trace tells you where the kernel was standing when it fell over, not always why it fell. So I did two things.

First, I generated the same sustained load deliberately, without the backup software in the picture at all, just iperf3 hammering the link for a few minutes. It panicked the same way. That ruled out the backup tool entirely and confirmed the trigger was raw throughput through that network card.
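
For anyone wanting to try the same kind of test, a sustained iperf3 run looks roughly like this. The duration, stream count, and address are placeholders, not the values used here, and load_cmd is a made-up helper that only prints the command. The reverse option is included on the assumption that you want to stress the receive side of the box under test as well as the send side.

```shell
#!/bin/sh
# Build an iperf3 client command for a sustained run.
# $1 = server address, $2 = duration in seconds, $3 = parallel streams,
# $4 = "reverse" to make the server send, which loads this machine's receive side.
# load_cmd is a hypothetical helper name; it prints the command, it does not run it.
load_cmd() {
    cmd="iperf3 -c $1 -t $2 -P $3"
    [ "$4" = "reverse" ] && cmd="$cmd -R"
    printf '%s\n' "$cmd"
}

# The other machine runs the server side first:  iperf3 -s
# Then, on the box under test, five minutes in each direction
# (192.0.2.20 is a placeholder address):
load_cmd 192.0.2.20 300 4
load_cmd 192.0.2.20 300 4 reverse
```

Keeping netconsole attached while this runs means a panic during the test gets captured the same way as one during the backup.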

Second, I checked the obvious upstream fix before doing anything clever. The kernel on this box was a little behind, and a newer point release carried changes to that exact driver. I would rather take a fix that someone has already written and tested than start patching a network driver myself on a Friday.
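
Checking how far behind a kernel is comes down to comparing the running version against a candidate release. A minimal sketch, assuming GNU sort with version ordering is available: version_lt is a made-up helper name, and the candidate version is whatever you pass on the command line rather than anything looked up automatically.

```shell
#!/bin/sh
# True when version $1 sorts strictly before version $2.
# Relies on "sort -V" (version sort), which GNU coreutils provides.
# version_lt is a hypothetical helper name for this sketch.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

running=$(uname -r | cut -d- -f1)   # drop any distro suffix after the first dash
candidate=${1:-$running}            # newer release to compare against, if given
if version_lt "$running" "$candidate"; then
    echo "running $running, $candidate is newer: read its changelog for the driver"
else
    echo "running $running, no newer candidate given"
fi

# To see which driver the interface uses (eth0 as in the netconsole command):
#   ethtool -i eth0
```

The driver name from ethtool is what to search for in the changelog of the newer release.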

the fix, and the lesson

I updated to a newer stable kernel, kept iperf3 running for a solid quarter of an hour, then ran the backup twice back to back, the sequence that had reliably killed it before. It stayed up. I left the load running far longer than I needed to, because a panic that takes a fortnight to reappear is one you will convince yourself you have fixed when you have not.

A few things I will keep from this one. Set up netconsole and kdump before you need them, not during the outage; reserving the memory and configuring the listener while the box is healthy takes ten minutes and turns a future ghost into a readable text file. Find a deterministic trigger before you change anything, because every fix you try is worthless if you cannot tell whether it worked. And when the trace lands in the kernel's own code rather than yours, check whether a newer release has already fixed it before you reach for the source. Most of the time, with hardware this common, somebody got there first.

The genuinely satisfying part was none of the debugging. It was running the backup twice, watching the box stay up, and knowing for certain rather than hoping. That certainty is the entire reason a reproducible panic is a gift.