Ramblings of an aging IT geek

the rare gift of a kernel panic that happened on demand

Tracking a reproducible kernel panic on a Linux box to a specific NIC driver path, captured over netconsole and pinned down with bisection.

[Image: A terminal showing a Linux kernel oops trace]

Most kernel panics are a gift you cannot open. The machine locks, the screen freezes mid-stack-trace, and by the time you have walked over to it the evidence is gone and it has rebooted itself out of guilt. You are left with a vague feeling and an uptime counter reset to zero. So when this one turned out to be reproducible, on demand, every single time, I was almost happy about it.

The symptom was a box that fell over hard whenever a particular backup job kicked off over the network. Not a soft hang. A full panic, the kind that takes the whole machine with it. And because backups run at 02:00, I had been finding it dead each morning with nothing to show for the night.

Catching the trace

The first job was not fixing it; it was seeing it. A panic that vanishes is useless. Two things made the difference.

First, netconsole. You point the dying kernel's log output at another machine over UDP, and it keeps shouting right up until the moment it dies. Loaded on the target:

modprobe netconsole netconsole=@/eth0,6666@<laptop-ip>/

and a plain nc -u -l 6666 on my laptop catches every line the box emits, including the panic, because netconsole sends each message straight from the kernel as it is logged, with no dependence on userspace or the local disk.
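The parameter string is terse, so here is a small sketch that spells out its layout and keeps a copy of the capture on disk. The function names are made up for illustration, and the address in the usage line is a documentation placeholder, not a real host.

```shell
# Build the value for netconsole= from its parts. The layout is
#   [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
# and any field left empty falls back to the module's default.
netconsole_param() {
    local dev tgt_ip tgt_port
    dev=$1
    tgt_ip=$2
    tgt_port=${3:-6666}
    printf '@/%s,%s@%s/\n' "$dev" "$tgt_port" "$tgt_ip"
}

# On the receiving machine: show the log live and keep a timestamped copy.
# Assumes a netcat that accepts "-u -l PORT"; some variants want "-l -p PORT".
capture_netconsole() {
    local port
    port=${1:-6666}
    nc -u -l "$port" | tee "netconsole-$(date +%Y%m%d-%H%M%S).log"
}
```

On the target that would be used as modprobe netconsole netconsole="$(netconsole_param eth0 192.0.2.10)", with capture_netconsole already running on the receiver. Writing to a file matters more than it sounds: the scrollback is the first thing you lose when you reach for the terminal in a hurry.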

Second, I made it happen on purpose. If a thing only breaks at 02:00 you are debugging blind. I worked out that the trigger was sustained throughput on the NIC, so I stopped waiting for the backup and ran iperf against it until it fell over. Down from once a night to once a minute. That changes everything.
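A minimal sketch of such a reproduction loop might look like the following. It assumes an iperf server is already listening on the target and that a single ping is a good enough liveness check; the helper names and the address in the usage line are placeholders.

```shell
# Run a command repeatedly until it fails or MAX rounds have passed,
# then print how many rounds completed cleanly.
run_until_failure() {
    local max n
    max=$1; shift
    n=0
    while [ "$n" -lt "$max" ]; do
        "$@" >/dev/null 2>&1 || break
        n=$((n + 1))
    done
    echo "$n"
}

# One round: push traffic at the target for SECS seconds, then check
# that it still answers. The round's result is the ping's exit status.
# Assumes "iperf -s" is running on the target; -W is the iputils timeout flag.
load_round() {
    local target secs
    target=$1
    secs=${2:-60}
    iperf -c "$target" -t "$secs"
    ping -c 1 -W 2 "$target"
}
```

Run from the laptop as run_until_failure 100 load_round 192.0.2.20 60, it prints how many rounds the box survived. A number that is consistently small is what tells you the trigger is right.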

[Image: A rack-mounted server with a network cable highlighted]

Reading the oops

The trace, once I had it on screen, pointed at the network driver. A NULL pointer dereference deep in the receive path of the card's driver, the same handful of stack frames every time. That consistency is the tell: a genuine hardware fault tends to wander, a software bug lands in the same place. This landed in the same place.

The call trace ran through the driver's NAPI poll routine into something that was clearly walking a structure that had already been freed. Use-after-free in the kernel does not politely return an error; it corrupts whatever happens to be sitting there now, and then everything falls down.
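The "same place every time" check does not have to be done by eye. One way, sketched here, is to strip each captured oops down to its symbol names and compare the results. This assumes each capture file holds a single oops and that frames appear in the usual symbol+0xoffset/0xsize form; the function names are hypothetical.

```shell
# Print the symbol names found in a captured oops, one per line, in order.
# Matches tokens such as some_function+0x1a/0x200 and drops the offsets.
trace_symbols() {
    grep -oE '[A-Za-z_][A-Za-z0-9_.]*\+0x[0-9a-f]+/0x[0-9a-f]+' "$1" |
        sed 's/+0x.*//'
}

# Succeed if two captures walk through the same symbols in the same order.
same_trace() {
    [ "$(trace_symbols "$1")" = "$(trace_symbols "$2")" ]
}
```

Comparing names rather than raw lines ignores the timestamps and addresses that differ between boots, which is the noise that makes two identical crashes look different at a glance.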

Bisecting the cause

I had recently moved this box to a newer kernel from the distribution's backports, chasing an unrelated fix. So I had a known-good and a known-bad, which is the ideal starting position for a bisection, the same binary search that git bisect automates, even when you are stepping through packaged kernels rather than building your own. Install a kernel, boot it, hammer it with iperf, mark good or bad, repeat.
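The bookkeeping for that loop is a plain binary search, and a sketch of it fits in a few lines. It assumes the versions are listed oldest to newest, that the first is known good and the last known bad, and that the bug stays once it appears. The function names are invented for illustration, and the script is meant to run on another machine, since the box under test reboots at every step.

```shell
# Print the Nth of the remaining arguments (1-based).
nth() {
    local n
    n=$1; shift
    shift $((n - 1))
    printf '%s\n' "$1"
}

# bisect_first_bad TEST_CMD VERSION...
# TEST_CMD is called with one version and must succeed if that version is good.
# Prints the first bad version.
bisect_first_bad() {
    local test_cmd lo hi mid candidate
    test_cmd=$1; shift
    lo=1          # index of the newest version known to be good
    hi=$#         # index of the oldest version known to be bad
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        candidate=$(nth "$mid" "$@")
        if "$test_cmd" "$candidate"; then
            lo=$mid    # still good: the bug arrived later
        else
            hi=$mid    # already bad: the bug arrived here or earlier
        fi
    done
    nth "$hi" "$@"
}

# A TEST_CMD that simply asks the person doing the reboots.
ask_operator() {
    local answer
    printf 'Boot kernel %s and run the load test. Did it survive? [y/n] ' "$1" >&2
    read -r answer
    [ "$answer" = "y" ]
}
```

Called as bisect_first_bad ask_operator followed by the list of versions, it asks about roughly log2 of the list length, which is why sixteen candidate kernels cost about four reboots rather than sixteen.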

A handful of reboots later it landed on a change in that specific driver between the two versions. The bug was real and already known upstream: a race in the receive path under heavy load that had been introduced and then fixed a few revisions later. I was sitting on the unlucky version in between.

The fix, and the lesson

The actual fix was almost an anticlimax. Pin the NIC to a known-good driver version, or move forward to the kernel where the race was already patched. I chose to move forward, hammered it with iperf for an hour to be sure, and the backup job ran clean that night for the first time in a fortnight.

What I took away was not really about that driver. It was about the value of reproducibility itself. A bug you can summon on command is most of the way to solved, because every technique we have (bisection, instrumentation, narrowing the inputs) depends on being able to ask the question again and get the same answer. The hard bugs are not the dramatic ones. They are the ones that happen once a week and never when you are watching. Spend your effort on making the failure happen reliably, and the rest tends to follow. A panic you can reproduce is, genuinely, the easy kind.