A kernel panic I could actually reproduce

A Linux terminal showing a stack trace

The best thing a misbehaving machine can do is misbehave on demand. For about three weeks one of my boxes had been rebooting itself at intervals that felt designed to make me look foolish: never while I was watching, always overnight, always with nothing in the logs because the journal hadn't flushed before the lights went out. An intermittent fault you can't reproduce is just a rumour.

The first useful move was to stop trusting that the logs would survive the crash. I set kernel.panic=10 so the box would wait ten seconds after a panic before rebooting, and more importantly I got it logging the panic somewhere that wasn't the dying disk. A serial console into a second machine is the old reliable answer, and netconsole is the lazy modern one. I went with netconsole because I had a spare Pi sat there doing nothing useful.

modprobe netconsole netconsole=<src-port>@<src-ip>/eth0,<tgt-port>@<tgt-ip>/
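For anyone setting this up from scratch, here is a rough sketch of both steps. Treat it as an illustration under assumptions: the addresses, ports, drop-in file name and helper function names are placeholders invented for the example, not values from this network. The parameter layout follows the kernel's netconsole documentation.

```shell
#!/bin/sh
# Sketch only: every address, port and file name below is a made-up placeholder.

# Write a sysctl drop-in so the kernel waits SECONDS after a panic before
# rebooting (0 would mean "never reboot automatically").
write_panic_dropin() {
    dir="$1"       # normally /etc/sysctl.d
    seconds="$2"
    printf 'kernel.panic = %s\n' "$seconds" > "$dir/90-panic.conf"
}

# Build the netconsole module parameter:
#   netconsole=<src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
# The target MAC (6th argument) may be left empty; the kernel then broadcasts.
netconsole_param() {
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "${6:-}"
}

# Print the command rather than running it, so the sketch is safe to execute.
echo "modprobe netconsole $(netconsole_param 6665 192.0.2.10 eth0 6666 192.0.2.20)"
```

On the receiving machine, something like `nc -u -l 6666` should print whatever arrives, though some netcat variants want `nc -u -l -p 6666` instead. As root on the sender, `write_panic_dropin /etc/sysctl.d 10` followed by `sysctl --system` would make the panic delay survive a reboot.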

Suddenly I had the actual oops on another machine's screen instead of an empty journal and a sense of grievance. The trace pointed at the network driver, which was a surprise, because I'd been quietly blaming the RAM the whole time.

A close-up of a server's internals

Armed with a suspect, I went looking for the trigger. The reboots correlated with overnight backups, which hammer the NIC harder than anything else this box does. So I stopped waiting for nature and forced the issue: a tight loop of iperf3 saturating the link. The machine fell over in under four minutes. That was the moment the whole thing turned from a haunting into an engineering problem. Four minutes to reproduce means four minutes to test a fix.
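A loop like that needn't be clever. The sketch below is an illustration rather than the exact script: it assumes the load is driven from a second machine while the box under test runs `iperf3 -s`, and the function name, target address and run length are all hypothetical.

```shell
#!/bin/sh
# run_until_failure MAX CMD [ARGS...]
# Run CMD up to MAX times and stop at the first non-zero exit.
# Prints "failed on run N" or "survived N runs".
run_until_failure() {
    max="$1"; shift
    i=0
    while [ "$i" -lt "$max" ]; do
        i=$((i + 1))
        if ! "$@" >/dev/null 2>&1; then
            echo "failed on run $i"
            return 0
        fi
    done
    echo "survived $i runs"
    return 1
}

# Example (hypothetical target address): sixty-second bursts with four
# parallel streams, repeated until the server side stops answering.
#   run_until_failure 100 iperf3 -c 192.0.2.10 -t 60 -P 4
```

Driving the load from the other end of the link means the loop outlives the crash, so the run number gives a rough time-to-failure to compare against once a fix is in.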

From there it was ordinary, satisfying work. I memtested the RAM anyway, because you always feel a fool if you skip it and it turns out to be the RAM, and it was clean. The driver version shipped with the kernel had a known issue under load. A newer kernel from backports, a reboot, and then the real test: the same iperf3 loop that used to kill it in four minutes, left running for an hour. It held.
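The hour-long check is the mirror image of a reproducer: instead of stopping at the first failure, it passes only if nothing fails before a deadline. A minimal sketch, again assuming the load comes from a second machine and using a made-up function name and address:

```shell
#!/bin/sh
# soak SECONDS CMD [ARGS...]
# Re-run CMD until SECONDS have elapsed. Prints "held" if every run exited
# zero, or "failed" at the first non-zero exit. The final run may overshoot
# the deadline slightly, since the clock is only checked between runs.
soak() {
    duration="$1"; shift
    deadline=$(( $(date +%s) + duration ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
        if ! "$@" >/dev/null 2>&1; then
            echo "failed"
            return 1
        fi
    done
    echo "held"
}

# Example (hypothetical target address): an hour of back-to-back 60 s runs.
#   soak 3600 iperf3 -c 192.0.2.10 -t 60
```

A clean exit from the client only shows that the link stayed up, so it is still worth watching the netconsole listener for anything the kernel complained about along the way.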

I let the overnight backups run for a week before I called it fixed, because a reproducer that no longer fires tells you that one fault is gone, not that the machine is well. But the lesson I keep relearning is worth writing down. When something fails at random, the entire job is to make it stop being random. Capture the evidence somewhere it can't die with the box, find the variable that correlates, then turn that variable up until the failure happens on your schedule instead of its own. A panic you can summon is a panic that's already mostly solved.