The worst bugs aren't the loud ones. They're the ones that happen once a fortnight, take a box down hard, and leave you nothing but a vague "it rebooted" in the monitoring. We had one of those on a handful of machines for about six weeks. The breakthrough wasn't fixing it. The breakthrough was making it happen on demand.
the symptom
The pattern, such as it was: a host would drop off the network, the watchdog would eventually kick it, and it would come back with no clean shutdown logged. Uptime reset, no application errors, nothing in the app logs because the app never got a chance to complain. Just gone, then back.
The first job was getting any evidence at all. A panicking kernel can't write to its own disk reliably, so the on-disk logs were useless: the box was already on fire by the time it would have flushed anything. Two things changed that.
First, kdump. If you can get a crash kernel to boot after the panic, it'll dump a vmcore you can actually read later. Make sure it's enabled and that you've reserved memory for the crash kernel:
# is the crash kernel reserved at boot?
grep -o 'crashkernel=[^ ]*' /proc/cmdline
# is kdump actually armed?
systemctl status kdump
kdumpctl status
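If you'd rather have a single yes/no answer to script against, the same checks can be read straight out of sysfs. This is a minimal sketch, not what we ran at the time: the function name kdump_ready is made up for illustration, and it assumes the kernel exposes kexec_crash_loaded and kexec_crash_size under /sys/kernel. The directory is a parameter only so the function can be pointed at a copy.

```shell
#!/bin/sh
# Sketch: report whether a crash kernel is loaded and memory is reserved for it.
# kdump_ready is a hypothetical helper name; $1 optionally overrides the sysfs dir.
kdump_ready() {
    sysdir="${1:-/sys/kernel}"
    # Fall back to 0 if either file is missing or unreadable.
    loaded=$(cat "$sysdir/kexec_crash_loaded" 2>/dev/null || echo 0)
    size=$(cat "$sysdir/kexec_crash_size" 2>/dev/null || echo 0)
    if [ "$loaded" = "1" ] && [ "$size" -gt 0 ]; then
        echo "kdump armed: crash kernel loaded, $size bytes reserved"
        return 0
    fi
    echo "kdump NOT armed: loaded=$loaded reserved_bytes=$size"
    return 1
}
```

Calling kdump_ready with no arguments checks the running system and returns non-zero when the crash kernel isn't loaded, which makes it easy to wire into a fleet-wide check.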
Second, and this is the one that saved me, netconsole. If the machine can still get a packet out in the moment before it dies, you can stream the kernel log to another host over UDP. It's crude and it works:
# SRC_PORT, SRC_IP and LISTENER_IP are placeholders for your own values
modprobe netconsole \
netconsole=SRC_PORT@SRC_IP/eth0,6666@LISTENER_IP/00:11:22:33:44:55
On the listener I just sat with nc -u -l 6666 writing to a file. Now the dying gasps of the kernel went somewhere that wasn't the dying machine.
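The netconsole parameter is easy to get wrong by hand, so here is a small sketch that assembles it from its parts. The layout is source port, source IP and device, then target port, target IP and target MAC. The helper name netconsole_param is hypothetical, and the addresses in the usage comment are from the documentation range, not the real hosts.

```shell
#!/bin/sh
# Sketch: build a netconsole module parameter from its six parts.
# Layout: netconsole=<src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
# netconsole_param is a hypothetical helper name.
netconsole_param() {
    # $1 src port, $2 src ip, $3 device, $4 dst port, $5 dst ip, $6 dst mac
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Usage with placeholder addresses (192.0.2.0/24 is reserved for documentation):
# modprobe netconsole "$(netconsole_param 6665 192.0.2.10 eth0 6666 192.0.2.20 00:11:22:33:44:55)"
```

The target MAC should be the listener's if it sits on the same segment, or the gateway's if it doesn't.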
making it reproducible
Catching the trace once would have been progress. The real win was noticing the correlation. Every crash I had a rough timestamp for lined up with a burst of network throughput, specifically large transfers hammering one particular NIC. That gave me something to poke at deliberately instead of waiting around.
So I stopped being patient. On a drained host I pointed iperf3 at it and ran sustained multi-stream transfers, then layered on some ethtool ring-buffer fiddling to stress the driver's allocation paths. The first crash took maybe forty minutes to provoke; once I'd found the right shape of load, it took under ten. A bug you can trigger in ten minutes is a bug you can fix.
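For a sense of what that load looked like, here is a rough sketch rather than the exact script. The function names, stream count, durations and ring sizes are all assumptions for illustration. It also assumes the drained host is running iperf3 -s and that the NIC driver accepts RX ring resizing through ethtool -G. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the load. All names and numbers here are illustrative assumptions.
# With DRY_RUN=1 the commands are printed rather than executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run on the load generator: sustained parallel streams toward the drained host,
# which must already be running `iperf3 -s`.
blast() {
    target="$1"; streams="${2:-8}"; seconds="${3:-600}"
    run iperf3 -c "$target" -P "$streams" -t "$seconds"
}

# Run on the drained host while traffic flows: alternate the RX ring size so
# the driver has to set its receive rings up again under load.
churn_rings() {
    dev="$1"; small="$2"; large="$3"; rounds="${4:-10}"
    i=0
    while [ "$i" -lt "$rounds" ]; do
        run ethtool -G "$dev" rx "$small"
        run sleep 5
        run ethtool -G "$dev" rx "$large"
        run sleep 5
        i=$((i + 1))
    done
}
```

Check ethtool -g on the device first: the sizes passed to churn_rings have to sit within the maximums the hardware reports.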
the trace
With netconsole capturing, the panic finally showed its face. The call stack walked through the NIC driver's receive path into a null pointer dereference, the sort of thing that happens when a buffer gets freed while something still holds a reference to it. The exact frames pointed at the driver, not at our code, which was both a relief and an annoyance: relief because it wasn't something I'd written, annoyance because driver bugs are someone else's release schedule.
A quick search of the kernel changelogs turned up a fix in a later point release for that exact driver, describing a use-after-free under heavy receive load. That matched the trace and matched the trigger. I didn't have to understand every line of the patch to be confident it was our bug: the description read like a transcript of my last six weeks.
the fix, and the lesson
The fix was unglamorous: move those hosts onto a kernel that carried the driver fix, then confirm by re-running the exact load that used to kill them. They survived a sustained soak that would previously have taken them down in minutes. Done.
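Before the soak, it's worth a quick sanity check that a host really booted the kernel you meant it to. A minimal sketch, assuming GNU sort with -V is available: version_at_least is a hypothetical helper, and the FIXED_IN value in the usage comment is a placeholder, not the actual release. Distro kernels often backport fixes without changing the upstream version number, so treat this as a rough check only.

```shell
#!/bin/sh
# Sketch: succeed if version $1 is at least version $2 (needs GNU sort -V).
# version_at_least is a hypothetical helper name.
version_at_least() {
    have="${1%%-*}"   # drop any suffix after the first dash, e.g. "-generic"
    want="$2"
    # If $want sorts first (or the two are equal), $have is new enough.
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Usage (FIXED_IN is a placeholder; take the real value from the changelog entry):
# FIXED_IN=5.15.40
# version_at_least "$(uname -r)" "$FIXED_IN" && echo "running kernel is new enough"
```

The re-run of the load is still the real proof. The version check just stops you soaking a host that never picked up the new kernel.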
The thing I'll keep from this isn't the driver bug. It's that an intermittent crash stays unsolvable right up until the moment you can summon it. Everything before that is guessing. Once you can make it happen whenever you like, reading the trace and finding the fix is almost mechanical. So when something flakes rarely and badly, the first hour is best spent not on theories but on instrumentation and a reliable trigger. Get the bug to come to you on a lead, and the rest follows.