Ramblings of an aging IT geek
linux

the panic that turned up on demand

A box that panicked every few days finally became reproducible once I matched the crashes to a backup job hammering an out-of-tree storage driver, turning an intermittent ghost into a five-minute repro and a clean fix.

A Linux terminal showing a kernel oops trace

A server had been panicking every few days, and an intermittent kernel panic is about the most demoralising bug there is. The box reboots, comes back, behaves perfectly, and leaves you with nothing but a watchdog reboot in the logs and a faint sense that the universe is mocking you. You cannot fix what you cannot make happen. So the entire job, before any fixing, was to make it happen on purpose.

The first real break was getting the panic itself, not just the silence after it. The box was rebooting before anything useful reached disk, so I set up kdump to capture a crash dump, and pointed the serial console at a logging host so a trace could escape even as the kernel died. That cost an afternoon and bought me the one thing I lacked: the actual oops, with a call stack.
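
For anyone wanting to do the same, here is a minimal sketch of the boot-time half of that setup on a GRUB-based system. The crashkernel size, the ttyS0 port and the 115200 baud rate are assumptions rather than the values from this box, and the kdump service itself is packaged and enabled differently on each distribution.

```shell
# /etc/default/grub (assumed GRUB 2 layout; all values are illustrative)
# crashkernel=256M         reserves memory for the capture kernel kdump boots into
# console=tty0             keeps kernel messages on the local display
# console=ttyS0,115200n8   also sends them out of the first serial port,
#                          where a logging host can record the oops
GRUB_CMDLINE_LINUX="crashkernel=256M console=tty0 console=ttyS0,115200n8"
```

After regenerating the GRUB configuration and rebooting, /sys/kernel/kexec_crash_loaded should read 1 once the kdump service has loaded the capture kernel.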

A rack server of the kind the dump came from

The stack pointed straight at an out-of-tree storage driver: not the mainline kernel, not our application, but a vendor module we had loaded for a particular HBA. That narrowed the search enormously. I was no longer staring at the whole system, only at one driver under one kind of pressure.
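
One way to double-check that a module really is out-of-tree is the kernel's taint bookkeeping: bit 12 of the taint mask (flag letter "O") is set when an externally built module has been loaded. The helpers below are a small sketch of that check, not something from the original investigation, and the function names are made up for illustration.

```shell
# has_oot_taint MASK: succeeds if bit 12 (decimal 4096) is set in MASK,
# e.g. the decimal value read from /proc/sys/kernel/tainted.
has_oot_taint() {
    [ $(( ($1 >> 12) & 1 )) -eq 1 ]
}

# module_is_oot FLAGS: succeeds if a module's taint flags, as read from
# /sys/module/<name>/taint, include the letter "O".
module_is_oot() {
    case $1 in
        *O*) return 0 ;;
        *)   return 1 ;;
    esac
}

# On a live system (the module name here is a placeholder):
#   has_oot_taint "$(cat /proc/sys/kernel/tainted)" && echo "out-of-tree code loaded"
#   module_is_oot "$(cat /sys/module/some_hba_driver/taint)" && echo "this module is out-of-tree"
```

The same "O" letter shows up in the "Tainted:" field of an oops header, so the trace itself usually carries the hint as well.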

Then came the part that actually solved it: correlating the timestamps. I lined the panic times up against cron, and they were not random at all. Every crash sat within the window of the nightly backup job, the one that does a deep read across every block of the array. The panic was rare only because it needed that driver under sustained heavy I/O, and the backup was the only thing that pushed it there. "Every few days" was really "whenever the backup happened to hit the right access pattern".
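
The correlation itself is simple enough to script. The sketch below assumes you already have the crash times as HH:MM strings, from the dump timestamps or the reboot records, and that you know the backup's start and end times. The function names and the sample times in the usage comment are invented for illustration, not taken from this server.

```shell
# to_minutes HH:MM: prints minutes since midnight.
to_minutes() (
    h=${1%%:*}; m=${1##*:}
    # Drop one leading zero so "08" and "09" are not read as invalid octal.
    h=${h#0}; m=${m#0}
    echo $(( ${h:-0} * 60 + ${m:-0} ))
)

# in_window TIME START END: succeeds if TIME falls in [START, END).
# A window whose START is later than its END is treated as wrapping midnight.
in_window() (
    t=$(to_minutes "$1"); s=$(to_minutes "$2"); e=$(to_minutes "$3")
    if [ "$s" -le "$e" ]; then
        [ "$t" -ge "$s" ] && [ "$t" -lt "$e" ]
    else
        [ "$t" -ge "$s" ] || [ "$t" -lt "$e" ]
    fi
)

# count_in_window START END TIME...: prints how many TIMEs fall in the window.
count_in_window() (
    start=$1; end=$2; shift 2
    n=0
    for t in "$@"; do
        if in_window "$t" "$start" "$end"; then n=$((n + 1)); fi
    done
    echo "$n"
)

# Usage with made-up times and a made-up 02:00 to 05:00 backup window:
#   count_in_window 02:00 05:00 02:41 03:17 14:30    # prints 2
```

If every crash lands inside the window, as it did here, the job running in that window is the first suspect.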

That turned a ghost into a five-minute repro. Instead of waiting for the backup, I ran a brute-force read against the array by hand:

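# /dev/sdX stands in for the array's device. libaio makes --iodepth effective; time_based holds the load for the full 300 s.
fio --name=thrash --filename=/dev/sdX --rw=randread \
    --ioengine=libaio --bs=4k --iodepth=64 --numjobs=8 \
    --runtime=300 --time_based --direct=1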

Within a couple of minutes, panic, same stack, on demand. There is a particular relief in deliberately crashing a machine you have been chasing for a fortnight. It means you have understood it.

With a reliable repro the fix was almost an afterthought. The vendor had a newer build of the module that named exactly this fault, a race under high queue depth, in its changelog. I updated the driver, ran the same fio thrash that had crashed it every time, and it held. I left the load running for an hour to be sure, then put the backup back to normal.

The lesson is the one I keep paying for: an intermittent bug is usually a deterministic bug whose trigger you have not found yet. The work was never about the fix, which was someone else's one-line patch for a race. It was about turning "sometimes, mysteriously" into "every time I do this", and the route there was capturing the crash properly and then asking what was running when it happened.