Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

The Bug That Only Existed When I Was Not Looking

A crash that reliably reproduced in production and vanished the instant I attached strace, and the race condition hiding behind the observer effect.

A bug on a terminal

The worst bugs are not the ones that crash every time. Those are easy, in the sense that a thing you can reproduce is a thing you can eventually kill. The worst bugs are the ones that crash reliably until the precise moment you start watching them, at which point they behave impeccably and make you doubt your own sanity. I had one of those this week, and it taught me, again, the lesson I keep needing to relearn: when observing the bug makes it disappear, the act of observing is the clue, not the obstacle.

The setup was ordinary. A service that processed a queue of jobs, mostly fine, occasionally falling over with a segfault under load. Not often, maybe one job in a few thousand, but reliably enough that it showed up in the logs every hour or so and reliably enough to be unacceptable. So I did the obvious thing. I attached strace to watch the syscalls leading up to the crash.

And it stopped crashing. Completely. I ran it under strace for an hour, then two, pushing the same load that had been killing it every few minutes, and it was perfectly, infuriatingly healthy. The moment I detached strace, the crashes came back within minutes. The bug only existed when nobody was looking at it.

The observer effect is a timing tell

This is the textbook signature of a race condition, and the reason is mechanical rather than mystical. strace works by interposing on every system call, and that interposition is not free. It slows the traced process down, and crucially it slows it down unevenly, adding latency around syscalls specifically. If your bug depends on two things happening in a particular order within a tiny window, anything that changes the timing of that window can hide it. Strace does not fix the race. It just shifts the timing enough that the unlucky ordering stops happening.

So the disappearance was not noise to be worked around. It was the answer wearing a disguise. The bug was timing-dependent, which meant it was almost certainly a race over some shared state, and the segfault was the second-order consequence of that race rather than the bug itself.

Lines of code on a screen

Watching without disturbing

The trick with a Heisenbug is to observe it with as little perturbation as possible, ideally with something that does not sit in the hot path and warp the timing the way strace does. So I stopped trying to trace the syscalls and reached for tools that watch from the side.

First, the cheap version: keep the load running, take core dumps when it falls over, and read the corpses rather than trying to catch it in the act.

ulimit -c unlimited
echo '/tmp/cores/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern

A handful of cores later, the backtraces all pointed at the same place: a structure being read on one thread while another thread was halfway through freeing it. A use-after-free, which is exactly the kind of thing that segfaults sometimes, depending on whether the freed memory has been reused yet. That "sometimes" is the one-in-a-few-thousand. That dependence on what else the allocator is doing is why it is so timing-sensitive, and why a tool that nudges the timing makes it vanish.

To confirm without strace's heavy hand, I reached for a sampling profiler and for the sanitiser build, which is the proper tool for this and which I should have started with. A build with the thread sanitiser enabled does not change the timing the way strace does, or rather it changes it differently, and more usefully it tells you about the data race directly rather than waiting for the race to corrupt something badly enough to crash.

# rebuild with the race detector and run the reproducer
go build -race ./...
./service < reproducer-load

It flagged the race on the first run. Two goroutines, one reading a map while another deleted from it, no lock between them, with a comment three lines up cheerfully asserting that this path was single-threaded. It had been single-threaded, once, before a refactor six months ago quietly made it not. The comment was true when it was written and a lie ever since, which is the natural lifecycle of comments.

What I took from it

The fix was boring and correct: a mutex around the shared map, or rather the realisation that the map should never have been shared like that in the first place, and a small restructure so the ownership was clear. Bugs like this are rarely fixed by adding a lock and walking away. The lock makes the symptom stop, but the real fix is making the sharing impossible to do by accident, so the next refactor cannot reintroduce it.

There is a separate small lesson buried in here about the comment, which I want to dwell on because it is the part that actually caused the bug. The code was correct when it was written. A single-threaded path with no lock is perfectly fine, and the comment documenting that assumption was helpful and true. The failure was that nothing enforced the assumption. Six months later a refactor introduced a second caller from a different goroutine, the person doing it never saw the comment three lines up, and the compiler had no way to object because there was nothing to object to. The assumption lived in prose, and prose does not fail the build.

That is the real shape of a lot of these bugs. They are not introduced by someone doing something obviously wrong. They are introduced by someone doing something reasonable in a place where an unstated, unenforced assumption used to hold. The fix that lasts is the one that turns the assumption into something the machine checks: a type that cannot be shared, an ownership model the compiler understands, a lint that fails when the map is touched from two places. A comment is a wish. A constraint is a guarantee, and only guarantees survive contact with a busy team six months from now.

The bigger takeaway is about debugging temperament. When a bug runs away from your tools, that is information. A crash that hides from strace is telling you it lives in the timing, and timing bugs are races until proven otherwise. The instinct to add more, heavier instrumentation is exactly wrong, because the instrumentation is what is hiding it. You want lighter observation, post-mortem evidence, and a sanitiser that reasons about the race directly rather than hoping to catch it red-handed.

I lost the better part of an afternoon to a bug that was, in the end, a missing lock and an outdated comment. But I came away with the rule reinforced, which is worth the afternoon. If watching it makes it vanish, watching it is the clue. Stop fighting the disappearance and start asking what your observation is changing, because that, almost always, is the thing the bug depends on.