Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the bug that only existed when nobody was watching

A flaky service crash that disappeared the moment I ran it under strace, and what the timing actually told me once I stopped fighting it.

A terminal full of a stack trace

The point first, because it took me half a day to get there: the bug was a race, and strace "fixed" it by being slow enough to let the loser win. Classic heisenbug. The act of observing it changed the timing enough that the thing I was trying to catch stopped happening.

The service in question was a small Go daemon that fanned out work to a handful of goroutines and wrote results to a file. Maybe one run in thirty it would panic with a nil pointer deref deep in the write path. One in thirty is the worst possible failure rate. Too rare to reproduce on demand, too common to ignore, and just frequent enough to wake something up at three in the morning.

the part where observing changes the result

My first instinct was the right tool for the wrong reason. I wrapped it in strace -f to watch the syscalls around the write:

strace -f -e trace=write,openat -o /tmp/trace.log ./daemon

And of course it ran clean. Twenty times. Forty times. Not a single panic. At which point a sensible person concludes the tool is masking the bug, and I am occasionally a sensible person. Every syscall strace traps adds a context switch and a chunk of latency. That latency was enough to reorder my goroutines so the dangerous interleaving never landed.

A close-up of source code on screen

So I stopped trying to see the bug and started reasoning about what could only fail under speed. The shape of "fails when fast, passes when slow" almost always means two things are touching the same state without agreeing on who owns it. I went looking for shared state, and there it was: a results struct that one goroutine could nil out during shutdown while another was still mid-write. Under load, with no tracer slowing things down, the writer occasionally lost the race and dereferenced a nil.

the actual fix

The fix was boring, which is how you know it's correct. A sync.RWMutex around the lifecycle of the shared struct, and a check that shutdown waits for in-flight writers via a sync.WaitGroup before tearing anything down. Then I reached for the tool I should have started with:

go run -race ./...

The race detector found it instantly and pointed at the exact two lines. No timing games, no praying. It instruments the memory accesses directly rather than perturbing the schedule, so it catches the conflict regardless of who happened to win on a given run.

The lesson I keep having to relearn: strace is brilliant for "what is this process actually asking the kernel to do," and useless for "why does this concurrency bug only happen sometimes." For the second question, reach for the thing built to answer it. The race detector existed the whole time. I just enjoy suffering first.