Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the bug that only happened when nobody was watching

A race condition in a worker that disappeared the moment I attached strace, and the timing that finally gave it away.

A terminal full of system call output

The worst kind of bug is the one that fixes itself the moment you look at it. We had a queue consumer that would, perhaps once every few thousand jobs, drop a message on the floor. No error, no log line, nothing in the dead-letter queue. The job simply never happened. Reproducing it locally was impossible because locally it never failed.

So I did the obvious thing and attached strace -f -p $PID. And of course it stopped failing. Completely. Ran clean for an hour. The instant I detached, it started dropping jobs again within minutes. A textbook heisenbug: the act of observing it changed the timing enough to hide it.

That last bit is the clue, though. If slowing the process down makes the bug go away, the bug is about timing. strace adds latency to every syscall, which is enough to perturb a race. So instead of trying to catch it in the act, I started reasoning about what two things could be racing.

A close-up of source code on screen

The consumer acknowledged the message to the broker, then handed the payload to a worker goroutine via an unbuffered channel. The acknowledgement and the handoff were in the wrong order. We told the broker "done, don't redeliver" and only afterwards tried to put the work somewhere. If the worker pool was momentarily full and the context deadline fired in that window, the send was abandoned. Message acknowledged, work discarded. No error because the cancellation was, as far as the code was concerned, a perfectly normal shutdown path.

strace hid it because the extra microseconds per syscall meant the worker pool was effectively never full at exactly the wrong instant. Reorder the timeline and the gap closes.

The fix was three lines: do the work first, acknowledge second, and make the redelivery idempotent so a crash between the two doesn't double-process. I confirmed it not with strace but by adding a counter that incremented every time the send-to-worker path was abandoned, then hammering it with a load test that kept the pool saturated. Before the fix, the counter ticked. After, it stayed at zero through ten million jobs.

The lesson I keep relearning: when a probe makes a bug vanish, the probe is telling you something. Don't fight it for a clean reproduction. Ask what about the timing it changed, and follow that back to the race.