the bug that only existed when nobody was watching

Terminal showing a stack trace

A heisenbug is a bug that changes its behaviour when you try to observe it, named after the physicist for reasons that feel cute right up until you have one. I had one this week. A small data-ingest service that crashed roughly once every few hours under load, and which became perfectly, infuriatingly well-behaved the moment I attached a debugger or strace to it. The act of looking made it stop.

This is the worst kind of bug to have, because every tool you'd normally reach for makes the symptom vanish. But it's also one of the most informative, if you stop swearing at it long enough to think about why.

the symptom

The service reads batches off a queue, processes them across a handful of worker threads, and writes results to a local socket that another process drains. Under real traffic it would occasionally die with a write to a closed file descriptor. Not often. Not reproducibly. Just often enough to page someone overnight and not often enough to catch in the act.

I did the obvious thing first and ran it under strace:

strace -f -e trace=write,close -o /tmp/ingest.trace ./ingest

It ran clean for a day. No crash. The instant I detached strace, it died again that evening. At that point I had my actual clue, even if it didn't feel like one yet.

what observation actually changes

strace doesn't run your program in a sandbox. What it does is intercept every system call via ptrace, which means each syscall now involves a context switch out to the tracer and back. It makes your program slower, and crucially it changes the timing between threads. A debugger does the same thing, just more so.

If a bug appears without the tracer and vanishes with it, the bug is almost always about timing. You've got a race. The tracer is perturbing the schedule just enough to hide it, the way pressing gently on a creaking floorboard stops the creak without fixing the joist.

Source diff highlighting a shared file descriptor

So I stopped trying to catch it under a tracer, because the tracer was the problem, and went back to reading the code with that hypothesis in mind: two threads, one file descriptor, no synchronisation.

the actual cause

There it was. On a reconnect, a worker thread would close() the result socket and open a fresh one, storing the new fd in a shared struct. A second worker, mid-flight, had already loaded the old fd into a register and was about to write() to it. Sometimes the close won the race. The second worker then wrote to a descriptor number that had, in the gap, been recycled by the OS for something else entirely, or was simply closed. Hence the intermittent write-to-closed-fd, and hence why it only happened under load when reconnects and writes actually overlapped.

strace hid it because the extra latency on every syscall spread the threads out in time. The window where both touched the fd narrowed to almost nothing, so I almost never lost the race while traced. Take the tracer away and the threads bunched back together and the window reopened.

how i pinned it without the tracer

Since I couldn't observe it directly, I made it worse on purpose. I added a deliberate sleep of a few milliseconds between "decide to reconnect" and "actually close", built a debug binary, and pointed a load generator at it. The crash went from once every few hours to once every few seconds. A reliable repro is most of a fix. Widening the race rather than narrowing it is the trick the tracer was doing for me in reverse.

The fix itself was dull, which is how you know it's right. The reconnect path now takes the same lock the writers take, swaps the fd under that lock, and the old descriptor isn't closed until the last writer has let go of it. A reference count and a mutex, the unglamorous bread and butter of concurrent code.

the lesson i keep relearning

When a bug runs away from your tools, the tools are telling you something. A bug that hides under strace or gdb is shouting "timing" at you. Don't fight to observe it on its own terms. Form the hypothesis, then change the timing yourself in the direction that makes it more likely, not less. The fastest way to fix a race is to lose it on demand.

I also added a metric for reconnect events, because the thing I really couldn't forgive myself for was not knowing how often that path even fired. It fired a lot more than I'd assumed. It usually does.