There's a particular kind of bug that doesn't want solving at the keyboard. You know the shape of it: the test fails one time in twenty, the logs are clean, and every time you add a print statement the problem politely goes away. I'd had one of those for two days going into the new year. By the second afternoon I was no longer debugging, I was just rereading the same forty lines and hoping they'd confess.
So on Saturday I gave up and went for a ride instead.
the problem, briefly
The system was a small job queue. Workers pulled tasks off a Redis list, did some work, and wrote results back. Occasionally, and only under load, two workers would process the same task. Not often. Just often enough to corrupt a counter and make the numbers wrong in a way that someone downstream noticed and, quite reasonably, complained about.
I'd done the obvious things. The pop was atomic. The lock was a SET key value NX EX 30, which is the textbook pattern, and the release checked the value before deleting so I wasn't releasing someone else's lock. I'd read my own code so many times it had stopped meaning anything. The words were just there, like a sign you've walked past every day for a year and couldn't draw from memory.
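For concreteness, the acquire-and-release pattern described above looks roughly like this. Treat it as a sketch rather than the real code: FakeRedis is a made-up in-memory stand-in so the snippet runs without a server, and the key name and the acquire/release helpers are invented for illustration. One caveat the sketch glosses over: against a real Redis the check-and-delete in release has to happen as a single atomic step (a small server-side script is the usual way), otherwise the lock can change hands between the read and the delete.

```python
import time
import uuid


class FakeRedis:
    """Hypothetical in-memory stand-in for the few Redis commands used here."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, nx=False, ex=None):
        # Loosely mirrors SET key value [NX] [EX seconds].
        if nx and self.get(key) is not None:
            return None  # key already exists, NX refuses to overwrite
        expires_at = self._clock() + ex if ex is not None else None
        self._data[key] = (value, expires_at)
        return True

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # expire lazily on read
            return None
        return value

    def delete(self, key):
        return 1 if self._data.pop(key, None) is not None else 0


def acquire(client, key, ttl_seconds=30):
    """Try to take the lock; return a unique token on success, None if held."""
    token = uuid.uuid4().hex
    return token if client.set(key, token, nx=True, ex=ttl_seconds) else None


def release(client, key, token):
    """Delete the lock only if we still own it. Returns True if we deleted it."""
    # On a real server these two steps must run atomically.
    if client.get(key) == token:
        client.delete(key)
        return True
    return False


store = FakeRedis()
mine = acquire(store, "lock:task:42")
theirs = acquire(store, "lock:task:42")              # second caller is refused
released_wrong = release(store, "lock:task:42", "not-my-token")
released_right = release(store, "lock:task:42", mine)
```

The unique token is what makes the release safe: a worker can only delete a lock whose value it wrote itself.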
That's the state where more staring does nothing. The model of the system in my head had a flaw, and rereading the code that matched the flawed model only reinforced it. I needed to drop the model entirely and rebuild it, and you can't do that while you're holding the broken one up to the light.
the ride
It was cold and grey and the roads were wet, which is January in this part of the country doing exactly what it says on the tin. I had no plan beyond "out, along the river, and back before dark". No headphones, no podcast, nothing to fill the space. Just the noise of the tyres and the business of not falling off on the icy bits.
The thing nobody tells you about riding to think is that the trick is to not try to think. For the first hour my brain kept dragging itself back to the queue, turning the lock over, getting nowhere. I let it. Around the point where the lane climbs away from the water and you stop being able to spare any attention for anything except your legs, it finally shut up.
And then, somewhere past the halfway point, with no prompting at all, the actual answer arrived. Fully formed, the way these things do. It wasn't the lock. The lock was fine. The problem was the timeout.
My tasks could, under load, take longer than the 30-second lock expiry. So worker A grabs the task, sets a 30-second lock, and then gets stuck behind a slow database call for thirty-five seconds. At second thirty, the lock expires on its own. Worker B comes along, sees no lock, grabs the same task, and starts work. Now both are running. Worse, when A finishes, its release logic checks the value, sees its own value is gone (B holds the lock now), and correctly declines to delete, so no error is raised. Everything looks clean. The counter just quietly goes wrong.
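The timeline is easier to believe when you replay it with a clock you control. This is a toy model, not the real worker: Clock and ExpiringLock are hypothetical names, the lock is a single in-memory slot standing in for the Redis key, and the timestamps are the ones from the story above.

```python
class Clock:
    """Manually advanced clock so the timeline is deterministic."""

    def __init__(self):
        self.now = 0.0


class ExpiringLock:
    """Toy model of one lock key with a fixed expiry."""

    def __init__(self, clock):
        self.clock = clock
        self.owner = None
        self.expires_at = 0.0

    def _expire(self):
        # Drop the owner once the expiry time has passed.
        if self.owner is not None and self.clock.now >= self.expires_at:
            self.owner = None

    def acquire(self, who, ttl):
        self._expire()
        if self.owner is not None:
            return False
        self.owner, self.expires_at = who, self.clock.now + ttl
        return True

    def release(self, who):
        self._expire()
        if self.owner != who:
            return False  # not ours any more: decline quietly, no error
        self.owner = None
        return True


clock = Clock()
lock = ExpiringLock(clock)
counter = 0

a_got_lock = lock.acquire("A", ttl=30)   # t=0: A takes the task
clock.now = 29
b_early = lock.acquire("B", ttl=30)      # t=29: B is refused, as intended
clock.now = 31
b_late = lock.acquire("B", ttl=30)       # t=31: A's lock has expired, B gets in
clock.now = 35
counter += 1                             # A's slow call returns and A writes
a_released = lock.release("A")           # value check fails, nothing is raised
clock.now = 40
counter += 1                             # B finishes and writes as well
b_released = lock.release("B")
```

Every individual call here behaves correctly, and the counter still ends up at 2 for a single task.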
I hadn't seen it because I'd been looking at the locking code, and the locking code was correct. The bug was in the gap between the lock's lifetime and the work's lifetime, which is exactly the kind of thing that doesn't live in any single line you can point at.
why it works
I don't think there's anything mystical about it. When you stop feeding a problem new input, your head keeps chewing on the old input in the background, and crucially it does so without the tunnel vision you build up when you're actively concentrating. The fixation that helps you hold a complicated thing in your mind is the same fixation that stops you stepping outside it. Cycling, or walking, or washing up, gives the focused part of your brain something to do so the rest of it can wander off and try the answers you'd already rejected as too obvious.
The fix, for the record, came in two parts. First, make the lock timeout generous and have the worker renew it periodically while the task is still running: a lease rather than a fixed expiry. Second, make the result write idempotent so that even if the impossible happens, processing a task twice can't corrupt the counter. Defence in depth, because relying purely on a lock for correctness in a distributed system is how you end up writing blog posts like this one.
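Both halves of that fix can be sketched in the same toy style. Nothing here is the production code: Lease, record_result, the ten-second heartbeat and the "task-42" id are all assumptions made up for the example, and the loop stands in for what would really be a timer or background thread renewing the lock while the task runs. On a real store, the duplicate check and the counter update would also need to happen atomically.

```python
class Clock:
    """Manually advanced clock so the timeline is deterministic."""

    def __init__(self):
        self.now = 0.0


class Lease:
    """Toy lock whose owner can push the expiry forward while still working."""

    def __init__(self, clock):
        self.clock = clock
        self.owner = None
        self.expires_at = 0.0

    def _held(self):
        return self.owner is not None and self.clock.now < self.expires_at

    def acquire(self, who, ttl):
        if self._held():
            return False
        self.owner, self.expires_at = who, self.clock.now + ttl
        return True

    def renew(self, who, ttl):
        # Only the current, unexpired owner may extend the lease.
        if not self._held() or self.owner != who:
            return False
        self.expires_at = self.clock.now + ttl
        return True


def record_result(seen_task_ids, counters, task_id, amount):
    """Apply a task's result at most once, keyed on its task id."""
    if task_id in seen_task_ids:
        return False  # duplicate delivery: change nothing
    seen_task_ids.add(task_id)
    counters["total"] = counters.get("total", 0) + amount
    return True


clock = Clock()
lease = Lease(clock)
lease.acquire("A", ttl=30)

b_attempts = []
for t in range(1, 36):               # A's task takes 35 seconds
    clock.now = t
    if t % 10 == 0:
        lease.renew("A", ttl=30)     # heartbeat every 10 seconds
    if t == 31:
        b_attempts.append(lease.acquire("B", ttl=30))

seen, counters = set(), {}
first = record_result(seen, counters, "task-42", 1)
second = record_result(seen, counters, "task-42", 1)   # the "impossible" duplicate
```

With the heartbeat in place, B is turned away at second thirty-one because A's lease now runs to second sixty. And if a duplicate slips through anyway, the second write is a no-op.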
I got home, made a coffee, wrote the fix in about twenty minutes, and the flaky test went green and stayed green. Two days at the desk had got me nowhere; forty miles away from it, the answer turned up on its own the moment I stopped demanding it.
So this is less a bug report and more a note to self. When you've reread the same code three times and it's gone abstract, that's the signal. Stop. Go outside. The bug will still be there when you get back, but you probably won't need it to be.