Ramblings of an aging IT geek
golang

The Goroutine Leak Hiding in a Timeout

A slow climb in goroutine count turned out to be senders blocked on an unbuffered channel after the receiver had already given up and returned.


The service didn't crash. That was almost the problem. It just got slightly slower and slightly fatter every day, until a week later it was using three times the memory it should and nobody could say why. The giveaway was the goroutine count on the metrics dashboard, a slow, relentless climb that never came back down.

The pattern was the classic one. A handler fanned out work to a few goroutines, collected results over an unbuffered channel, and returned as soon as it had the first result or its timeout fired. The trouble is what happens to the others. Once the handler had returned, nothing was reading the channel any more, so every straggler blocked forever trying to send. A goroutine blocked like that is never garbage-collected, and neither is its stack or anything it still references. Do that a few thousand times an hour and the climb is exactly what you'd expect.

The fix was a one-line change: give the channel enough buffer that every sender can deposit its result and exit, whether anyone is listening or not.

results := make(chan result, len(workers))

With the buffer in place, the senders never block and the goroutines finish. The count flattened out within minutes of deploying. I'd written that exact bug into a talk slide years ago as the thing not to do, then walked straight into it on a Tuesday. Knowing the failure mode and spotting it in your own code are, apparently, two different skills.