The graph was the giveaway, eventually. RSS on one service climbed about 40 MB a day, every day, dead straight. No sawtooth, no GC clawing it back. Just a ramp. Restart it, flat for a while, then the ramp again. Classic leak, except this is Go and I had told myself for years that Go doesn't leak memory the way C does. It doesn't. It leaks goroutines, which is worse, because a parked goroutine keeps everything reachable from its stack and its closure alive, and just sits there forever looking innocent.
The number that actually mattered was go_goroutines. We export it from the runtime, it sits in Prometheus, and I had never once looked at it. When I finally did, it tracked the memory ramp exactly. Thousands of goroutines, climbing, never coming down.
The cause was the kind of thing that reads fine in review. A request handler spun off a worker to do some background fan-out, and the worker sent its result back on a channel:
	func handle(ctx context.Context) Result {
		ch := make(chan Result)
		go func() {
			ch <- doWork() // blocks here forever if nobody reads
		}()
		select {
		case r := <-ch:
			return r
		case <-ctx.Done():
			return Result{} // caller gives up, ch is never read
		}
	}
Spot it? When the context cancels, we return. The goroutine is still there: once doWork() finishes it blocks on the send to ch, because the channel is unbuffered and the only reader has wandered off. Under load the timeouts fired often enough that we orphaned a steady trickle of these, each one pinning its result and its stack. The leak grew in step with our timeout rate, which is why it was so suspiciously linear.
The fix is dull, which is the right kind of fix. Buffer the channel so the send never blocks:
ch := make(chan Result, 1)
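One slot. The worker sends, the value lands in the buffer whether anyone is listening or not, the goroutine returns, and once nothing references the channel it all gets collected. The worker still runs doWork() to completion after a timeout; the buffer just guarantees it can exit when it's done. That's it.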
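If you want to watch the difference rather than take my word for it, here is a small sketch that runs the same handler with both channel sizes and counts the goroutines left behind. Result, doWork, the buf parameter and leakedAfter are stand-ins I made up for the demo. The context is cancelled up front to simulate a caller that has already given up.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// Result and doWork are hypothetical stand-ins for the real handler's types.
type Result struct{ N int }

func doWork() Result {
	time.Sleep(20 * time.Millisecond) // pretend to do some fan-out
	return Result{N: 1}
}

// handle mirrors the handler from the post; buf is the channel capacity
// (0 reproduces the leak, 1 is the fix).
func handle(ctx context.Context, buf int) Result {
	ch := make(chan Result, buf)
	go func() {
		ch <- doWork() // with buf == 0 this parks forever once the caller is gone
	}()
	select {
	case r := <-ch:
		return r
	case <-ctx.Done():
		return Result{}
	}
}

// leakedAfter abandons n requests, waits for every worker to get past
// doWork, and reports how many extra goroutines are still alive.
func leakedAfter(n, buf int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		ctx, cancel := context.WithCancel(context.Background())
		cancel() // the caller has already given up
		handle(ctx, buf)
	}
	time.Sleep(200 * time.Millisecond) // far longer than doWork takes
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("unbuffered, leaked goroutines:", leakedAfter(50, 0))
	fmt.Println("buffered,   leaked goroutines:", leakedAfter(50, 1))
}
```

The unbuffered run should leave one parked goroutine per abandoned request, and the buffered run should leave none.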
I went looking for others afterwards and found two more of the same shape. goleak from the uber-go folks catches these in tests if you remember to add it, and I have now bolted it onto the packages that spawn goroutines. It runs at the end of a test and shouts if anything is still running that shouldn't be. It would have caught all three of these on day one.
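With goleak itself the usual entry points are defer goleak.VerifyNone(t) in a single test, or goleak.VerifyTestMain(m) in TestMain for a whole package. The sketch below is not goleak. It is a stdlib-only illustration of the underlying idea: dump every goroutine's stack and look for ones parked where they shouldn't be. The helpers blockedSenders and stuckAfterSpawning are made up for the demo, and matching on the "[chan send" state string is a crude filter compared to what goleak does.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"time"
)

// blockedSenders counts goroutines whose state in a full stack dump
// is "chan send", i.e. parked on a channel send.
func blockedSenders() int {
	buf := make([]byte, 1<<20)
	n := runtime.Stack(buf, true) // true = include every goroutine
	// Each goroutine's header looks like "goroutine 7 [chan send]:".
	return bytes.Count(buf[:n], []byte("[chan send"))
}

// stuckAfterSpawning starts n goroutines that send on a channel nobody
// reads and reports how many new blocked senders show up in the dump.
func stuckAfterSpawning(n int) int {
	before := blockedSenders()
	ch := make(chan int) // unbuffered and never read: the leak in miniature
	for i := 0; i < n; i++ {
		go func() { ch <- 1 }()
	}
	time.Sleep(50 * time.Millisecond) // give the senders time to park
	return blockedSenders() - before
}

func main() {
	fmt.Println("goroutines stuck in chan send:", stuckAfterSpawning(3))
}
```

The same dump is what you would read by hand from a goroutine profile on a live service, which is a reasonable next step once go_goroutines tells you something is piling up.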
The lesson I keep relearning: in Go the leak isn't the bytes, it's the goroutines, and the metric that tells you is already there. Plot go_goroutines next to RSS on every service you run. If they ramp together, you're not leaking memory, you're leaking workers, and somewhere there's a channel nobody is reading.