Ramblings of an aging IT geek
golang

the goroutine leak that hid behind a channel nobody read

A slowly growing goroutine count in a Go service traced to workers blocked forever on a channel send after their reader gave up.

The service was not crashing. That was almost the worst part. It just got slowly, steadily worse over days. Memory crept up, the goroutine count crept up alongside it, and every few days it would get fat enough that I would restart it and reset the clock. A restart-driven memory management strategy is a confession, not a fix, and I was tired of confessing.

The shape of it was textbook in hindsight. Goroutines are cheap, which is exactly why they are dangerous: you spawn them without thinking, and a leaked one costs you nothing visible right up until you have tens of thousands of them, each holding a little stack and whatever it captured in its closure.

Seeing the count climb

Go gives you the number for free. runtime.NumGoroutine() is one line, and net/http/pprof serves the full goroutine dump over HTTP. I had wired pprof in months earlier and never used it in anger. This was the day.

import (
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)
// then later
go http.ListenAndServe("localhost:6060", nil)

A quick curl 'localhost:6060/debug/pprof/goroutine?debug=2' (quoted, so the shell leaves the ? alone) and there they were, thousands of goroutines all parked at the same line, all blocked on a channel send. Not a few. Thousands, all identical, all waiting forever to send on a channel that nothing was reading any more.

The actual bug

The pattern was a worker pool. A request would fan out work to a handful of goroutines, each computing a result and sending it back on a results channel. The caller would read results until it had what it needed, then return.

The bug lived in that "until it had what it needed". On the happy path the caller drained every result and everyone went home. But the caller had a timeout, and when it fired, the caller returned early. The workers it had launched were still running. They finished their work, went to send their result on the channel, and found nobody reading. An unbuffered channel send blocks until a receiver shows up. No receiver was ever coming. So the goroutine parked on that send line and stayed there, forever, holding its stack and its captured request data.

Under light load you would never notice. Under load with a steady trickle of timeouts, every timed-out request leaked a goroutine or three, and they accumulated. The memory was not really the channels. It was thousands of frozen stacks, plus everything they referenced, that could never be collected because the runtime still counts a blocked goroutine as live. It is not runnable, just parked with no way of knowing it will never be woken.

Fixing it properly

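The lazy fix is a buffered channel big enough to absorb the orphaned sends. That only holds while the buffer is at least as large as the number of workers, and the workers still run to completion on work nobody wants, so it is moving the deadline rather than fixing anything. The right answer is the one Go has been telling us to use all along: give the workers a way to be told to stop, and make the send abandonable.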

select {
case results <- r:
case <-ctx.Done():
    return
}

That select is the whole fix. The worker tries to send its result, but if the context has been cancelled, because the timeout fired or the caller returned and its deferred cancel() ran, it takes the other branch and returns instead of blocking forever. Pair it with a context.WithTimeout on the caller side and a defer cancel(), and the workers learn that the party is over instead of waiting in an empty room.

What I actually learned

Every leaked goroutine was waiting on a channel operation that could never complete, and the garbage collector never reclaims a goroutine that is blocked forever because, as far as the runtime is concerned, it might wake up any moment. It never will, but the runtime cannot know that.

The rule I have written on the wall now: every goroutine needs a guaranteed way to exit, and every blocking channel operation in a goroutine that outlives a request needs a ctx.Done() escape hatch in the select. Spawning a goroutine is making a promise that it will eventually stop. I had been making that promise and quietly breaking it a few thousand times a day. pprof is the tool that caught me, and runtime.NumGoroutine() on a dashboard is now the first metric I reach for when a Go service gets mysteriously heavy.