Ramblings of an aging IT geek
golang

the goroutines that never came home

A goroutine leak caused by blocking forever on a send to a channel nobody would ever read from, and the unbuffered-channel mistake behind it.

[Image: A terminal showing a Go program's runtime metrics]

The memory graph climbed, but slowly, and the heap profile looked innocent. Nothing in the heap was obviously growing. That's because the thing leaking wasn't memory I'd allocated, it was goroutines, and each one was sat there politely holding a few kilobytes of stack and refusing to die.

The giveaway was runtime.NumGoroutine(), which I'd helpfully wired into a metrics endpoint and then ignored for weeks. It only ever went up. A healthy service breathes: goroutines spike under load and fall back when it eases. Mine were one-way.
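Wiring the count in is only a few lines, for what it's worth. A minimal sketch using the standard library's expvar package; the metric name goroutines and the port are placeholders picked for illustration, and a real metrics stack may well look different:

```go
package main

import (
	"expvar"
	"log"
	"net/http"
	"runtime"
)

// goroutineCount is evaluated on every scrape, so the value is always live.
func goroutineCount() any {
	return runtime.NumGoroutine()
}

func main() {
	// Importing expvar registers a JSON handler at /debug/vars on the
	// default mux. "goroutines" is a hypothetical metric name.
	expvar.Publish("goroutines", expvar.Func(goroutineCount))
	log.Fatal(http.ListenAndServe("localhost:8080", nil))
}
```

Scrape /debug/vars, graph the goroutines field, and watch the trend rather than any single reading.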

It's worth saying why a leaked goroutine is sneakier than a leaked allocation. A goroutine that's blocked is, by definition, doing nothing, so it never shows up in a CPU profile. Its stack starts at a couple of kilobytes and is managed by the runtime outside the sampled heap, so the heap profile shrugs. It holds no lock anyone's contending for. It is the quietest possible kind of bug: a thing that exists, costs you a little, and announces itself only in aggregate, weeks later, as a line that slopes gently upward.

The cause was a worker that did some work, then tried to report its result on a channel:

func worker(jobs <-chan Job, results chan<- Result) {
    for j := range jobs {
        results <- process(j)   // blocks forever if nobody is reading
    }
}

results was unbuffered, and on the read side I had a select with a timeout. When the timeout fired, the reader gave up and moved on. The worker, mid-send, did not get the memo. It blocked on results <- ... waiting for a receiver that had already walked away, and there it stayed, forever, for the life of the process. One abandoned send per timed-out job. Under any real load that adds up fast.
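The failure reproduces in a few dozen lines. This is a sketch under assumptions, not the original service: it starts one goroutine per call instead of a pool ranging over jobs, and slowResult, fetchWithTimeout and leakedAfter are made-up names standing in for process, the reader, and a crude leak check.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// slowResult stands in for process(j): it takes longer than the reader waits.
func slowResult() int {
	time.Sleep(50 * time.Millisecond)
	return 42
}

// fetchWithTimeout starts a worker, then gives up waiting after timeout.
// The worker is left mid-send on an unbuffered channel with no receiver.
func fetchWithTimeout(timeout time.Duration) (int, bool) {
	results := make(chan int) // unbuffered: a send needs a waiting receiver
	go func() {
		results <- slowResult() // blocks forever once the reader has left
	}()
	select {
	case r := <-results:
		return r, true
	case <-time.After(timeout):
		return 0, false // reader walks away; nothing ever drains results
	}
}

// leakedAfter runs n timed-out calls and reports how many goroutines remain
// beyond the starting count once the workers have had time to reach the send.
func leakedAfter(n int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		fetchWithTimeout(time.Millisecond)
	}
	time.Sleep(200 * time.Millisecond) // let every worker reach its send
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("leaked goroutines:", leakedAfter(100))
}
```

Every timed-out call should leave one goroutine behind, so the number printed tracks the number of calls.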

[Image: A diagram of goroutines blocked on a channel send]

The fix has two halves. Give the worker a way out, and give it a reason to take it. A context carries the cancellation, and a select on the send means the worker stops waiting the moment the job is no longer wanted:

func worker(ctx context.Context, jobs <-chan Job, results chan<- Result) {
    for j := range jobs {
        select {
        case results <- process(j):
        case <-ctx.Done():
            return
        }
    }
}
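The reader's half isn't shown, so here is one way it might look, as a sketch only. runOne, the one-job setup, and the stand-in process are assumptions for illustration; the worker is the one from the listing above, copied so the program compiles on its own.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type Job struct{ ID int }
type Result struct{ ID int }

// process is a stand-in for the real work; it takes 50ms.
func process(j Job) Result {
	time.Sleep(50 * time.Millisecond)
	return Result{ID: j.ID}
}

func worker(ctx context.Context, jobs <-chan Job, results chan<- Result) {
	for j := range jobs {
		select {
		case results <- process(j):
		case <-ctx.Done():
			return
		}
	}
}

// runOne hands a single job to a fresh worker and waits at most timeout.
// It reports whether a result arrived, and returns a channel that is closed
// once the worker goroutine has exited.
func runOne(timeout time.Duration) (got bool, exited <-chan struct{}) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel() // runs on every return path, releasing the worker

	jobs := make(chan Job, 1)
	results := make(chan Result)
	done := make(chan struct{})
	go func() {
		defer close(done)
		worker(ctx, jobs, results)
	}()
	jobs <- Job{ID: 1}
	close(jobs)

	select {
	case <-results:
		return true, done
	case <-ctx.Done():
		return false, done // the same deadline the worker is watching
	}
}

func main() {
	got, exited := runOne(time.Millisecond)
	<-exited // without the ctx.Done() case this would block forever
	fmt.Println("result received:", got)
}
```

One thing the sketch makes visible: Go evaluates process(j) on entering the select, before any case can be chosen, so the worker only notices cancellation once the work itself has finished. If process can run long, it needs to take the context too.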

Now when the reader gives up, it cancels the context, the worker's select takes the ctx.Done() case instead of waiting on the send, and the goroutine returns instead of leaking.

When I'm not sure whether something is leaking, I dump the full set of stacks and look for a crowd:

curl -s 'localhost:6060/debug/pprof/goroutine?debug=2' | grep -c 'chan send'

If hundreds of goroutines are all parked on the same channel send, you've found your puddle. Drop the grep and read the dump itself: the stack frames listed under each goroutine's header tell you exactly which send it is.
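That curl only works if the service is actually serving the pprof handlers. A minimal sketch, assuming the default mux and the localhost:6060 address from the command above; countParked is a hypothetical helper that does the same count in-process, which can be handy in a test:

```go
package main

import (
	"bytes"
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
	"runtime/pprof"
	"strings"
)

// countParked returns how many goroutines in this process are currently in
// the given wait state, e.g. "chan send". It reads the same debug=2 text
// dump that the HTTP endpoint serves.
func countParked(state string) int {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return 0
	}
	n := 0
	for _, line := range strings.Split(buf.String(), "\n") {
		// Header lines look like: goroutine 18 [chan send, 3 minutes]:
		if strings.HasPrefix(line, "goroutine ") && strings.Contains(line, "["+state) {
			n++
		}
	}
	return n
}

func main() {
	log.Printf("goroutines parked on a send at startup: %d", countParked("chan send"))
	// Bind to localhost only; profiling endpoints should not be public.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```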

The lesson I keep relearning: a blocking send is a blocking receive's problem too. Every ch <- x is a promise that someone, somewhere, will eventually read. If there's any path where that reader can disappear, your sender needs a select and an escape hatch, or it'll sit blocked until the heat death of the process. And put NumGoroutine on a graph. A leak you can see is a leak you'll fix; the one I didn't see ran for weeks.