Ramblings of an aging IT geek
golang

the goroutine that never came home

A slow goroutine leak in a Go service caused by a channel nobody ever read from, and the context cancellation that fixed it.

The service ran fine for about a day, then the memory graph started its slow climb. Not a spike, nothing dramatic. Just a steady, polite creep upwards until the box ran out of headroom and the OOM killer stepped in. Restart, and the clock began again.

My first instinct was the heap, so I reached for pprof. But the heap profile looked reasonable. What did not look reasonable was the goroutine count, which had climbed into the tens of thousands and kept going.
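The goroutine profile is the quickest way to see where they are all parked. Here is a minimal sketch of pulling it from inside a process, not the code from the real service: goroutineDump is a hypothetical helper, and the three parked goroutines in main exist only so the dump has something to show.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"time"
)

// goroutineDump (hypothetical helper) returns the number of live goroutines
// and a text dump of their stacks, taken from the built-in "goroutine" profile.
func goroutineDump() (int, string) {
	p := pprof.Lookup("goroutine")
	var buf bytes.Buffer
	// debug=1 groups goroutines that share a stack and prints a count per group,
	// which makes thousands of identical parked goroutines easy to spot.
	_ = p.WriteTo(&buf, 1)
	return p.Count(), buf.String()
}

func main() {
	// Park a few goroutines on a channel nobody will ever send to.
	block := make(chan struct{})
	for i := 0; i < 3; i++ {
		go func() { <-block }()
	}
	time.Sleep(50 * time.Millisecond) // give them time to park

	n, dump := goroutineDump()
	fmt.Println("goroutines:", n)
	fmt.Println(dump)
}
```

In a long-running service the same profile is usually easier to reach over HTTP: importing net/http/pprof registers it under /debug/pprof/goroutine on the default mux.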

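Each goroutine holds its own stack, starting at a couple of kilobytes and growing as needed, and a few kilobytes times tens of thousands is real money. Stack memory also does not show up in the heap profile, which is why the heap looked innocent. So I had a goroutine leak, which in Go almost always means goroutines blocked on a channel that nobody will ever send to or receive from. The garbage collector does not reclaim a blocked goroutine. It just sits there, parked, forever.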

The culprit was a worker that fired off a request and waited on a result channel:

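// fetch starts the request in a goroutine and blocks until a result arrives.
func (c *Client) fetch(id string) (*Result, error) {
    ch := make(chan *Result) // unbuffered: a send blocks until someone receives
    go c.doRequest(id, ch)
    return <-ch, nil
}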

Reads fine. The problem is what happens upstream. The caller had a timeout, and when it expired it walked away from fetch entirely. But doRequest was still running, and when it finally finished it tried to send on ch, an unbuffered channel with no receiver left. So it blocked. Forever. One leaked goroutine per timed-out request, and on a bad day there were a lot of those.
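The leak is easy to reproduce in isolation. The snippet above does not show where the caller's timeout lives, so this sketch assumes it is a select around the receive. slowRequest, leakyFetch and leakedAfter are hypothetical names, and the sleep stands in for a slow backend.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
	"time"
)

// slowRequest stands in for doRequest: it takes longer than the caller's
// timeout, then tries to hand its result over on an unbuffered channel.
func slowRequest(ch chan<- string) {
	time.Sleep(20 * time.Millisecond)
	ch <- "result" // blocks forever once the receiver has gone away
}

// leakyFetch gives up after timeout, abandoning the channel.
func leakyFetch(timeout time.Duration) (string, error) {
	ch := make(chan string) // unbuffered
	go slowRequest(ch)
	select {
	case r := <-ch:
		return r, nil
	case <-time.After(timeout):
		return "", errors.New("timed out")
	}
}

// leakedAfter runs n timed-out fetches and reports how many extra
// goroutines are still alive once every request has had time to finish.
func leakedAfter(n int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		_, _ = leakyFetch(time.Millisecond)
	}
	time.Sleep(100 * time.Millisecond) // let every slowRequest reach its send
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("leaked goroutines:", leakedAfter(50))
}
```

Each timed-out call leaves one slowRequest parked on its send, so the reported number grows with the number of timeouts and never drops back.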

The fix is the boring, correct one: thread a context through and stop pretending a goroutine will tidy up after itself.

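// fetch returns the result, or ctx.Err() if the context is cancelled first.
func (c *Client) fetch(ctx context.Context, id string) (*Result, error) {
    ch := make(chan *Result, 1) // capacity 1: doRequest's single send never blocks
    go c.doRequest(ctx, id, ch)
    select {
    case r := <-ch:
        return r, nil
    case <-ctx.Done():
        return nil, ctx.Err()
    }
}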

Two changes carry the weight. The channel is buffered with capacity one, so doRequest, which sends exactly once, can always complete that send even if nobody is listening. Then it returns and its stack is freed like any other finished goroutine's. And the select means the caller stops waiting the moment the context is cancelled, rather than blocking on a result that may never come.
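To check that the goroutine count really does come back down, here is a self-contained sketch of the same shape. Result, doRequest and remainingAfter are stand-ins I made up for the example. This doRequest deliberately ignores its context, to show that the buffer alone is enough to let it exit.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// Result is a placeholder for the real result type.
type Result struct{ ID string }

// doRequest is a hypothetical stand-in for the real call. It ignores ctx on
// purpose and always outlives the caller's timeout.
func doRequest(ctx context.Context, id string, ch chan<- *Result) {
	time.Sleep(20 * time.Millisecond)
	ch <- &Result{ID: id} // single send into a capacity-1 buffer: never blocks
}

// fetch mirrors the fixed version: buffered channel plus select on ctx.Done().
func fetch(ctx context.Context, id string) (*Result, error) {
	ch := make(chan *Result, 1)
	go doRequest(ctx, id, ch)
	select {
	case r := <-ch:
		return r, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// remainingAfter runs n fetches that all time out and reports how many extra
// goroutines are still alive after the requests have had time to finish.
func remainingAfter(n int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
		_, _ = fetch(ctx, "x")
		cancel()
	}
	time.Sleep(100 * time.Millisecond) // let every doRequest send and return
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("goroutines still alive:", remainingAfter(50))
}
```

In real code doRequest should honour the context too, for example by building its HTTP request with http.NewRequestWithContext, so that cancelled work stops early instead of running to completion for nobody.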

What annoyed me is that none of this was subtle. "A goroutine started and never finished" is the most ordinary leak in the language. I just hadn't been looking at the goroutine count, because the heap profile was where I expected the answer to be. It wasn't.

Now there's an alert on goroutine count, and runtime.NumGoroutine() gets logged on a timer. If it climbs and never comes back down, something is parked on a send or receive that will never complete. That's the whole tell, and once you've seen it you see it everywhere.
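The logging side needs very little code. A minimal sketch, assuming a plain ticker loop: logGoroutines and runFor are hypothetical names, and the log function is passed in so the example can count samples.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"runtime"
	"time"
)

// logGoroutines logs the current goroutine count every interval until ctx is
// cancelled. It returns on cancellation, so it cannot itself become a leak.
func logGoroutines(ctx context.Context, interval time.Duration, logf func(format string, args ...any)) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			logf("goroutines=%d", runtime.NumGoroutine())
		case <-ctx.Done():
			return
		}
	}
}

// runFor runs the logger synchronously for roughly d and returns how many
// samples it logged. It exists only to demonstrate the loop.
func runFor(d, interval time.Duration) int {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()
	n := 0
	logGoroutines(ctx, interval, func(format string, args ...any) {
		n++
		log.Printf(format, args...)
	})
	return n
}

func main() {
	fmt.Println("samples logged:", runFor(60*time.Millisecond, 10*time.Millisecond))
}
```

In a service this would run in the background for the life of the process, something like go logGoroutines(ctx, time.Minute, log.Printf), with the alert built on whatever collects those log lines.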