the goroutines i forgot to stop

A screen full of Go source code

The service ran fine for days and then fell over. Memory crept up, the runtime spent more and more time in the scheduler, and eventually the box started swapping and the whole thing ground to a useless crawl. Restart it and the clock reset: a few days of calm, then the same slow decline. Classic leak, except I'd convinced myself Go didn't really let you leak, what with the garbage collector and all. It does. You just leak goroutines instead of memory, and the memory follows them out the door.

The service was a small ingest daemon. It pulled messages off a queue, fanned each one out to a handful of workers, collected the results, and wrote them on. Nothing exotic. The kind of thing you write in an afternoon and expect to run for a year.

what the numbers were saying

The first useful thing I did was add an expvar line publishing runtime.NumGoroutine() so I could actually watch the count instead of guessing. It told a very clear story:

12:00  goroutines=48
13:00  goroutines=311
14:00  goroutines=579
15:00  goroutines=844

That's not a memory leak in the usual sense. That's goroutines being created and never finishing, each one holding onto its stack and whatever it had closed over. The GC can't collect a goroutine that's still alive, even if it's blocked forever doing nothing. And mine were blocked forever doing nothing.

A graph-like screen of code and numbers

To find out what they were stuck on, I pulled in net/http/pprof, which is about three lines and gives you a goroutine dump on a URL. Hitting /debug/pprof/goroutine?debug=2 printed every live goroutine with its stack. Hundreds of them were sitting on the exact same line: a channel receive in the result collector. They were all waiting to send a result that nobody was ever going to read.

the bug, which was embarrassingly simple

Here is the shape of what I'd written. For each incoming message I started a few worker goroutines, gave each a results channel, and read from it:

func handle(msg Message) Result {
    results := make(chan Result)
    for _, w := range workers {
        go func(w Worker) {
            results <- w.Do(msg) // blocks until someone receives
        }(w)
    }
    // take the first result that comes back, return it
    return <-results
}

Spot it? I start, say, four workers. I read exactly one result, the fastest one, and return. The other three workers finish their work and then try to send on results. But handle has already returned, nobody is receiving, and results is unbuffered. So they block on the send. Forever. Three goroutines stranded per message, quietly, while the function that created them walked away whistling.

It worked in testing because in testing I sent a handful of messages and looked at the output, which was correct. The first result back was the right answer. The leak only shows up under sustained traffic over hours, which is precisely the condition a unit test doesn't reproduce.

the fix: a context, and a buffered channel

There were two correct fixes and I applied both. The cheap one: buffer the channel so a send never blocks even if nobody's listening.

results := make(chan Result, len(workers))

Now every worker can send and exit regardless of whether anyone reads. The unread results are simply garbage collected along with the channel once handle returns. That alone stops the leak dead.

The proper one, because I wanted the workers to actually stop wasting effort once I had my answer, was a context:

func handle(ctx context.Context, msg Message) Result {
    ctx, cancel := context.WithCancel(ctx)
    defer cancel()

    results := make(chan Result, len(workers))
    for _, w := range workers {
        go func(w Worker) {
            results <- w.Do(ctx, msg)
        }(w)
    }
    return <-results // defer cancel() signals the rest to give up
}

The defer cancel() fires the moment handle returns, the workers see ctx.Done() and abandon whatever they're doing, and nothing is left hanging. The buffered channel handles the race where a worker finishes in the same instant; the context handles the workers that are still grinding away on work I no longer need.

Two terminal windows showing program output

After deploying, the goroutine count sat flat at around 50 for a week, which is exactly where it should have been all along.

The lesson I took away, and it's one of those you only really learn by being bitten, is that every goroutine you start needs an answer to one question: how does it end? If you can't say, you've written a leak. A go statement is a promise that something will eventually finish, and the compiler will never hold you to it. The pprof goroutine dump is the tool that holds you to it instead, and it's now the first thing I reach for whenever a long-running Go process starts behaving as though it's getting tired.