the goroutine leak that hid behind a healthy graph

A screen of Go source code in a dark editor

The service did not crash. That is what made it irritating. Memory crept up over roughly four days, from a comfortable 180MB to something north of a gigabyte, and then the box would OOM and systemd would restart it and the four-day clock would start again. CPU was flat. Request latency was flat. The error rate was zero. By every dashboard I had, the thing was perfectly healthy right up until it wasn't.

I want to be honest about how long this took me, because I think the honesty is the useful part. I spent the first day looking in entirely the wrong place. A slow climb in RSS reads, to me, as a heap leak, so I went hunting for retained objects: a map that never had keys deleted, a cache without a bound, a slice I kept appending to. I added pprof, pulled a heap profile, stared at it, and found nothing damning. The heap was big, sure, but it was big in a way that looked like a lot of live goroutine stacks rather than a lot of live data. Which, with hindsight, was the entire answer sitting in plain sight. I just didn't read it that way yet.

the number that should have been flat

The thing that finally turned the light on was expvar. I'd been exporting runtime.NumGoroutine() almost as an afterthought, the way you do, and I'd never actually looked at it. When I finally graphed it against memory, the two lines were the same line. Goroutines climbed in lockstep with RSS, never coming down, a clean diagonal across four days.

So it wasn't data I was leaking. It was goroutines. Each one carries its own stack, and a few thousand of them sitting blocked is megabytes you never get back. The heap profile had been telling me this; I'd just been asking it the wrong question.

pprof has a goroutine dump, and it is the single most useful thing in this situation. You get it by hitting the debug endpoint:

go tool pprof http://localhost:6060/debug/pprof/goroutine

or, more bluntly, you ask for the full text dump with ?debug=2 and you read the stacks like a doctor reading an x-ray. When I did, the picture was immediate. Thousands of goroutines, all parked at exactly the same line, all blocked on a channel send.

A close-up of program code on a monitor

the actual bug

The pattern was a fan-out worker that did some work and then reported its result back over a channel. Stripped down, it looked like this:

func (s *Server) handle(ctx context.Context) error {
    results := make(chan result)
    for _, job := range jobs {
        go func(j job) {
            r := do(j)
            results <- r // blocks until someone reads
        }(job)
    }

    select {
    case r := <-results:
        return s.use(r)
    case <-ctx.Done():
        return ctx.Err()
    }
}

Read that select again. It takes the first result and returns, or it bails on context cancellation and returns. Either way, it returns. And the moment it returns, nobody is reading results any more.

But the other goroutines are still running. They finish their work, they reach results <- r, and there is no receiver. The channel is unbuffered. So they block on the send. Forever. They are not waiting on a timeout, they are not selecting on the context, they are just stuck mid-send with a stack frame pinned in memory, and the garbage collector cannot touch any of it because the goroutine is technically still alive and reachable.

Every request that returned early, which was most of them under load, abandoned a handful of goroutines to block eternally on a send nobody would ever receive. A few per request, thousands per hour, a gigabyte over a long weekend. The graph stayed flat because none of this was an error. The goroutines weren't doing anything wrong. They were doing exactly what I'd written, which was to wait politely until the end of time.

the fix, and the lesson under it

The narrow fix is to make the send respect cancellation, so a goroutine that has nowhere to deliver its result gives up rather than blocking:

go func(j job) {
    r := do(j)
    select {
    case results <- r:
    case <-ctx.Done():
    }
}(job)

Now when the handler returns, it cancels the context (deferred cancel() from context.WithCancel, which I should have had and didn't on the first version), the abandoned workers select ctx.Done() instead of blocking on the send, and they exit. The stacks unwind, the memory comes back, and NumGoroutine settles to a flat, boring line.

You can also do it with a buffered channel sized to the number of jobs, so every send has somewhere to land whether anyone reads it or not. That works and it's tempting because it's one character of change. I went with the context version because it doesn't just stop the leak, it actually stops the work: a cancelled request no longer pins workers that are computing results destined for the bin.

The broader thing I took from it is that "every goroutine you start needs a guaranteed way to finish" is one of those rules you nod along to and then quietly break the first time a select returns early. A goroutine is not free, and an unbuffered channel send is a blocking operation that will wait for you indefinitely without complaint. There is no error, no log line, no panic. It just sits there, holding its stack, looking healthy on every graph except the one I wasn't watching.

I export NumGoroutine on everything now, and I put it on the same dashboard as memory, right next to each other, because when those two lines start to rhyme you already know what you're looking for.