The Goroutine Leak That Hid In Plain Sight

A service of mine had a habit of falling over after about a week. Not crashing, exactly. It would get slower and slower, the memory graph would creep up on a gentle diagonal, and then somewhere around day eight it would start missing health checks and get restarted. The restart cleared it, the graph reset to the floor, and the cycle began again. A weekly sawtooth, the classic signature of a leak.

The memory was the symptom everyone fixated on, because memory is what the dashboards show. But memory was downstream. The real number, the one I wasn't graphing, was the goroutine count.

the tell I was ignoring

Go makes this embarrassingly easy to check, which makes it more embarrassing that I wasn't. runtime.NumGoroutine() is right there, and net/http/pprof exposes a full goroutine dump over HTTP. I'd had pprof wired in the whole time and never looked at it for this, because the problem presented as memory and I went hunting for memory.
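Both numbers are also available from inside the process, without going through the HTTP endpoint. A minimal sketch, assuming nothing beyond the standard library; goroutineCount and goroutineDump are names I've made up for illustration, not anything from the service:

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// goroutineCount is the cheap gauge: how many goroutines currently exist.
func goroutineCount() int {
	return runtime.NumGoroutine()
}

// goroutineDump returns the goroutine profile in the same text form that
// /debug/pprof/goroutine?debug=1 serves: identical stacks grouped with a count.
func goroutineDump() string {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 1); err != nil {
		return ""
	}
	return buf.String()
}

func main() {
	fmt.Println("goroutines:", goroutineCount())
	fmt.Print(goroutineDump())
}
```

The count is the number worth graphing. The dump is what you reach for once the count looks wrong, because it groups goroutines by stack and shows where they are parked.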

When I finally pulled the goroutine profile from a box that had been up for six days, the answer was immediate:

$ curl -s 'localhost:6060/debug/pprof/goroutine?debug=1' | head -20
goroutine profile: total 48213
41097 @ 0x43e2c5 0x44d1f8 0x6f12a0 0x6f0e15
#       0x6f12a0  myservice/poller.(*Poller).watch+0x120
...

Forty-eight thousand goroutines, forty-one thousand of them parked in the same function. That is not memory pressure. That is a leak with a return address.

what the code thought it was doing

The leaking function was a per-request watcher. Each incoming request spun up a goroutine to poll an upstream and stream results back over a channel. The intent was: when the request ends, the watcher stops. The reality was that the watcher had no idea the request had ended, because I'd never told it.

Stripped down, it looked like this:

func (p *Poller) watch(key string) <-chan Event {
    out := make(chan Event)
    go func() {
        ticker := time.NewTicker(time.Second)
        defer ticker.Stop()
        for range ticker.C {
            ev := p.fetch(key)
            out <- ev // blocks forever once the reader is gone
        }
    }()
    return out
}

The bug is the send on an unbuffered channel, out <- ev. As long as something is reading from out, this is fine. But when a client disconnects, the handler returns and stops reading. The goroutine ticks again a second later, tries to send, and there is no receiver. So it blocks. On that send. Permanently. Go never reclaims a blocked goroutine, so it never returns, the deferred ticker.Stop() never runs, and every disconnected client leaves one of these behind forever.

Multiply by a service handling a few requests a second over a week and you get the forty-one thousand parked watchers in that profile, each holding its own stack, a ticker, a closure, and a reference to the upstream key. That's where the memory went. The goroutines were the leak; the memory was just what they were holding.
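The leak is easy to reproduce outside the service. Here is a rough sketch, assuming a stand-in Event type and fetch function (both hypothetical, as is the shortened tick) and a simulated client that reads one event and then walks away:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// Event and fetch are stand-ins for the real types; only the shape matters.
type Event struct{ Key string }

func fetch(key string) Event { return Event{Key: key} }

// leakyWatch mirrors the buggy watcher, with a shorter tick so the demo is fast.
func leakyWatch(key string) <-chan Event {
	out := make(chan Event)
	go func() {
		ticker := time.NewTicker(10 * time.Millisecond)
		defer ticker.Stop()
		for range ticker.C {
			out <- fetch(key) // parks forever once nobody receives
		}
	}()
	return out
}

// leakedAfter simulates n requests that each read one event and then go away,
// and reports how many extra goroutines are still alive afterwards.
func leakedAfter(n int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		ch := leakyWatch("k")
		<-ch // the "client" reads once, then disconnects
	}
	// Give every abandoned watcher time to reach its next send.
	time.Sleep(50 * time.Millisecond)
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("leaked goroutines:", leakedAfter(100))
}
```

On a quiet process this should report roughly one leaked goroutine per simulated request, and waiting longer does not bring the number back down.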

the fix is a context, as it almost always is

The honest fix in Go is the one I keep relearning: every goroutine needs a way to be told to stop, and that way is almost always a context.Context. The watcher should take the request's context and select on its Done() channel so a send can never block forever:

func (p *Poller) watch(ctx context.Context, key string) <-chan Event {
    out := make(chan Event)
    go func() {
        defer close(out)
        ticker := time.NewTicker(time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                select {
                case out <- p.fetch(key):
                case <-ctx.Done():
                    return
                }
            }
        }
    }()
    return out
}

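Three things changed. The outer select lets the goroutine wake on cancellation between ticks. The inner select around the send is the part I'd have missed a few years ago: even mid-send, if the context is cancelled, we bail instead of blocking on a dead receiver. And defer close(out) closes the channel on the way out, so a reader ranging over it sees the stream end rather than blocking in its turn. The caller passes r.Context() from the HTTP request, and when the client disconnects or the handler returns, net/http cancels that context for free. The watcher notices and returns. No leak. One caveat: p.fetch(key) is evaluated before the inner select starts waiting, so a fetch that hangs still delays the exit, and fetch wants the context too if the upstream can stall.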
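For completeness, this is roughly how the caller side fits together. Treat it as a sketch under assumptions: the interval field, the handleWatch handler, the key query parameter, and the stopsOnCancel self-check are all hypothetical names added for illustration, and fetch is stubbed out.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

type Event struct{ Key string }

// Poller is a stand-in; the real fetch talks to an upstream.
type Poller struct{ interval time.Duration }

func (p *Poller) fetch(key string) Event { return Event{Key: key} }

// watch is the fixed watcher, with the tick interval made configurable.
func (p *Poller) watch(ctx context.Context, key string) <-chan Event {
	out := make(chan Event)
	go func() {
		defer close(out)
		ticker := time.NewTicker(p.interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				select {
				case out <- p.fetch(key):
				case <-ctx.Done():
					return
				}
			}
		}
	}()
	return out
}

// handleWatch streams events until the request context is cancelled.
// It could be mounted with http.HandleFunc("/watch", p.handleWatch).
func (p *Poller) handleWatch(w http.ResponseWriter, r *http.Request) {
	for ev := range p.watch(r.Context(), r.URL.Query().Get("key")) {
		if _, err := fmt.Fprintf(w, "%s\n", ev.Key); err != nil {
			return // returning cancels r.Context(), which stops the watcher
		}
		if f, ok := w.(http.Flusher); ok {
			f.Flush()
		}
	}
}

// stopsOnCancel reads one event, cancels, and reports whether the watcher
// closed its channel (meaning the goroutine returned) within a second.
func stopsOnCancel() bool {
	p := &Poller{interval: 5 * time.Millisecond}
	ctx, cancel := context.WithCancel(context.Background())
	ch := p.watch(ctx, "k")
	<-ch
	cancel() // the client "disconnects"
	deadline := time.After(time.Second)
	for {
		select {
		case _, ok := <-ch:
			if !ok {
				return true
			}
			// A send that raced with the cancel can still arrive; keep draining.
		case <-deadline:
			return false
		}
	}
}

func main() {
	fmt.Println("watcher stopped after cancel:", stopsOnCancel())
}
```

The handler never stops the watcher explicitly. Ranging over the channel is enough, because the watcher closes it once the request context is done.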

the guard rail I added afterwards

Fixing the bug is necessary but not sufficient, because the next leak will look just as innocent. So I added a cheap canary: a background goroutine that checks runtime.NumGoroutine() every minute and logs a warning, which I alert on, whenever the count crosses a threshold comfortably above normal steady state.

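// Runs for the life of the process; time.Tick gives no way to stop it.
go func() {
    for range time.Tick(time.Minute) {
        n := runtime.NumGoroutine()
        if n > 2000 { // comfortably above observed steady state
            log.Printf("goroutine count high: %d", n)
        }
    }
}()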

It's crude, and 2000 is a number I picked by watching the service behave for a day. But a leak now announces itself in minutes rather than hiding inside a memory graph for a week. The whole episode was a reminder that in Go, "memory leak" and "goroutine leak" are usually the same sentence, and the goroutine count is the one that points straight at the line of code. I'd been reading the wrong gauge the whole time.
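The canary itself runs for the life of the process, which is fine for a singleton. If it ever needs to stop, in tests or on shutdown, the same check can take a context. A sketch, assuming a hypothetical watchGoroutines helper with the interval, limit, and warning callback as parameters:

```go
package main

import (
	"context"
	"log"
	"runtime"
	"time"
)

// watchGoroutines checks the goroutine count every interval and calls warn
// when it exceeds limit. It returns as soon as ctx is cancelled.
func watchGoroutines(ctx context.Context, interval time.Duration, limit int, warn func(n int)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if n := runtime.NumGoroutine(); n > limit {
				warn(n)
			}
		}
	}
}

// canaryFires runs the canary with a limit of zero, so it must warn on the
// first tick, then cancels it and checks that it actually returned.
func canaryFires() bool {
	ctx, cancel := context.WithCancel(context.Background())
	fired := make(chan int, 1)
	done := make(chan struct{})
	go func() {
		defer close(done)
		watchGoroutines(ctx, time.Millisecond, 0, func(n int) {
			select {
			case fired <- n:
			default: // already reported once; drop the rest
			}
		})
	}()
	warned := false
	select {
	case <-fired:
		warned = true
	case <-time.After(time.Second):
	}
	cancel()
	select {
	case <-done:
	case <-time.After(time.Second):
		return false // the canary ignored cancellation
	}
	return warned
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	// Production-style wiring: same one-minute interval and 2000 threshold.
	go watchGoroutines(ctx, time.Minute, 2000, func(n int) {
		log.Printf("goroutine count high: %d", n)
	})
	log.Println("canary self-check passed:", canaryFires())
}
```

Taking the warning as a callback keeps the check testable. Whether that is worth the extra parameters for a ten-line canary is a judgment call.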

The deeper lesson is one I'll keep relearning until it sticks: a goroutine that can block on a channel without also watching a cancellation signal is a leak waiting for a slow week to reveal it. Start every goroutine by asking how it dies. If you can't answer that, you've already written the bug.