goroutines and the leak I didn't see

The service didn't crash. That was the maddening part. It ran for days, serving traffic perfectly, and the only symptom was that its memory use climbed in a slow, patient, utterly relentless line. Restart it and the line reset to the bottom and began climbing again. A sawtooth that only ever went up. Somewhere in there was a leak, and in a garbage-collected language a memory leak means something is still reachable that shouldn't be. The question is what's holding the reference.

In Go, very often, the answer is goroutines.

the symptom

The first useful number is the goroutine count, and Go gives it to you almost for free. If you've got pprof wired up, which you should, you can read it straight off:

$ curl -s 'localhost:6060/debug/pprof/goroutine?debug=1' | head -1
goroutine profile: total 48211

Forty-eight thousand goroutines on a service handling a few hundred requests a second is not a healthy number. A goroutine is cheap, starting at a couple of kilobytes of stack, but multiply a couple of kilobytes by forty-eight thousand and add whatever each one is holding onto and you have your slow climb. Goroutines that never finish are one of the most common memory leaks in Go, and they don't show up as the leak you expect, because the heap profile points at whatever the stuck goroutines are referencing, not at the goroutines themselves. The goroutine profile is the one that tells the truth.
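For anyone who hasn't wired pprof up yet, this is roughly what it takes. Treat it as a minimal sketch and not the service's actual setup: the port matches the curl command above, but startDebugServer and profileHeader are names invented for illustration, and a real service would go on to start its own listeners instead of printing one line and exiting.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
	"strings"
)

// startDebugServer (hypothetical helper) serves the default mux, and with it
// the pprof handlers, in the background. It returns the address it bound to.
func startDebugServer(addr string) (string, error) {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return "", err
	}
	go func() {
		// A nil handler means http.DefaultServeMux.
		log.Println(http.Serve(ln, nil))
	}()
	return ln.Addr().String(), nil
}

// profileHeader (hypothetical helper) returns the first line of the goroutine
// profile, the same line the curl command prints.
func profileHeader(addr string) (string, error) {
	resp, err := http.Get("http://" + addr + "/debug/pprof/goroutine?debug=1")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	line, err := bufio.NewReader(resp.Body).ReadString('\n')
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(line), nil
}

func main() {
	// Bind to localhost only: the profiles should not be publicly reachable.
	addr, err := startDebugServer("localhost:6060")
	if err != nil {
		log.Fatal(err)
	}
	header, err := profileHeader(addr)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(header)
}
```

The blank import is the whole trick. Serving the default mux on a separate, localhost-only port keeps the profiles away from public traffic.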

I watched the total over a few minutes. It only went up, and it went up in lockstep with request volume. So each request was leaving a goroutine behind.

the cause

Here is the shape of the offending code, simplified. A handler kicked off a background fetch and waited for it with a timeout:

func handle(ctx context.Context) (Result, error) {
    ch := make(chan Result)      // unbuffered

    go func() {
        ch <- fetch()            // blocks until someone receives
    }()

    select {
    case r := <-ch:
        return r, nil
    case <-ctx.Done():
        return Result{}, ctx.Err()   // timeout: we leave
    }
}

Read it slowly and the bug walks out to meet you. The channel is unbuffered. When the request times out, the select takes the ctx.Done() branch and the function returns. Nobody is left to receive from ch. But the goroutine inside is still running: fetch() carries on to completion, and then the goroutine sits on ch <- fetch(), blocked forever, because a send on an unbuffered channel only completes when a receiver shows up, and the receiver has gone home.

Every timed-out request left exactly one goroutine wedged on that send line, holding the channel, holding the result, holding whatever fetch() had allocated, until the heat death of the process. Under normal load the timeouts were rare, so the leak was slow. Under a bit of pressure, when more requests timed out, it climbed faster. Restarting cleared it, which is exactly the false comfort that lets a leak like this live for months.
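The leak is easy to reproduce in isolation. The sketch below is a stand-in, not the service code: slowFetch, leakyHandle and leakedAfter are hypothetical names, the sleep durations are arbitrary, and the only point is that every timed-out call leaves one goroutine behind.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

type Result struct{ Body string }

// slowFetch is a hypothetical stand-in for fetch(): it always takes longer
// than the caller is willing to wait.
func slowFetch() Result {
	time.Sleep(20 * time.Millisecond)
	return Result{Body: "late"}
}

// leakyHandle has the same shape as the handler above: unbuffered channel,
// background send, select with a timeout branch.
func leakyHandle(ctx context.Context) (Result, error) {
	ch := make(chan Result) // unbuffered
	go func() {
		ch <- slowFetch() // blocks forever once the caller has returned
	}()
	select {
	case r := <-ch:
		return r, nil
	case <-ctx.Done():
		return Result{}, ctx.Err()
	}
}

// leakedAfter runs n requests that all time out, then reports how many
// goroutines exist beyond the starting count.
func leakedAfter(n int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
		_, _ = leakyHandle(ctx)
		cancel()
	}
	// Give every slowFetch time to finish and reach the send.
	time.Sleep(100 * time.Millisecond)
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("goroutines left behind:", leakedAfter(100))
}
```

When every request times out, the number left behind matches the number of requests, which is the same lockstep the goroutine profile showed in production.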

the fix

The fix is to make the send not block, so the orphaned goroutine can always finish and be collected. The smallest change is a buffer of one:

ch := make(chan Result, 1)   // buffered: the send always completes

Now when the goroutine does ch <- fetch(), there's room in the buffer, the send completes immediately, the goroutine returns, and the garbage collector reclaims everything. The result sits unread in the buffered channel, the channel itself becomes unreferenced once handle returns, and the whole lot is collected. No leak.
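The same kind of throwaway harness shows the difference the buffer makes. Again a sketch under assumptions: slowFetch, handle and leftBehind are illustrative names, and the timings are arbitrary.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

type Result struct{ Body string }

// slowFetch is a hypothetical stand-in for fetch() that outlasts the timeout.
func slowFetch() Result {
	time.Sleep(20 * time.Millisecond)
	return Result{Body: "late"}
}

// handle is the leaky handler with the one-character fix applied.
func handle(ctx context.Context) (Result, error) {
	ch := make(chan Result, 1) // room for exactly one result
	go func() {
		ch <- slowFetch() // completes even if nobody is receiving
	}()
	select {
	case r := <-ch:
		return r, nil
	case <-ctx.Done():
		return Result{}, ctx.Err()
	}
}

// leftBehind runs n timed-out requests and reports how many goroutines
// remain beyond the starting count once the fetches have had time to finish.
func leftBehind(n int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
		_, _ = handle(ctx)
		cancel()
	}
	time.Sleep(100 * time.Millisecond)
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("goroutines left behind:", leftBehind(100))
}
```

The buffer only needs to be as large as the number of sends the goroutine will ever make. Here that is one, so one slot is enough.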

The more disciplined version uses context properly so the work itself stops early, not just the waiting:

func handle(ctx context.Context) (Result, error) {
    ch := make(chan Result, 1)
    go func() {
        ch <- fetchCtx(ctx)   // fetch respects cancellation
    }()
    select {
    case r := <-ch:
        return r, nil
    case <-ctx.Done():
        return Result{}, ctx.Err()
    }
}

If fetchCtx honours the context, then when the request times out the fetch is told to give up, returns promptly, sends into the buffered channel, and the goroutine exits in good order. That's the correct answer: don't just stop waiting for the work, stop the work.
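What "honours the context" looks like depends on what the fetch does. For an HTTP call it usually means building the request with http.NewRequestWithContext. The sketch below fakes the slow work with a timer so it runs without a network; this fetchCtx is a hypothetical stand-in, and goroutineGoneAfterTimeout exists only to check the claim.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

type Result struct{ Body string }

// fetchCtx is a hypothetical context-aware fetch. The timer stands in for
// two seconds of real work; cancellation cuts it short.
func fetchCtx(ctx context.Context) Result {
	work := time.NewTimer(2 * time.Second)
	defer work.Stop()
	select {
	case <-work.C:
		return Result{Body: "payload"}
	case <-ctx.Done():
		return Result{} // cancelled: give up and return promptly
	}
}

// handle is the disciplined handler from above, unchanged.
func handle(ctx context.Context) (Result, error) {
	ch := make(chan Result, 1)
	go func() {
		ch <- fetchCtx(ctx)
	}()
	select {
	case r := <-ch:
		return r, nil
	case <-ctx.Done():
		return Result{}, ctx.Err()
	}
}

// goroutineGoneAfterTimeout times one request out after 10ms and checks,
// well before the 2s of simulated work would have ended, that the background
// goroutine has already exited.
func goroutineGoneAfterTimeout() bool {
	before := runtime.NumGoroutine()
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	if _, err := handle(ctx); err == nil {
		return false // the request was supposed to time out
	}
	time.Sleep(50 * time.Millisecond)
	return runtime.NumGoroutine() <= before
}

func main() {
	fmt.Println("background goroutine exited early:", goroutineGoneAfterTimeout())
}
```

With the plain buffered fix the goroutine would have lived out the full two seconds before exiting. Here it is gone within milliseconds of the timeout, and the upstream work stops with it.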

what I took away

The buffered-channel-of-one is a genuinely useful pattern for exactly this spawn-and-wait case, and I now reach for it by default whenever a goroutine sends a single result back to a caller that might walk away. But the bigger habit it cemented is watching the goroutine count. The number should be roughly flat under steady load. If it climbs and never comes down, you have goroutines that started and never returned, and somewhere there's a channel send or receive with no partner, or a range over a channel nobody closes, or a select with no exit.

$ watch -n5 'curl -s "localhost:6060/debug/pprof/goroutine?debug=1" | head -1'

Leave that running next to your service for a bit. It costs nothing and it tells you immediately whether you're leaking. A flat line is a healthy program. A staircase is a bug you haven't found yet.
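The same check can be done in-process, which makes it something a test can assert on. This is a rough sketch: leaks, settlesTo and abandon are names made up here, the half-second grace period is arbitrary, and counting goroutines this way is only reliable when nothing else in the process is starting or stopping goroutines at the same time.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// settlesTo polls until the goroutine count is at most want, or gives up
// once the grace period has passed.
func settlesTo(want int, within time.Duration) bool {
	deadline := time.Now().Add(within)
	for time.Now().Before(deadline) {
		if runtime.NumGoroutine() <= want {
			return true
		}
		time.Sleep(5 * time.Millisecond)
	}
	return runtime.NumGoroutine() <= want
}

// leaks calls fn n times and reports whether the goroutine count fails to
// return to where it started.
func leaks(n int, fn func()) bool {
	base := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		fn()
	}
	return !settlesTo(base, 500*time.Millisecond)
}

// abandon returns a function that starts a goroutine sending one value on a
// channel with the given buffer size, then walks away without receiving.
func abandon(buffer int) func() {
	return func() {
		ch := make(chan int, buffer)
		go func() { ch <- 1 }()
	}
}

func main() {
	fmt.Println("unbuffered leaks:", leaks(20, abandon(0)))
	fmt.Println("buffered leaks:", leaks(20, abandon(1)))
}
```

A check like this, wrapped around a handler call that is forced to time out, is the kind of test that would have flagged the original code.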

The annoying truth is that the original code passed every test I had. It worked. It returned the right answers and respected its timeout. It just also left a little ghost behind every time it timed out, and ghosts, given a few days, fill the house.