the goroutines that wouldn't die

The service had been fine for months. Then someone added a feature, and over the following week the memory graph started its slow climb. Not a spike, nothing dramatic, just a steady upward drift that meant the pod got OOM-killed roughly every eighteen hours. Restarting reset the clock, which is the worst kind of bug because it lets you pretend it isn't happening.

The first thing I checked was the obvious one: heap profiles. go tool pprof on the heap showed nothing exciting. A bit of growth in the usual suspects, buffers and caches, but nothing that accounted for the curve I was looking at. That's usually the tell. If the heap looks innocent and memory keeps climbing, the leak isn't in the data you're holding. It's in the things holding the data.

counting goroutines

The metric I should have looked at first was goroutine count. We export it, I just hadn't thought to graph it. When I did, there it was: a line marching upward in perfect lockstep with the memory curve. Every request was leaving something behind.
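How the count gets exported depends on your metrics stack, and nothing here says which one this service uses. As a minimal sketch, assuming only the standard library, the number comes from runtime.NumGoroutine and can be published through expvar. The metric name "goroutines" and the helper goroutineCount are made up for illustration.

```go
package main

import (
	"expvar"
	"fmt"
	"runtime"
)

// goroutineCount returns the number of goroutines that currently exist.
func goroutineCount() int {
	return runtime.NumGoroutine()
}

func init() {
	// "goroutines" is a hypothetical metric name. expvar calls the
	// function on every read, so the value is always current.
	expvar.Publish("goroutines", expvar.Func(func() any {
		return goroutineCount()
	}))
}

func main() {
	// Importing expvar also registers a /debug/vars handler on
	// http.DefaultServeMux; here the variable is simply read directly.
	fmt.Println("goroutines:", expvar.Get("goroutines").String())
}
```

Graph that value next to memory and a leak of this kind shows up as two lines rising together.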

You can confirm this without leaving the box. The runtime will dump every goroutine and its stack on demand:

import _ "net/http/pprof"

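Then, with the process serving http.DefaultServeMux on localhost:6060 (the import only registers the handlers, it does not start a listener):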

curl -s 'http://localhost:6060/debug/pprof/goroutine?debug=2' > goroutines.txt
wc -l goroutines.txt                  # how big the dump is
grep -c '^goroutine ' goroutines.txt  # how many goroutines are in it

The file was enormous, and the same stack appeared thousands of times. All of them were parked in the same place, blocked on a channel send. Once you see the same frame repeated a few thousand times, the diagnosis writes itself.
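If you would rather not eyeball a dump that size, the repetition can be counted mechanically. The sketch below is one way to do it, not the tool I actually used: it tallies goroutines by the wait state in each header line of a debug=2 dump, which looks like "goroutine 42 [chan send, 3 minutes]:". The function name tallyStates is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strings"
)

// tallyStates counts goroutines per wait state in a debug=2 dump.
// Each goroutine starts with a header such as
// "goroutine 42 [chan send, 3 minutes]:".
func tallyStates(dump string) map[string]int {
	counts := make(map[string]int)
	for _, line := range strings.Split(dump, "\n") {
		if !strings.HasPrefix(line, "goroutine ") {
			continue // stack frame or blank line, not a header
		}
		open := strings.Index(line, "[")
		end := strings.LastIndex(line, "]")
		if open < 0 || end < open {
			continue
		}
		state := line[open+1 : end]
		// Drop the optional ", 3 minutes" suffix so durations
		// do not split one state into many buckets.
		if i := strings.Index(state, ","); i >= 0 {
			state = state[:i]
		}
		counts[state]++
	}
	return counts
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: tally goroutines.txt")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	counts := tallyStates(string(data))
	states := make([]string, 0, len(counts))
	for s := range counts {
		states = append(states, s)
	}
	// Most common state first.
	sort.Slice(states, func(i, j int) bool {
		return counts[states[i]] > counts[states[j]]
	})
	for _, s := range states {
		fmt.Printf("%6d  %s\n", counts[s], s)
	}
}
```

A healthy service tends to show a spread of states. Thousands of goroutines in "chan send" is the kind of skew that points at a leak like this one.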

the actual mistake

The new feature spun up a goroutine per request to do some work in the background and report the result back on a channel. Simplified, it looked like this:

func handle(ctx context.Context) Result {
    ch := make(chan Result)
    go func() {
        ch <- doWork() // blocks forever once nobody is receiving
    }()

    select {
    case r := <-ch:
        return r
    case <-ctx.Done():
        return Result{} // we leave, the goroutine doesn't
    }
}

Read that select carefully. When the context is cancelled, because the client has gone away or the timeout fired, we return immediately. Good. Except the goroutine we started is still running, and when doWork finishes it will try to send on ch. Nobody is reading ch anymore, and the channel is unbuffered. So the send blocks forever, the goroutine never exits, and its stack and every variable the closure captured stay alive with it.

Under normal load most requests completed before the timeout, so the leak was small and slow. The new feature happened to be the one with a generous upstream timeout, so cancellations were common enough to make the drift visible. That's why it took a week to show up and why nothing in the heap looked wrong: the leaked memory was stacks and captured closures, not anything that profiled cleanly as "your data".
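The leak is easy to reproduce in isolation. The sketch below is a stand-in, not the service code: Result has a made-up field, doWork just sleeps, and leakedAfter is a hypothetical helper that fires n requests with an already-cancelled context and reports how many goroutines are left behind.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// Result is a placeholder; the real type is not shown in this post.
type Result struct{ Value int }

// doWork is a hypothetical stand-in for the background work.
func doWork() Result {
	time.Sleep(10 * time.Millisecond)
	return Result{Value: 1}
}

// handle has the same shape as the leaky handler: an unbuffered
// channel and a receiver that can walk away on cancellation.
func handle(ctx context.Context) Result {
	ch := make(chan Result)
	go func() {
		ch <- doWork() // parks here once the receiver is gone
	}()

	select {
	case r := <-ch:
		return r
	case <-ctx.Done():
		return Result{}
	}
}

// leakedAfter runs n requests whose context is already cancelled and
// returns how much the goroutine count grew.
func leakedAfter(n int) int {
	before := runtime.NumGoroutine()
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // every request takes the ctx.Done() branch
	for i := 0; i < n; i++ {
		handle(ctx)
	}
	// Give doWork time to finish so the goroutines reach the send.
	time.Sleep(100 * time.Millisecond)
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("goroutines left behind:", leakedAfter(100))
}
```

Each cancelled request should leave one goroutine parked on the send, so the number printed grows with n and never comes back down.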

the fix, and the habit

The fix is about as small as a change gets: give the send somewhere to go.

ch := make(chan Result, 1) // buffered, so the send never blocks

With a buffer of one and exactly one send, the goroutine can always complete that send even if we've already walked away. It writes its result into the buffer and returns, and the garbage collector reclaims everything. The result we never read is collected along with the channel.
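A quick way to convince yourself is to cancel a batch of requests and check that the goroutine count returns to where it started. This is a sketch under the same assumptions as any toy reproduction: Result's field, the sleeping doWork and the leakedAfter helper are all hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// Result is a placeholder; the real type is not shown in this post.
type Result struct{ Value int }

// doWork is a hypothetical stand-in for the background work.
func doWork() Result {
	time.Sleep(10 * time.Millisecond)
	return Result{Value: 1}
}

// handle is the handler with the one-slot buffer applied.
func handle(ctx context.Context) Result {
	ch := make(chan Result, 1) // room for the single send
	go func() {
		ch <- doWork() // completes even if nobody ever receives
	}()

	select {
	case r := <-ch:
		return r
	case <-ctx.Done():
		return Result{}
	}
}

// leakedAfter runs n requests whose context is already cancelled and
// returns how much the goroutine count grew.
func leakedAfter(n int) int {
	before := runtime.NumGoroutine()
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	for i := 0; i < n; i++ {
		handle(ctx)
	}
	// Give the background goroutines time to send and exit.
	time.Sleep(100 * time.Millisecond)
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("goroutines left behind:", leakedAfter(100))
}
```

The buffer only needs to be as large as the number of sends the goroutine will ever make, which here is one.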

The other valid shape is to make the goroutine respect the context itself and select on ctx.Done() for its send too, but for fire-and-forget work where you only sometimes want the answer, a buffer of one is the simpler and correct answer. The rule I now repeat to myself: if you start a goroutine that sends on a channel, ask who guarantees the receive happens. If the answer is "the happy path does", you have a leak waiting for a sad path.
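For comparison, here is roughly what the context-aware shape looks like. It is a sketch rather than code from the service: Result's field, doWork and leakedAfter are hypothetical, and doWork itself still runs to completion because it takes no context.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// Result is a placeholder; the real type is not shown in this post.
type Result struct{ Value int }

// doWork is a hypothetical stand-in for the background work.
func doWork() Result {
	time.Sleep(10 * time.Millisecond)
	return Result{Value: 1}
}

// handle keeps the channel unbuffered but lets the sender give up
// when the context is done.
func handle(ctx context.Context) Result {
	ch := make(chan Result)
	go func() {
		r := doWork()
		select {
		case ch <- r: // the receiver is still waiting
		case <-ctx.Done(): // the receiver has left; drop the result
		}
	}()

	select {
	case r := <-ch:
		return r
	case <-ctx.Done():
		return Result{}
	}
}

// leakedAfter runs n requests whose context is already cancelled and
// returns how much the goroutine count grew.
func leakedAfter(n int) int {
	before := runtime.NumGoroutine()
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	for i := 0; i < n; i++ {
		handle(ctx)
	}
	// Give the background goroutines time to notice and exit.
	time.Sleep(100 * time.Millisecond)
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("goroutines left behind:", leakedAfter(100))
}
```

This version only stays leak-free if the context is eventually cancelled on every path, typically with a deferred cancel in the caller. The buffered channel has no such condition, which is part of why it is the simpler choice here.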

I added a goroutine-count alert that afternoon. It's a cheap signal and it would have caught this on day one instead of day seven. Memory graphs tell you that you're bleeding. Goroutine graphs tell you where the wound is.