the goroutines that never came home


The symptom was boring. A small internal service, the kind nobody thinks about, was using a bit more memory every day. Not a spike, not a crash, just a slow creep upward that the restart on each deploy had been politely hiding for months. Then we slowed our release cadence down, the process ran for nine days straight, and the graph finally had room to tell the truth.

Memory was the red herring. The real number was goroutine count, and it only ever went up.

If you have a pprof endpoint wired in, this is a thirty-second diagnosis. I did not, the first time, so I added one.

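import (
    "log"
    "net/http"
    _ "net/http/pprof" // side-effect import: registers /debug/pprof/* on http.DefaultServeMux
)

// Somewhere in main, before the service starts handling traffic:
go func() {
    log.Println(http.ListenAndServe("localhost:6060", nil))
}()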

Then run go tool pprof http://localhost:6060/debug/pprof/goroutine and type top at the prompt. Hundreds of goroutines all parked in the same place, all blocked on a channel send that was never going to complete.


The code was the textbook mistake, which is somehow always more embarrassing than an exotic one. A function fanned out work, collected results on a channel, and used a context with a timeout so the caller never waited too long:

func fetchAll(ctx context.Context, ids []string) []Result {
    out := make(chan Result)
    for _, id := range ids {
        go func(id string) {
            out <- fetch(id) // blocks forever once the receiver has returned
        }(id)
    }

    var results []Result
    for range ids {
        select {
        case r := <-out:
            results = append(results, r)
        case <-ctx.Done():
            return results // and here is the leak
        }
    }
    return results
}

When the context fired, the caller returned. But out was unbuffered, and any goroutine that had not yet delivered its result was now blocked on out <- fetch(id) with nobody left to receive. Those goroutines never exit. Every timed-out request leaked a handful of them, and each one, small as it was, held a closure, a connection, a buffer. Days of that add up to a graph that goes one way.
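The leak is easy to reproduce in isolation. This is a sketch under assumptions: Result and fetch are stand-ins I invented (the real fetch talked to a backend), fetchAllLeaky is the function above renamed, and leakedAfter is a hypothetical helper that compares runtime.NumGoroutine before and after a batch of timed-out calls.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// Result is a hypothetical stand-in for the service's real result type.
type Result struct{ ID string }

// fetch simulates a backend call that is slower than the caller's timeout.
func fetch(id string) Result {
	time.Sleep(50 * time.Millisecond)
	return Result{ID: id}
}

// fetchAllLeaky has the same shape as the function in the post.
func fetchAllLeaky(ctx context.Context, ids []string) []Result {
	out := make(chan Result) // unbuffered: a send needs a live receiver
	for _, id := range ids {
		go func(id string) {
			out <- fetch(id)
		}(id)
	}
	var results []Result
	for range ids {
		select {
		case r := <-out:
			results = append(results, r)
		case <-ctx.Done():
			return results // workers still in flight are abandoned here
		}
	}
	return results
}

// leakedAfter runs n timed-out calls and returns how many more goroutines
// exist afterwards than before.
func leakedAfter(n int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Millisecond)
		fetchAllLeaky(ctx, []string{"a", "b", "c"})
		cancel()
	}
	time.Sleep(200 * time.Millisecond) // let every fetch finish and reach the send
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("goroutines leaked:", leakedAfter(10))
}
```

Because every call here times out before any fetch finishes, each call should strand all three of its workers. Changing the make call to a buffered channel sized len(ids) is the quickest way to watch the number drop back toward zero.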

The fix is dull, which is the point. Give the channel enough buffer that a send never blocks, so a goroutine can always finish and exit even when nobody is listening:

out := make(chan Result, len(ids))

That alone stops the leak. Once fetch returns, the send always completes, the goroutine always exits, and the garbage collector does the rest. The work itself still runs to completion after the caller has given up, though. You can do it more carefully by threading the context into fetch and selecting on ctx.Done() alongside the send, which is the right answer when fetch is expensive and you genuinely want to cancel it. For this service the buffer was enough, and enough is a perfectly good place to stop.

The lesson I actually took away was not about channels. It was that "memory slowly rising" is a symptom with a dozen causes, and goroutine count is the cheaper first thing to look at in a Go program. A goroutine that blocks forever costs you a slot on a graph long before it costs you an OOM. Now the goroutine profile is the first dashboard I open, and a flat line there is a small daily reassurance that nothing is quietly waiting for a message that will never come.