This one took a week, and the annoying part is that the actual bug is four lines and the fix is one. The week went into everything around it: noticing, doubting, measuring, and finally believing the numbers. I'm writing the whole thing out because the bug itself is boring and well documented, but the process of cornering it is the bit that's actually worth keeping.
The symptom
We had a service that did request fan-out. Each incoming request would spin up a handful of goroutines to talk to downstream systems, gather what came back, and respond. Standard stuff. It had run fine for months.
Then someone noticed the memory graph. Not a spike, nothing dramatic. A gentle slope upward, day after day, the kind you only see if you zoom the dashboard out to a fortnight. Restart the pod and it dropped to baseline, then started climbing again. A textbook leak, and textbook leaks in a garbage-collected language are usually one of two things: something is holding references it shouldn't, or goroutines aren't exiting.
The second metric told us which. We export runtime.NumGoroutine() as a gauge, and that line was climbing in perfect lockstep with the memory. It never came back down. The garbage collector can't reclaim a goroutine that never returns, so each one is a small, permanent leak: its stack, anything it captured, anything it's blocked on. Multiply by a steady trickle of requests and you get exactly that patient upward slope.
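For anyone who doesn't have that gauge yet, here is a minimal sketch of one. It assumes the standard library's expvar package rather than whatever metrics client the service really used, and the metric name "goroutines" is made up for the example.

```go
package main

import (
	"expvar"
	"fmt"
	"runtime"
)

// goroutineGauge returns a lazily evaluated metric: every read calls
// runtime.NumGoroutine, so the reported value is always current.
func goroutineGauge() expvar.Func {
	return expvar.Func(func() any { return runtime.NumGoroutine() })
}

func main() {
	// "goroutines" is a hypothetical metric name. Publish panics if the
	// name is already registered, so call it once at startup.
	expvar.Publish("goroutines", goroutineGauge())

	// Importing expvar also registers a /debug/vars handler on
	// http.DefaultServeMux; here the value is simply read back directly.
	fmt.Println("goroutines:", expvar.Get("goroutines").String())
}
```

The point is only that the gauge costs a few lines. Any metrics library that accepts a callback or a periodic sample would do the same job.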
Doubting the obvious
My first theory was wrong, which is normal. I assumed a downstream call was hanging, that we were waiting on a slow dependency and piling up goroutines stuck in a network read. That has a clean signature, so I went looking for it, and it wasn't there. Latency to every downstream was healthy. Nothing was timing out. The goroutines weren't stuck in the network, they were stuck somewhere in our own code, which is both more embarrassing and more interesting.
pprof, finally
I should have reached for pprof on day one instead of day three. We already had the endpoint registered, so it cost nothing but the habit of remembering it exists.
import _ "net/http/pprof"
The goroutine profile is the one that matters here, and you want the full dump with stacks, not the summarised view:
curl -s 'http://localhost:6060/debug/pprof/goroutine?debug=2' > goroutines.txt
That file is gold. Every live goroutine, its full stack, and crucially how long it has been blocked and on what. I sorted by the blocking duration and the picture was immediate. Thousands of goroutines, all parked in the same place, all blocked on a channel send, all of them many minutes old and getting older.
goroutine 48213 [chan send, 9 minutes]:
main.(*fetcher).gather.func1(...)
	/app/internal/fetch/fetch.go:88 +0x21c
Line 88. A channel send. Thousands of them, waiting to write into a channel nobody was reading any more.
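If eyeballing a dump that size is painful, a few lines of code can do the tallying. The sketch below is not the sort I ran at the time: it groups goroutines by blocking state instead of by duration, and it takes the dump in-process through runtime/pprof so the example is self-contained. The helper names countByState and dumpGoroutines are invented for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
	"runtime/pprof"
	"sort"
)

// header matches lines such as "goroutine 48213 [chan send, 9 minutes]:"
// and captures the state ("chan send"), ignoring any duration after it.
var header = regexp.MustCompile(`(?m)^goroutine \d+ \[([^,\]]+)`)

// countByState tallies the goroutines in a debug=2 style dump by state.
func countByState(dump string) map[string]int {
	counts := make(map[string]int)
	for _, m := range header.FindAllStringSubmatch(dump, -1) {
		counts[m[1]]++
	}
	return counts
}

// dumpGoroutines writes the goroutine profile at debug level 2, the
// same level requested from the HTTP endpoint with ?debug=2.
func dumpGoroutines() (string, error) {
	var buf bytes.Buffer
	err := pprof.Lookup("goroutine").WriteTo(&buf, 2)
	return buf.String(), err
}

func main() {
	dump, err := dumpGoroutines()
	if err != nil {
		panic(err)
	}
	counts := countByState(dump)

	// Print the most common states first.
	states := make([]string, 0, len(counts))
	for s := range counts {
		states = append(states, s)
	}
	sort.Slice(states, func(i, j int) bool {
		return counts[states[i]] > counts[states[j]]
	})
	for _, s := range states {
		fmt.Printf("%6d  %s\n", counts[s], s)
	}
}
```

Fed the saved goroutines.txt instead of an in-process dump, a tally like this would put "chan send" at the top by a wide margin, which is the same picture the manual sort gave.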
The actual bug
Here's the shape of the offending code, simplified:
func (f *fetcher) gather(ctx context.Context, ids []string) []result {
	out := make(chan result)
	for _, id := range ids {
		go func(id string) {
			r := f.fetchOne(ctx, id)
			out <- r // line 88
		}(id)
	}
	var results []result
	for range ids {
		select {
		case r := <-out:
			results = append(results, r)
		case <-ctx.Done():
			return results // <- here is the problem
		}
	}
	return results
}
Read it slowly. We launch one goroutine per id. Each one fetches, then sends its result into the unbuffered out channel. The parent loops, collecting results, but it also watches ctx.Done(). When the context is cancelled, usually because the caller hit its deadline, the parent returns early.
The moment it returns, nobody is reading out any more. But the worker goroutines don't know that. They finish their fetch (or give up on it, if fetchOne honours the context) and then sit forever at out <- r, blocked on a send into a channel with no receiver. Cancellation can't rescue them there, because a bare channel send doesn't watch the context; it only waits for a receiver. Every cancelled request that had work still in flight left behind a goroutine that would never, ever exit.
On a quiet day, with few timeouts, the leak was a trickle. It only became visible because traffic and timeout rates crept up over a couple of weeks, and the trickle became a stream.
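The leak is easy to reproduce in isolation. The following is a sketch rather than the service code: fetchOne is replaced by a caller-supplied function so the timing can be controlled, the result type is a stand-in, and leakedAfterCancel is a hypothetical helper that counts what is left behind.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// result is a stand-in for the real result type.
type result struct{ id string }

// gather has the same shape as the leaky version, with the fetch
// passed in so the demo can decide when workers finish.
func gather(ctx context.Context, ids []string, fetch func(context.Context, string) result) []result {
	out := make(chan result) // unbuffered: every send needs a live receiver
	for _, id := range ids {
		go func(id string) {
			out <- fetch(ctx, id) // parks forever once gather has returned
		}(id)
	}
	var results []result
	for range ids {
		select {
		case r := <-out:
			results = append(results, r)
		case <-ctx.Done():
			return results
		}
	}
	return results
}

// leakedAfterCancel runs gather with an already cancelled context and
// reports how many extra goroutines are still alive afterwards.
func leakedAfterCancel(n int) int {
	ids := make([]string, n)
	for i := range ids {
		ids[i] = fmt.Sprint(i)
	}

	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the caller's deadline having already passed

	release := make(chan struct{})
	fetch := func(_ context.Context, id string) result {
		<-release // hold every worker until the parent has gone
		return result{id: id}
	}

	before := runtime.NumGoroutine()
	gather(ctx, ids, fetch) // returns at once via ctx.Done()
	close(release)          // workers now finish "fetching" and try to send

	// Plenty of time for a healthy worker to exit; these never do.
	time.Sleep(100 * time.Millisecond)
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("leaked goroutines:", leakedAfterCancel(10))
}
```

With nothing else running in the process, this should report one leaked goroutine per id, and the number would not go down however long the program waited before counting.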
Two ways to fix it
There are two honest fixes, and they're not quite equivalent.
The blunt one is to buffer the channel so every worker can always complete its send, whether the parent is still listening or not:
out := make(chan result, len(ids))
Now out <- r never blocks, the workers always finish, and the goroutines always exit. This works, and it's a perfectly reasonable fix for fan-out where the number of results is bounded and known. It's the one I shipped first, because it's small and obviously correct, and the goroutine count flattened the moment it went out.
The more correct fix is to make the workers honour cancellation on the send as well, so they give up if the parent has gone away:
go func(id string) {
	r := f.fetchOne(ctx, id)
	select {
	case out <- r:
	case <-ctx.Done():
	}
}(id)
This matters when the result count isn't bounded, or when you can't size the buffer sensibly, because a buffer only fixes the leak if it has room for every send that might arrive after the parent has gone. The select version doesn't care how many results there are. If the parent has gone, the worker drops its result and exits. I ended up using both: the buffer for the predictable fan-out, the select for the streaming path elsewhere in the codebase that had the same latent bug.
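For the unbounded case, here is a sketch of the same pattern on a streaming producer. It is not the code from that streaming path; produce and consumeSome are hypothetical names, and the exited channel exists only so the example can observe that the producer really returned.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// produce streams an unbounded sequence into out until ctx is cancelled.
// It closes exited on the way out so callers can see that it returned.
func produce(ctx context.Context, out chan<- int, exited chan<- struct{}) {
	defer close(exited)
	for i := 0; ; i++ {
		select {
		case out <- i: // a receiver took the value
		case <-ctx.Done():
			return // the consumer has gone: drop the value and exit
		}
	}
}

// consumeSome reads the first n values, then cancels and walks away.
// It reports what it read and whether the producer exited afterwards.
func consumeSome(n int) ([]int, bool) {
	ctx, cancel := context.WithCancel(context.Background())
	out := make(chan int) // unbuffered on purpose: no buffer size would be enough
	exited := make(chan struct{})
	go produce(ctx, out, exited)

	got := make([]int, 0, n)
	for i := 0; i < n; i++ {
		got = append(got, <-out)
	}
	cancel()

	select {
	case <-exited:
		return got, true
	case <-time.After(time.Second):
		return got, false
	}
}

func main() {
	got, exited := consumeSome(3)
	fmt.Println("received:", got, "producer exited:", exited)
}
```

The channel here is unbuffered and the producer never runs out of values, so there is no capacity that would save it. The select on the send is the only thing that gives it a way out.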
What I took from it
A few things stuck.
The goroutine count gauge earned its place permanently. A flat line is reassuring, a climbing one is an early warning you get days before memory pressure forces your hand. It's cheap to export and I now consider a service without it slightly undressed.
pprof's goroutine?debug=2 is the single most useful command I know for this class of bug, and I keep talking myself out of using it early because it feels heavyweight. It isn't. It's a curl and a text file, and it tells you exactly where every goroutine is sleeping.
And the design lesson, the one I clearly hadn't fully absorbed despite knowing it in the abstract: every goroutine you start needs a guaranteed path to exit. Not a likely path, a guaranteed one. If a goroutine can block on a send, a receive, a lock, or a network read, ask what cancels it. If the answer is "nothing, as long as the happy path holds", you've written a leak and you just haven't measured it yet.
A week of work, one line of fix, and a gauge I'll never delete. I can live with that trade, mostly because the alternative is finding out the same lesson again next year.