The service had been up for nine days when the alert fired. Resident memory had crept from a healthy 80 MB to just over a gigabyte, and the line on the graph was a near-perfect diagonal. No spikes, no sawtooth from the garbage collector reclaiming anything. Just a slope. That shape is the tell. A leak that ramps smoothly and never recovers is rarely ordinary heap data growing; it's something the runtime can't collect because something is still holding on.
In Go, the thing still holding on is usually a goroutine.
Goroutines are cheap, but they are not free
The pitch for goroutines is that they cost a couple of kilobytes of stack and you can have hundreds of thousands of them. All true. The part that gets glossed over is that a goroutine which blocks forever is a goroutine that lives forever, and everything on its stack lives with it. Every variable it closed over, every buffer it allocated, every channel it's parked on. The GC will not touch a live goroutine, and a goroutine blocked on a channel send or receive is, as far as the scheduler is concerned, perfectly alive. It's just waiting. Patiently. Forever.
My leak was a worker pool. The design looked sensible enough on the day I wrote it: a function that took a slice of jobs, fanned them out across some workers, collected the results, and returned. The skeleton was roughly this.
func process(jobs []Job) []Result {
	out := make(chan Result)
	for _, j := range jobs {
		go func(j Job) {
			out <- doWork(j)
		}(j)
	}
	var results []Result
	for i := 0; i < len(jobs); i++ {
		results = append(results, <-out)
	}
	return results
}
Read that and it looks fine. It even works, most of the time. The bug is that out is unbuffered and I only ever receive len(jobs) values from it. As long as every worker finishes and every result gets read, the books balance. The moment they don't, you've leaked.
Where it actually went wrong
The real code wasn't quite this. The real code had a context with a timeout wrapped around the whole thing, and a select in the collector that returned early if the context was cancelled.
for i := 0; i < len(jobs); i++ {
	select {
	case r := <-out:
		results = append(results, r)
	case <-ctx.Done():
		return results, ctx.Err()
	}
}
That early return is the leak. When the context times out, the collector bails. But the workers are still running. They finish doWork, they reach out <- result, and because nobody is receiving anymore, they block on that send. Forever. Every timed-out request left a fistful of goroutines parked on a send to a channel with no reader, each one pinning its job data and its result in memory.
Nine days of occasional timeouts and there's your gigabyte.
The thing that makes this nasty is that it's invisible in the happy path. Every test passed. Local runs were clean because nothing timed out on a developer's laptop hitting a local stub. It only showed up under real latency, at low frequency, over days. The unit of failure was small and the time to symptom was long, which is the worst combination for noticing.
Finding it
The good news is that Go ships the tools to catch this and they're genuinely excellent. I'd already got net/http/pprof mounted on an internal port, so the first thing I did was pull the goroutine profile from the live process.
go tool pprof http://localhost:6060/debug/pprof/goroutine
Then top inside the interactive prompt. The answer was right there at the top of the list: tens of thousands of goroutines, all parked at the same line, the channel send in my worker. You don't need to guess at this stuff. The runtime knows exactly where every goroutine is stuck and it will tell you if you ask. I'd add expvar or a periodic runtime.NumGoroutine() log to anything long-lived now, purely so the count is on a dashboard before it becomes a 3am page.
The fix, and the rule I took from it
The fix is to make the send cancellable too, so a worker that can't deliver its result gives up instead of blocking.
go func(j Job) {
	select {
	case out <- doWork(j):
	case <-ctx.Done():
	}
}(j)
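Now when the context is done, the collector returns and the workers return too, because their select has a second way out: the ctx.Done() case. Nothing is left parked. You could also give out a buffer of len(jobs), which means every send can complete whether or not anyone's listening, and the orphaned workers run to completion and exit. Both work. I went with the select because it puts the exit condition right where the worker could block. One caveat: doWork(j) is evaluated before the select starts waiting, so the select by itself stops the wasted memory, not the wasted work. To cut the work short as well, doWork has to take the context and check it.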
The rule I wrote on a sticky note and stuck to the side of my monitor: every goroutine needs a defined way to stop. If you can't say, in one sentence, what causes a goroutine to return, you have a leak waiting for the right Tuesday. A for { select } loop with no ctx.Done() case, a send to a channel nobody is guaranteed to read, a receive from a channel nobody is guaranteed to close: all the same bug wearing different clothes.
Goroutines really are cheap. Forgetting to stop them is not.