The service did not crash. That was the problem. A crash you can catch, a crash leaves you a core and a timestamp and someone awake at 3am with a grudge. This thing just got slower and fatter over about a week, and then somebody restarted it because that is what you do, and the clock reset, and nobody wrote it down. I only noticed because I happened to graph go_goroutines over a fortnight and the line had the unmistakable shape of a thing that goes up and never comes down.
Up and to the right is lovely on a revenue chart. On a goroutine count it means you have built a sieve.
what it was meant to do
The worker did something unremarkable. A request came in, it fanned out to a handful of downstream calls, collected the results, and returned the first useful one. The classic shape:
func query(ctx context.Context, hosts []string) (Result, error) {
results := make(chan Result)
for _, h := range hosts {
go func(h string) {
r, err := fetch(h)
if err != nil {
return
}
results <- r
}(h)
}
select {
case r := <-results:
return r, nil
case <-ctx.Done():
return Result{}, ctx.Err()
}
}
Read it quickly and it looks fine. Fan out, take the first one home, respect the context. I wrote something very close to this and felt quite pleased with myself, which in hindsight should have been the first warning.
The bug is the channel. results is unbuffered. The select takes exactly one value off it, the winner. Every other goroutine that managed to fetch successfully then sits forever on results <- r, because there is no longer anybody reading. One request, several hosts, one reader: the rest of the senders block on the send and never return. They are not deadlocked in any way the runtime will complain about. They are just politely waiting for a reader that went home hours ago.
So every request leaked roughly len(hosts) - 1 goroutines, each holding its own little stack and whatever Result it was clutching. At a few requests a second that adds up to the kind of slope I eventually noticed on the graph. The memory was the goroutine stacks and the captured results, not some exotic heap fragmentation. It was just thousands of small things that should have ended and didn't.
What made it hard to spot is that the function worked perfectly. The caller got the right answer, fast, every single time. There was no error, no slow request, no failing test. The only symptom lived entirely outside the request that caused it, in a metric on a dashboard nobody was watching, accruing a fraction of a megabyte at a time. The bug had no victim until it had all of them at once, which is the worst shape a bug can have, because by the time it's visible it's already been happening for a week.
finding it
pprof made it embarrassingly obvious once I actually looked.
go tool pprof http://localhost:6060/debug/pprof/goroutine
Then top, and there it was: a few thousand goroutines all parked in the same line, all on a channel send, all in query. When the same line of your own code appears two thousand times in a goroutine dump, you do not need a hypothesis. You have a confession.
I had assumed, lazily, that a goroutine which finishes its work finishes. It does, but only if it can complete its last statement, and a blocked send is not completion. The runtime will happily keep a goroutine alive forever if the thing it is waiting on never happens. There is no garbage collector for "intent". Nobody is coming to tidy up a goroutine that is blocked on a send; that is your job, and the tool for the job is the context you already have in your hand and were not using on the send side.
the fix
Two changes, either of which alone would have been enough, both of which I made because once is a bug and twice is a habit.
First, buffer the channel so the losing senders never block:
results := make(chan Result, len(hosts))
With a slot per sender, every goroutine can deposit its result and return even if nobody ever reads it. The buffered results get collected by the garbage collector along with the channel once query returns. No reader required, no parking, no leak.
Second, and more importantly, give the senders the context too, so a send can lose gracefully instead of blocking:
go func(h string) {
r, err := fetch(ctx, h)
if err != nil {
return
}
select {
case results <- r:
case <-ctx.Done():
}
}(h)
Now a sender either lands its result or notices the context is done and leaves. The buffered channel handles the common case cheaply; the ctx.Done() arm handles the case where the caller has given up entirely and torn the context down. Belt and braces. I have come to prefer belt and braces in any code that spawns goroutines, because the failure mode of getting it wrong is not a loud one. It is a slow one, the worst kind.
The other thing I did, which I'd recommend to anyone running a Go service in anger, was wire the goroutine count straight onto the dashboard next to the request rate and the memory. It's one line of instrumentation and it turns this entire class of bug from "discovered by accident a fortnight later" into "the line goes up the moment you ship it". A flat goroutine count under steady load is one of the most reassuring graphs I know. A climbing one is the service telling you, quietly and in advance, that you've made exactly this mistake somewhere.
Passing ctx into fetch mattered too, since the abandoned downstream calls were themselves holding connections open for no reason. A leaked goroutine is rarely just a goroutine. It is a goroutine and everything it is still gripping.
the thing I keep relearning
Every goroutine you start is a small promise to make it finish. The compiler does not hold you to that promise. The race detector won't catch it, because there is no race; the senders are perfectly well-behaved, they are just waiting for a reader who is never coming. The only tools that catch it are a goroutine profile and the slightly paranoid instinct to ask, of every go func, "and how does this one end?"
If you cannot answer that question for a goroutine, you have not written a worker. You have written a leak that hasn't started leaking yet.