Ramblings of an aging IT geek
golang

threading context.Context through, and finally understanding why

How a leaked goroutine and a hung request taught me what context.Context is actually for, and the discipline of threading it through every call that can block.

For a long time context.Context was the parameter I copied from the signature above and passed along without really understanding it. ctx context.Context, always first, everyone does it, fine. Then a service started leaking goroutines under load, and learning why taught me what context is actually for. It isn't primarily a bag for request-scoped values, though people abuse it as one. It's a cancellation signal you thread through your call graph so that when the caller gives up, everything downstream gives up too.

the bug that explained it

We had an HTTP handler that called a backend service, which called a database. A client would connect, the backend would be slow, the client would time out and disconnect, and our handler would carry on regardless: still waiting on the backend, still holding a database connection, still occupying a goroutine. Under a burst of impatient clients the goroutine count climbed and didn't come back down. The work was orphaned. Nobody was waiting for the answer any more, but the machine was still computing it.

The reason is that none of our inner calls knew the client had gone. The cancellation never propagated, because we weren't passing the thing that carries it.

what context actually does

A Context carries a deadline and a cancellation signal. net/http gives you one per request via r.Context(), and crucially the server cancels it when the client's connection closes (it's also cancelled once ServeHTTP returns). The job is to take that context and pass it down through every function that might block, so that a cancellation at the top unwinds the whole stack.

The wrong version, the one we had:

func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    userID := r.URL.Query().Get("id")        // assumed: ID comes from ?id=
    result, err := h.backend.Fetch(userID)   // no ctx, no escape
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    json.NewEncoder(w).Encode(result)
}

Fetch has no idea anyone might want it to stop. The right version threads r.Context() all the way down:

func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()
    userID := r.URL.Query().Get("id") // assumed: ID comes from ?id=
    result, err := h.backend.Fetch(ctx, userID)
    if err != nil {
        if errors.Is(ctx.Err(), context.Canceled) {
            return // client's gone, nothing to send to
        }
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    json.NewEncoder(w).Encode(result)
}

func (b *Backend) Fetch(ctx context.Context, id string) (*Result, error) {
    row := b.db.QueryRowContext(ctx, "SELECT ... WHERE id = $1", id)
    // ...
}

The key is QueryRowContext rather than plain QueryRow. database/sql, like any library worth using, offers a context-aware variant of every blocking call (QueryContext, ExecContext, QueryRowContext and so on). When the client disconnects, ctx is cancelled, the query is abandoned (provided the driver honours the context), the connection is released, and the goroutine unwinds and returns. The orphaned work simply stops.

the rules I now follow

A handful of things I had to learn the hard way, mostly by getting go vet to shout at me.

  • Context is the first parameter, always named ctx, and you never store it in a struct. It flows through call arguments, not state.
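  • Never pass nil as a context. At the top of the call graph (main, tests, a background job) start from context.Background(); if you genuinely don't know yet which context belongs somewhere, pass context.TODO(), a marker that says "I haven't wired this up yet" and is greppable later.
  • If you create a context with context.WithCancel, context.WithTimeout or context.WithDeadline, you must call the returned cancel function, normally with defer cancel(); otherwise the child context and its resources linger until the parent is cancelled or the timer fires. This is the easy one to forget, and it's what go vet's lostcancel check complains about.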
  • A function that does anything blocking (network, disk, a channel receive) should take a context and respect it. A pure CPU function that returns quickly doesn't need one.

ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
defer cancel()
result, err := h.backend.Fetch(ctx, userID)

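That WithTimeout adds a deadline on top of the inherited cancellation, so even if the client waits patiently, we give up on a dead backend after two seconds instead of hanging for ever. The deadline propagates down exactly like cancellation does. When it expires, ctx.Err() returns context.DeadlineExceeded rather than context.Canceled, so the handler's Canceled check doesn't swallow it and the patient client still gets an error response.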

the shift in how I think about it

What clicked is that context isn't bureaucracy you thread through to satisfy a convention. It's the wiring that makes "stop, nobody wants this any more" travel from the top of your call stack to the bottom. Without it, every layer is an island that finishes its work whether or not anyone's still listening. With it, a disconnect at the edge ripples all the way down and frees everything.

The goroutine leak vanished the day we threaded the context properly. The graph that used to climb and never recover now climbs under load and falls straight back, because the moment a client gives up, so does everything we'd started on its behalf. That's the whole point of the parameter I'd spent two years copying without reading.