Ramblings of an aging IT geek
golang

learning to thread context.Context all the way down

A practical account of wiring context.Context through a Go service properly, what it buys you, and the mistakes I made first.


I spent a long time treating context.Context as a thing you accept because the linter wants you to, then ignore. You take a ctx, you pass it to the database driver, and you never think about it again. That works right up until something goes wrong slowly: a request that should have been cancelled keeps running, a deploy hangs on shutdown because forty goroutines are still blocked on a call nobody is waiting for, a downstream service falls over and your service cheerfully keeps hammering it. All of those are the same bug, and the fix is the same discipline. Thread the context through, honestly, all the way to the bottom.

The point of context is not the value bag. It's the cancellation signal. When the thing that asked for work goes away, every goroutine doing that work should find out and stop. That only happens if the ctx actually reaches the blocking call. One context.Background() halfway down the stack and the chain is broken; everything below it is now uncancellable, and you usually find out at the worst time.

the rule I settled on

Context is the first argument, it's named ctx, and you never store it in a struct.

func (s *Service) Fetch(ctx context.Context, id string) (*Record, error) {
    row := s.db.QueryRowContext(ctx, selectByID, id)
    // ...
}

That QueryRowContext matters. The plain QueryRow ignores cancellation entirely. Most stdlib and library calls that can block now have a ...Context variant, and if a library doesn't take a context at all in 2023, I treat that as a smell worth a second look. The whole value of the pattern is that the leaf calls respect it. A context that's threaded perfectly through your own code but handed to a non-context driver method is decoration.

Storing context in a struct is the tempting shortcut, and it's wrong for a simple reason: a struct outlives a single call, and a context is scoped to one. Stash it on the struct and you've frozen one request's deadline onto an object that serves many. The compiler won't stop you, and neither will go vet. The context package documentation tells you not to, though, and third-party linters such as containedctx will grumble. They're right to.

where the boundaries are

You create a context at the edges and derive everything else from it. In an HTTP handler, the request already carries one:

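func (h *Handler) Get(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context() // cancelled when the client disconnects
    rec, err := h.svc.Fetch(ctx, chi.URLParam(r, "id"))
    if err != nil {
        if ctx.Err() != nil {
            return // the client went away; nobody is left to read a response
        }
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    // JSON is an assumption here; render rec however the handler normally does.
    _ = json.NewEncoder(w).Encode(rec)
}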

r.Context() is cancelled when the client disconnects, which is free cancellation propagation if you actually use it. The mirror of that is the outbound call, where you usually want to add a deadline so you're not at the mercy of someone else's timeouts:

ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
defer cancel()
resp, err := client.Do(req.WithContext(ctx))

The defer cancel() is not optional, and this is the mistake I made for ages. WithCancel, WithTimeout and WithDeadline all return a cancel func, and if you don't call it you leak the child context, and its timer if it has one, until the deadline passes or the parent is cancelled. With WithCancel there is no deadline to save you, so the leak lasts as long as the parent does. go vet flags the obvious cases now (the lostcancel check); it won't catch every one. Just call cancel. Always defer it on the line after you create the context, so the two never drift apart.

the bottom of the stack

The part that took me longest to internalise: cancellation only does anything if something is actually selecting on it. For your own loops and waits, that means watching ctx.Done():

func (w *Worker) run(ctx context.Context) error {
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
            if err := w.tick(ctx); err != nil {
                return err
            }
        }
    }
}

Without the case <-ctx.Done(), this worker ignores every cancellation in the system. The context being passed in is doing nothing. I had a few of these: contexts dutifully threaded through, never once consulted. They looked correct in review because the plumbing was all there. The plumbing isn't the point. Someone has to read the signal at the end of it.


what it bought us

Once it was honest from edge to leaf, two long-standing annoyances just went away. Deploys got faster, because graceful shutdown actually drained: cancel the root context, every in-flight request unwinds, the process exits in a second or two instead of waiting out a hard timeout. And a downstream outage stopped cascading, because a cancelled request stopped its own downstream calls instead of piling them up behind a dead service.

None of this is clever. It's just consistent. Context is first arg, named ctx, never stored, derived at boundaries, cancel always deferred, and somebody at the bottom actually selects on Done(). Get those right and a lot of vague distributed-systems unpleasantness turns into ordinary, well-behaved code. Get them wrong and you have a service full of work that nobody can stop.