Ramblings of an aging IT geek
golang

threading context through a worker pool, not just an http handler

Plumbing context.Context through a background worker pool so a shutdown signal actually reaches the goroutines, and the deadlock I caused on the way.


Most writing about context.Context uses an HTTP handler as the example, because that's where it's easiest to see: r.Context() is handed to you, you thread it down, the client disconnects, the work stops. I'd internalised that. What I hadn't internalised was how to do the same thing for work that has no HTTP request at the top of it: a background worker pool chewing through a queue, where the "please stop" doesn't come from a client hanging up but from the process being asked to shut down.

I learned it by getting it wrong and deadlocking the whole thing on shutdown. Worth writing down.

where the context comes from when there's no request

In a worker pool there's no r.Context() to inherit. So you make the root context yourself, at the top of main, and cancel it on a signal. That root is the thing every worker watches.

ctx, cancel := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer cancel()

This is the parallel to r.Context() for a daemon. The HTTP server makes a per-request context for you; for a long-running process you make a per-process one, and SIGTERM is its "client disconnect". Once that exists, the discipline is the same as ever: ctx is the first argument, you pass it down, you select on ctx.Done() wherever you'd otherwise block forever.

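// worker processes jobs until the channel is closed or ctx is cancelled.
func worker(ctx context.Context, jobs <-chan Job) {
	for {
		select {
		case <-ctx.Done():
			// Shutdown requested: stop now, even if jobs are still queued.
			return
		case job, ok := <-jobs:
			if !ok {
				// Channel closed: no more work will arrive.
				return
			}
			process(ctx, job)
		}
	}
}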

The worker now has two ways to stop: the jobs channel closing (no more work) or the context cancelling (stop now, even mid-queue). Both matter. The first is a clean drain, the second is what SIGTERM triggers.

A diagram of a dispatcher feeding several worker goroutines, with a cancel signal reaching all of them

the deadlock i caused

Here's where I tripped. My dispatcher fed the jobs channel in a loop, and on shutdown I cancelled the context and then waited on a WaitGroup for everything to finish. But the dispatcher was still trying to push jobs into the channel. The workers had already returned on ctx.Done(), so nobody was reading; the channel filled, the dispatcher blocked on a send forever, and the WaitGroup, which counted the dispatcher as well as the workers, never completed. Clean shutdown turned into a hang that systemd eventually ended with SIGKILL, which is precisely the graceless exit I was trying to avoid.

The fix was to make the dispatcher honour the context too, on its send side:

select {
case <-ctx.Done():
	return
case jobs <- job:
}

A send that can block is just another blocking operation, and every blocking operation in a cancellable program needs a case <-ctx.Done() next to it. I'd remembered to give the workers an escape and forgotten to give the producer one. The cancellation reached half the system and stalled on the other half.

the lesson, restated for daemons

The mental model I had for HTTP (context is the channel for "nobody wants this work any more") carries straight over to background work. The only difference is that you create the root yourself and the cancellation comes from a signal rather than a socket. And the discipline tightens slightly: it isn't enough to make your consumers cancellable, you have to make your producers cancellable too, because a blocked send is a stuck goroutine just as surely as a blocked receive is.

signal.NotifyContext is the tidy piece that ties it together, turning a signal directly into a context cancellation with no manual signal.Notify boilerplate. With that at the top and a ctx.Done() case on every blocking operation underneath, the process now stops picking up work on SIGTERM, shuts down cleanly, and exits before systemd loses patience. Which is all I wanted in the first place, an afternoon of plumbing ago.