Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

three days lost to a goroutine that started too early

A service that failed roughly one deploy in twenty with a nil map panic, traced over three days to a background goroutine that began work before initialisation finished, and the channel that fixed it.

A terminal full of stack traces from a crashed service

The bug fired about one deploy in twenty. That ratio is what cost me three days. If it had crashed every time I'd have had it before lunch. Instead it mostly worked, and then occasionally, on startup, it would panic with assignment to entry in nil map and exit, the orchestrator would restart it, the second start would be fine, and life went on. Nobody upstream noticed because the restart was quick. I noticed because I was the one reading the logs, and a panic on boot is the kind of thing that nags.

The first day I did the obvious thing and assumed the deploy was at fault. New config, new image, something racing with the rollout. I rolled back, it stopped, I rolled forward, it came back, intermittently. That told me nothing useful except that it was real and it was in the code, not the infrastructure. A whole day to confirm a bug exists is not a day I'm proud of, but at least it killed the "it's the platform" theory dead.

the panic was lying about where

The stack trace pointed at a map write deep in a metrics collector. The map was a field on a struct, populated in a Load() method that ran at startup. So my reasoning went: the map write panics because the map is nil, the map is nil because Load() hasn't run, therefore Load() is being skipped sometimes. I spent the second morning trying to find a code path where startup skipped Load(). There wasn't one. Load() always ran. I added logging to prove it, and the logging proved it: Load() ran every single time, before the panic.

That's the bit that unsticks these. The map was nil despite Load() running, which only makes sense if something touched the map before Load() finished. Not instead of it. Before it.

The structure was roughly this. A constructor returned the service, then Load() filled in the maps, and somewhere in the constructor a background goroutine was kicked off to do periodic collection:

func New(cfg Config) *Service {
    s := &Service{cfg: cfg}
    go s.collectLoop() // <- starts immediately
    return s
}

func (s *Service) Load() error {
    s.counters = make(map[string]int64)
    // ... populate from cfg ...
    return nil
}

collectLoop woke up on a ticker and wrote into s.counters. The ticker's first tick was normally a second or two away, comfortably after Load() had finished. Normally. But collectLoop also did one collection immediately on entry, before waiting for the first tick, and on a busy host where the goroutine got scheduled promptly, that immediate collection could land in the window between go s.collectLoop() and s.counters = make(...). Nil map. Panic. One deploy in twenty was just "how often the scheduler ran that goroutine before Load() got the CPU back".

A close-up of source code showing a goroutine launched before initialisation

I confirmed it with the race detector, which I should have reached for on day one and didn't, because I'd convinced myself this was a logic bug rather than a data race. go test -race lit it up instantly: a write to s.counters from the collector goroutine, concurrent with the write in Load(). The detector even printed both stacks side by side. Two days of squinting at logs, and the tool that finds exactly this had the answer in under a second. I've written that sentence before about other tools and I'll write it again.

the fix, and the better fix

The cheap fix is to not start the goroutine until initialisation is done. Move the go out of the constructor and into an explicit Start() that the caller invokes after Load():

func (s *Service) Start() {
    s.counters = make(map[string]int64)
    go s.collectLoop()
}

That works, and it's what I shipped first. But I didn't love it, because it relies on every caller remembering the order: construct, load, then start. The compiler won't enforce that. The next person to add a code path can get it wrong, and the failure mode is a one-in-twenty panic that takes three days to find. I've now met that failure mode and I'd rather not introduce a fresh way to summon it.

The better fix made the ordering impossible to get wrong. I gave the loop a ready channel and had it block until the maps existed before doing any work:

func New(cfg Config) *Service {
    s := &Service{cfg: cfg, ready: make(chan struct{})}
    go s.collectLoop()
    return s
}

func (s *Service) Load() error {
    s.counters = make(map[string]int64)
    // ... populate ...
    close(s.ready)
    return nil
}

func (s *Service) collectLoop() {
    <-s.ready // do nothing until Load has closed this
    for range time.Tick(s.cfg.Interval) {
        s.collect()
    }
}

Now the goroutine can start as early as it likes and it simply parks until Load() closes ready. The data dependency is expressed in the code, not in a comment or a convention. The race detector stays quiet, and so do I.

The lesson I keep relearning: an intermittent crash is almost never random. It's a window, and the question is always "what fits inside it". Here the window was the few microseconds between launching a goroutine and finishing initialisation, and the thing that fit inside it was a default tick I'd forgotten the collector did on entry. Three days, mostly because I trusted my reading of the stack trace over the tool that would have just told me. Next time, -race first.