three days for a missing mutex

A terminal full of logs with a bug somewhere in the noise

Three days. That is what one missing lock cost me, and I want to write it down while the shame is still fresh enough to be instructive, because the shape of this bug is one I will absolutely meet again wearing a different hat.

The symptom was a service that mostly worked and occasionally, under load, returned nonsense. A counter that should only ever go up would briefly read lower than a value we had already seen. A map lookup would return a value that belonged to a different key. None of it reproduced on my machine. All of it reproduced in production, but only when traffic was high, which is precisely when you least want to be attaching a debugger.

the part where I made it worse

My first instinct, as ever, was to add logging. Print the counter before, print it after, print the key and the value either side of the lookup. And here is the cruel joke of concurrency bugs: the logging made it go away. Not fixed, hidden. Adding a fmt.Println does just enough work, takes just enough time, that the two goroutines stop interleaving in the unlucky way, and the bug retreats into the dark muttering that it will be back. I spent the better part of the first day "fixing" it with print statements and feeling clever, then watching it return the moment I removed them.

Source code with two goroutines touching the same shared state

This is the tell for a race condition, and I should have recognised it on day one instead of day two. If a bug disappears when you observe it, you are not looking at a logic error. You are looking at a timing problem, and you have a Heisenbug: the act of measurement changes the result. The moment you accept that, you stop trying to read the code into confessing and you reach for a tool that actually understands time.

the tool I should have used first

For Go that tool is the race detector, and it is almost free to turn on:

go test -race ./...
go build -race -o service-race .

I ran the race build against a load test on a staging box, threw a few thousand concurrent requests at it, and within seconds it printed exactly what I had failed to find by hand:

WARNING: DATA RACE
Write at 0x00c0000b4010 by goroutine 34:
  main.(*Cache).Set()
      /app/cache.go:48 +0x64

Previous read at 0x00c0000b4010 by goroutine 21:
  main.(*Cache).Get()
      /app/cache.go:39 +0x44

Two goroutines, one map, one of them writing while the other read, and no lock between them. The race detector does not guess. It instruments memory access and reports the actual unsynchronised pair, with both stack traces, the read and the write. Three days of squinting, answered in seconds by the thing I should have reached for first.

the actual bug

The code was the sort of thing that looks fine in review and is fine right up until two requests land at once.

type Cache struct {
    data map[string]int
}

func (c *Cache) Get(k string) int   { return c.data[k] }      // read
func (c *Cache) Set(k string, v int) { c.data[k] = v }        // write, no lock

A plain map, shared across goroutines, read and written with nothing guarding it. Go maps are explicitly not safe for concurrent use, and one concurrent read-and-write is enough to corrupt the internal structure, which is where the values-from-the-wrong-key weirdness came from. The fix is the boring, correct one: a mutex, held for the whole of each access.

type Cache struct {
    mu   sync.RWMutex
    data map[string]int
}

func (c *Cache) Get(k string) int {
    c.mu.RLock()
    defer c.mu.RUnlock()
    return c.data[k]
}

func (c *Cache) Set(k string, v int) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.data[k] = v
}

The fix in place, locks around the shared map

An RWMutex because reads vastly outnumbered writes and I did not want every reader queuing behind every other reader. Under the load that triggered the original bug the read lock gave back most of the contention I would have lost to a plain Mutex, and the race detector went quiet.

what I am taking away

Two things, written on a metaphorical sticky note. First, if a bug vanishes when you add logging, stop adding logging. That is the universe telling you it is a timing bug, and printf will only ever hide it. Reach for the race detector, or thread sanitiser, or whatever your language gives you, on day one, not day three.

Second, the real fix was not the mutex. The mutex was the patch. The real fix was admitting I had never decided who owned that map. Shared mutable state with no clear owner is a race waiting for enough traffic to find it, and the lock just makes the waiting safe. Next time I will decide the ownership before I write the struct, and ideally let one goroutine own the state and have the others talk to it over a channel, so there is nothing to lock because there is nothing shared. The mutex is the cure. Clear ownership is the prevention, and prevention does not cost three days.