The symptom was a sawtooth that only came back down when something killed it. Memory on the service climbed steadily over about four days, hit the limit, the orchestrator killed the pod, and the cycle started again. It was not a crash so much as a slow tide, and because the restart was clean nobody had felt enough pain to chase it. The graph just sat there, climbing, accusing me.
finding it
My first guess was the usual suspect: goroutines piling up because something forgot to cancel. So I scraped the pprof endpoints, which is the first thing I should always do and the last thing I ever remember to.
go tool pprof http://localhost:6060/debug/pprof/heap
The goroutine count was flat and boring. The heap profile was not. Nearly all the live allocation traced back to a single map, sitting behind a mutex, in a struct I had written months earlier and forgotten. It was a cache. I had been very pleased with it at the time.
the actual bug
The map keyed cached results by a request signature. Every distinct request added an entry. Nothing ever removed one. I had written the part that checks the cache and the part that fills it, and somewhere in my head I must have assumed eviction would simply happen, the way you assume the washing up will sort itself out.
The keys were not bounded either. They included a timestamp bucket and a user identifier, so the space of possible keys was effectively infinite. Every new combination minted a permanent entry. It was not a cache. It was a log that pretended to be a cache, written in the worst possible storage medium, growing forever until the kernel intervened.
the fix, and the lesson
The proper fix was to swap the naive map for an LRU with a hard size cap, so old entries fall out as new ones arrive and memory finds a ceiling. A handful of lines, a dependency I should have reached for in the first place, and the sawtooth flattened into a sensible plateau within a day.
The thing I keep relearning is that "cache" is a promise with two halves. Adding is the easy, gratifying half. Removing is the half that actually makes it a cache rather than a leak with good PR. Any time I write a map that lives for the lifetime of the process, the very next question now is: what takes things out of it, and what stops it growing without bound? If I cannot answer both, I have not written a cache. I have written next week's incident.