The graph was a perfect ramp. Resident memory on one of our Go services climbed from about 180MB at deploy to a little over 2GB roughly twelve days later, at which point the OOM killer stepped in and the pod restarted, and the ramp began again. Sawtooth. Textbook. The sort of shape that tells you straight away this isn't a spike, it's an accumulation. Something gets added and nothing ever gets removed.
The annoying part is that we'd lived with it for months by treating the symptom. The pod restarted itself, the memory limit was generous, nobody got paged. It was only when traffic grew and the ramp got steeper that the restarts started landing in the middle of the day, dropping in-flight requests, and somebody finally said the obvious thing: this isn't a memory limit problem, it's a leak.
pprof tells you where, not why
Go makes the "where" embarrassingly easy. The service already exposed net/http/pprof, so I pulled a heap profile off a pod that had been running for a week:
go tool pprof -inuse_space http://localhost:6060/debug/pprof/heap
At the interactive pprof prompt, top put nearly all of the live heap behind a single allocation site, and list on that function pointed at one line: a map[string]*entry being written to inside a request handler. The map was a cache. A perfectly reasonable idea, badly finished.
the cache that only ever grew
Here is roughly what it looked like, with the names changed to protect the guilty (me, in 2021):
var cache = map[string]*entry{}
var mu sync.Mutex

func lookup(key string) *entry {
	mu.Lock()
	defer mu.Unlock()
	if e, ok := cache[key]; ok {
		return e
	}
	e := expensiveLoad(key)
	cache[key] = e
	return e
}
You can see it immediately once it's in front of you. There is no eviction. There is no TTL, no size bound, no LRU, nothing. Every distinct key we ever saw stayed in that map until the process died. And the keys weren't a small fixed set. They included a customer identifier and a date component, so the keyspace grew without limit over time. The cache wasn't caching, it was hoarding.
It had worked fine in testing because in testing you hit the same dozen keys. It had worked fine for the first few weeks in production because the keyspace was still small. It only became a leak once enough real traffic had passed through to fill it, which is exactly why the ramp took a fortnight rather than an hour.
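The growth is easy to reproduce in miniature. The sketch below replays some made-up traffic through the same lookup function and reports how many entries the map ends up holding. The key layout (customer ID plus calendar day), the entry type, the stub loader, and the simulate helper are all assumptions for illustration, not the real service's code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry is a stand-in for the real cached value.
type entry struct{ payload []byte }

var (
	mu    sync.Mutex
	cache = map[string]*entry{}
)

// expensiveLoad is a stub; the real loader is not shown in this post.
func expensiveLoad(key string) *entry { return &entry{payload: make([]byte, 1024)} }

// lookup is the unbounded cache from above: entries go in, nothing comes out.
func lookup(key string) *entry {
	mu.Lock()
	defer mu.Unlock()
	if e, ok := cache[key]; ok {
		return e
	}
	e := expensiveLoad(key)
	cache[key] = e
	return e
}

// cacheKey is a hypothetical key layout: customer ID plus calendar day.
func cacheKey(customer int, day time.Time) string {
	return fmt.Sprintf("%d:%s", customer, day.Format("2006-01-02"))
}

// simulate empties the cache, replays `days` days of traffic from
// `customers` customers (each calling lookup hitsPerDay times a day),
// and returns the number of entries left in the map.
func simulate(customers, days, hitsPerDay int) int {
	mu.Lock()
	cache = map[string]*entry{}
	mu.Unlock()

	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	for d := 0; d < days; d++ {
		day := start.AddDate(0, 0, d)
		for c := 0; c < customers; c++ {
			for h := 0; h < hitsPerDay; h++ {
				lookup(cacheKey(c, day))
			}
		}
	}

	mu.Lock()
	defer mu.Unlock()
	return len(cache)
}

func main() {
	// 100 customers x 12 days = 1200 distinct keys, however often each is hit.
	fmt.Println(simulate(100, 12, 5)) // prints 1200
}
```

The repeat hits per day don't change the result. Only the number of distinct keys does, and with a date in the key that number goes up every single day.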
the fix, and the fix to the fix
The quick fix was to put a bound on it. I reached for an off-the-shelf LRU rather than hand-rolling one, because hand-rolled caches are how you end up writing this post a second time:
cache, _ := lru.New[string, *entry](10000)
That alone flattened the graph. RSS settled at around 350MB and stayed there. The sawtooth was gone within a day.
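To show what a size bound actually does, here is a toy LRU built on container/list from the standard library. It is emphatically not the library I used and not something I'd ship, given what I just said about hand-rolled caches. It's a sketch with invented names (boundedCache, get, size) that exists only to make the eviction step visible.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// entry is a stand-in for the real cached value.
type entry struct{ value string }

// item is what each list node holds, so eviction can find the map key.
type item struct {
	key string
	val *entry
}

// boundedCache is a toy LRU: at most limit entries, least recently used out first.
type boundedCache struct {
	mu    sync.Mutex
	limit int
	order *list.List               // front = most recently used
	items map[string]*list.Element // key -> node in order
}

func newBoundedCache(limit int) *boundedCache {
	return &boundedCache{limit: limit, order: list.New(), items: make(map[string]*list.Element)}
}

// get returns the cached entry for key, calling load on a miss.
// Like the original lookup, it holds the lock while loading.
func (c *boundedCache) get(key string, load func(string) *entry) *entry {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el) // a hit refreshes recency
		return el.Value.(*item).val
	}
	e := load(key)
	c.items[key] = c.order.PushFront(&item{key: key, val: e})
	if c.order.Len() > c.limit {
		// This is the line the original cache never had.
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*item).key)
	}
	return e
}

// size reports how many entries are currently held.
func (c *boundedCache) size() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.items)
}

func main() {
	c := newBoundedCache(3)
	load := func(k string) *entry { return &entry{value: k} }
	for i := 0; i < 1000; i++ {
		c.get(fmt.Sprintf("key-%d", i), load)
	}
	fmt.Println(c.size()) // prints 3: a thousand distinct keys, three survivors
}
```

The whole difference between this and the leaking version is the four lines after the length check. Everything else is bookkeeping so that those lines know which entry to throw away.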
The fix to the fix was to ask whether we needed the cache at all. The thing it was caching was cheap to recompute and the hit rate, once I actually measured it, was about 40%. So for a chunk of that keyspace we were holding memory hostage for a coin-flip's worth of benefit. We kept the cache because the 40% mattered under load, but the measuring was the point. I'd added it years earlier on a hunch and never gone back to check whether the hunch was true.
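Measuring a hit rate needs very little: two counters and a division. The sketch below is one way to do it with sync/atomic, assuming nothing about how the real service exports metrics. cacheStats, record and hitRate are names invented for the example, and the key sequence in main is made up.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cacheStats counts lookups by outcome. The counters are atomic so
// record can be called from concurrent request handlers.
type cacheStats struct {
	hits   atomic.Int64
	misses atomic.Int64
}

// record notes one lookup; hit is true when the key was already cached.
func (s *cacheStats) record(hit bool) {
	if hit {
		s.hits.Add(1)
		return
	}
	s.misses.Add(1)
}

// hitRate returns hits / (hits + misses), or 0 before any lookups.
func (s *cacheStats) hitRate() float64 {
	h, m := s.hits.Load(), s.misses.Load()
	if h+m == 0 {
		return 0
	}
	return float64(h) / float64(h+m)
}

func main() {
	var stats cacheStats
	cache := map[string]string{}

	// A made-up request sequence: "a" repeats, "b" and "c" are seen once.
	for _, k := range []string{"a", "b", "a", "c", "a"} {
		_, ok := cache[k]
		stats.record(ok)
		if !ok {
			cache[k] = "loaded:" + k
		}
	}

	// Two hits (the second and third "a") out of five lookups.
	fmt.Printf("hit rate: %.0f%%\n", stats.hitRate()*100) // prints "hit rate: 40%"
}
```

In a real service the two counters would be exported to whatever metrics system is already in place, so the ratio can be read over time instead of once.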
The lesson isn't "maps leak", because they don't, they do exactly what you tell them. The lesson is that an unbounded cache is not a cache, it's a memory leak with good intentions. If you put something into a map in a long-lived process, you owe it an answer to one question: what takes it back out again? If you can't name the thing that removes entries, you've not written a cache. You've written a slow-motion OOM.