Ramblings of an aging IT geek
debugging

the leak was a map, and the map was me

Chasing a slow memory leak in a long-running Go service that turned out to be a cache map I kept adding entries to and never removed any from.

A service had been climbing in memory for weeks. Not fast, not alarming on any given day, but plot RSS over a fortnight and the line never came back down. Restart it and the clock reset. Classic leak shape, the kind that hides because nothing ever crashes in the time anyone is looking.

My first instinct was the usual suspects: goroutines that never exit, a channel nobody drains, a time.Ticker left running. So that is where I looked first, because that is where I have been bitten before, and I pulled a heap profile alongside to see what was actually being retained.

go tool pprof http://localhost:6060/debug/pprof/heap

The heap profile pointed somewhere I didn't expect: a single map, holding tens of thousands of entries that had no business still being alive. The goroutine count was flat. It wasn't concurrency at all.
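One cheap way to tell these two failure shapes apart, before reaching for a full profile, is to sample the goroutine count and the heap size side by side over time. The snippet below is a minimal sketch of that idea, not what the service actually ran; the snapshot function name is made up for illustration. A flat goroutine count next to a steadily growing heap points at retained data rather than stuck goroutines.

```go
package main

import (
	"fmt"
	"runtime"
)

// snapshot is a hypothetical helper: it returns the number of live
// goroutines and the bytes of allocated heap objects at this moment.
func snapshot() (goroutines int, heapAlloc uint64) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m) // briefly stops the world, so sample sparingly
	return runtime.NumGoroutine(), m.HeapAlloc
}

func main() {
	g, h := snapshot()
	// Log this periodically and compare readings days apart:
	// goroutines flat while heap grows suggests retained data.
	fmt.Printf("goroutines=%d heap_alloc=%d bytes\n", g, h)
}
```

The same pprof endpoint also serves a goroutine profile at /debug/pprof/goroutine, which is the place to look if the count is the number that keeps climbing.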

Then I remembered writing it. Months ago I had added a little in-memory cache to avoid recomputing something per request. Keyed by a request-scoped identifier. I added entries on the way in. I never, anywhere, removed them. Every unique key that ever passed through stayed resident for the life of the process. It wasn't a cache. It was a museum.
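The shape of the bug, reduced to a sketch. This is not the original code; the leakyCache type, the result struct and the req-N keys are all invented for illustration, and the real thing would also have needed a mutex. What matters is that there is an assignment into the map and no delete anywhere.

```go
package main

import "fmt"

// result is a hypothetical stand-in for whatever the real cache stored.
type result struct{ value int }

// leakyCache mirrors the bug: entries go in, nothing ever takes them out.
// Not safe for concurrent use; a real service would need a lock as well.
type leakyCache struct {
	entries map[string]result
}

func newLeakyCache() *leakyCache {
	return &leakyCache{entries: make(map[string]result)}
}

// get returns the cached result for key, computing and storing it on a miss.
// There is no delete anywhere, so every unique key stays resident.
func (c *leakyCache) get(key string, compute func() result) result {
	if r, ok := c.entries[key]; ok {
		return r
	}
	r := compute()
	c.entries[key] = r
	return r
}

func main() {
	c := newLeakyCache()
	// Request-scoped keys are almost always unique, so nearly every call
	// is a miss and the map only ever grows.
	for i := 0; i < 10000; i++ {
		key := fmt.Sprintf("req-%d", i)
		c.get(key, func() result { return result{value: i} })
	}
	fmt.Println(len(c.entries)) // prints 10000
}
```

With keys that rarely repeat, the hit rate is close to zero as well, so the map costs memory without saving much work.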

Why pprof didn't say "cache"

This is the part worth keeping. pprof tells you where memory is allocated and retained, not what you meant it to be. The line it blamed was the map assignment, which was correct and useless at the same time. The actual bug was the absence of code: the eviction I never wrote. You can't profile something that isn't there. You have to read the lifecycle and ask, for every place you put something in, where does it come out.

The fix

Two options. Bound the thing, or give entries a lifetime. I didn't need the hand-rolled cache to be clever, so I swapped it for a small LRU with a fixed size:

cache, _ := lru.New[string, result](4096)

Fixed capacity, evicts the least recently used entry once it is full, done. RSS now climbs to a plateau and sits there, flat as a millpond, which is what a healthy long-running service should look like.
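For completeness, the other option, giving entries a lifetime, could look something like the sketch below. It is not what I shipped, and every name in it (ttlCache, fakeClock, sweep) is made up for illustration. The clock is injected so that expiry can be exercised without sleeping.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fakeClock is a hypothetical test clock; production code would pass time.Now.
type fakeClock struct{ t time.Time }

func (f *fakeClock) now() time.Time          { return f.t }
func (f *fakeClock) advance(d time.Duration) { f.t = f.t.Add(d) }

type entry struct {
	value   string
	expires time.Time
}

// ttlCache is a sketch of a cache whose entries expire after a fixed lifetime.
type ttlCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time
	entries map[string]entry
}

func newTTLCache(ttl time.Duration, now func() time.Time) *ttlCache {
	return &ttlCache{ttl: ttl, now: now, entries: make(map[string]entry)}
}

func (c *ttlCache) set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = entry{value: value, expires: c.now().Add(c.ttl)}
}

// get returns the value if present and unexpired; an expired entry is
// deleted on the way out.
func (c *ttlCache) get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok {
		return "", false
	}
	if !c.now().Before(e.expires) {
		delete(c.entries, key)
		return "", false
	}
	return e.value, true
}

// sweep removes every expired entry and returns how many it dropped.
// Without it, keys that are never read again would still pile up.
func (c *ttlCache) sweep() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	now, removed := c.now(), 0
	for k, e := range c.entries {
		if !now.Before(e.expires) {
			delete(c.entries, k)
			removed++
		}
	}
	return removed
}

func (c *ttlCache) size() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.entries)
}

func main() {
	clk := &fakeClock{}
	c := newTTLCache(time.Minute, clk.now)
	c.set("req-1", "a")
	c.set("req-2", "b")
	clk.advance(2 * time.Minute)
	fmt.Println(c.sweep(), c.size()) // prints 2 0
}
```

The catch with a lifetime alone is that it bounds how long an entry lives, not how many there are, so a burst of unique keys can still balloon the map until the next sweep. That is part of why a fixed-size LRU was the simpler answer here.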

The lesson isn't "maps leak", because they don't. The lesson is that an unbounded collection in a process that never restarts is a leak with extra steps, and the one who wrote it rarely sees it because at the time it looked like a sensible optimisation. Every map you add to over the lifetime of a service needs a story for how things leave. If you can't tell that story, you've found your leak.