Ramblings of an aging IT geek
debugging

the leak was a map that only ever grew

A slow memory climb in a long-running service turned out to be a cache map keyed by request ID that nothing ever removed entries from.

[Figure: a terminal showing a steadily climbing memory graph]

The service didn't crash, not at first. It just got fat. RSS climbed in a clean, patient diagonal that took about four days to go from comfortable to alarming, at which point it got OOM-killed and the cycle started again. A sawtooth on the memory graph is one of the most honest signals you'll ever get: something is being added and never removed.

I went looking with a heap profile, expecting some horror in a third-party library. It was worse than that, because it was mine. A map[string]*result keyed by request ID, used as a little in-flight cache so concurrent handlers wouldn't redo the same work. Entries went in. Entries never came out. Every unique request ID we'd ever seen since the last restart was still sitting in there, holding its result, holding everything the result pointed at.

The original code deleted from the map in a defer at the end of the handler. Somewhere in a refactor a few months back, the defer had moved inside an if branch that didn't always run. So on the common path, the entry was inserted and then orphaned. Classic. The map wasn't a cache, it was a leak with good intentions.
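
For anyone who wants to see the shape of it, here is a minimal sketch of that kind of bug. It is not the real handler: the inflight type, leakyHandle, and the slow flag standing in for "the branch that didn't always run" are all made up for illustration. The point is only that in Go a defer is registered when execution reaches it, so a defer inside an if never gets registered on the other path.

```go
package main

import (
	"fmt"
	"sync"
)

// result is a stand-in for whatever the handler computes.
type result struct{ body string }

// inflight is a hypothetical reconstruction of the cache:
// request ID -> result, guarded by a mutex.
type inflight struct {
	mu sync.Mutex
	m  map[string]*result
}

func newInflight() *inflight { return &inflight{m: make(map[string]*result)} }

func (c *inflight) put(id string, r *result) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[id] = r
}

func (c *inflight) remove(id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.m, id)
}

func (c *inflight) size() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.m)
}

// leakyHandle mirrors the post-refactor shape. The cleanup defer is only
// registered when the (invented) slow branch runs, so every other request
// leaves its entry behind.
func leakyHandle(c *inflight, id string, slow bool) {
	c.put(id, &result{body: "work for " + id})
	if slow {
		defer c.remove(id) // registered on this branch only
	}
	// ... the rest of the handler would go here ...
}

func main() {
	c := newInflight()
	for i := 0; i < 1000; i++ {
		// One request in ten takes the branch that cleans up.
		leakyHandle(c, fmt.Sprintf("req-%d", i), i%10 == 0)
	}
	fmt.Println("entries left behind:", c.size()) // prints: entries left behind: 900
}
```

Nothing about this looks wrong in a diff. The defer is still there, the delete is still there, and the compiler has no opinion.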

The fix was a one-liner: move the delete back out to an unconditional defer right after the insert, so the entry's lifetime matches the request's no matter which path the handler takes. I added a metric exporting the map's length too, because a cache that's meant to stay small should be screaming the moment it doesn't, rather than letting the kernel be the one to tell me four days later.
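
A sketch of what the repaired shape might look like, under assumptions: the package-level map, the handle function, and the metric name inflight_cache_len are hypothetical, and the standard library's expvar package stands in for whatever metrics system is actually in use. The two things it tries to show are the defer registered unconditionally, straight after the insert, and the map's length exposed as a number something can alert on.

```go
package main

import (
	"expvar"
	"fmt"
	"sync"
)

// result is a stand-in for whatever the handler computes.
type result struct{ body string }

var (
	mu       sync.Mutex
	inflight = make(map[string]*result) // request ID -> result
)

// inflightLen reports the current number of cached entries.
func inflightLen() int {
	mu.Lock()
	defer mu.Unlock()
	return len(inflight)
}

// handle inserts the entry and registers its removal immediately,
// before any branching, so every return path cleans up.
func handle(id string, slow bool) {
	mu.Lock()
	inflight[id] = &result{body: "work for " + id}
	mu.Unlock()

	defer func() {
		mu.Lock()
		delete(inflight, id)
		mu.Unlock()
	}()

	if slow {
		// Branch-specific work goes here; cleanup no longer depends on it.
		return
	}
	// ... the rest of the handler would go here ...
}

func main() {
	// Hypothetical metric name. expvar serves published values as JSON
	// at /debug/vars on the default HTTP mux when a server is running.
	expvar.Publish("inflight_cache_len", expvar.Func(func() any {
		return inflightLen()
	}))

	for i := 0; i < 1000; i++ {
		handle(fmt.Sprintf("req-%d", i), i%10 == 0)
	}
	fmt.Println("inflight_cache_len =", expvar.Get("inflight_cache_len").String())
	// prints: inflight_cache_len = 0
}
```

With the length exported, an alert on "this number is above a few hundred" would have fired within minutes rather than days.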

The leak wasn't the map. The map was fine. The leak was a delete that had quietly wandered behind a conditional, and the lesson, again, is that the dangerous bugs are the ones that look like exactly what they're supposed to be.