Ramblings of an aging IT geek
debugging

the leak was a map nobody ever deleted from

A slow memory leak in a long-running Go service turned out to be a cache map that only ever grew, and the fix was three lines.

The service did not crash on its own. That was the annoying part. It just got slower, until the pods got OOM-killed somewhere around the four-day mark, restarted clean, and started the climb again: a sawtooth on the memory graph with a period of about ninety hours. Nobody noticed for weeks because the restarts were tidy and the alert threshold was set just high enough that it never fired before Kubernetes reaped the pod.

The graph told the whole story once I actually looked at it: memory rose linearly, never plateaued, and never came back down under load. That is not a leak in the C sense; this is Go, with a garbage collector. It is something holding references that should have been let go. A linear, unbounded climb that tracks traffic almost always means a collection that only ever grows.

So I reached for pprof, which is the only sensible first move here.
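The heap endpoint only exists if the process serves it. A minimal sketch of the usual wiring, assuming a loopback debug listener (port 6060 in the command used here); startDebugServer is a hypothetical helper name, not something taken from the service:

```go
package main

import (
	"log"
	"net"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
)

// startDebugServer is a hypothetical helper. It serves the default mux,
// and with it the pprof handlers, on addr in a background goroutine.
// Bind it to loopback so profiles are not exposed to the outside world.
func startDebugServer(addr string) (net.Listener, error) {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, err
	}
	go func() {
		// Serve blocks until the listener is closed, then returns an error.
		if err := http.Serve(ln, nil); err != nil {
			log.Printf("debug server stopped: %v", err)
		}
	}()
	return ln, nil
}
```

Called early in startup as startDebugServer("localhost:6060"), this would make the heap profile available at /debug/pprof/heap on that port.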

go tool pprof http://localhost:6060/debug/pprof/heap
(pprof) top

The top line was a map[string]*sessionState inside a struct we used as an in-memory cache. We added entries when a session started. We read them throughout. We never, anywhere, deleted them. Every session that had ever passed through the process was still sat in that map, holding a little struct and a couple of buffers, forever.
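The shape of the problem, as a minimal sketch. Only the map type and the cache, mu and sessions names come from the post; the buf field and the newCache, start, get and size helpers are assumptions made for illustration:

```go
package main

import "sync"

// sessionState stands in for the per-session struct; its field is assumed.
type sessionState struct {
	buf []byte
}

// cache is a mutex-guarded map with an insert path and a lookup path,
// and no removal path at all.
type cache struct {
	mu       sync.Mutex
	sessions map[string]*sessionState
}

func newCache() *cache {
	return &cache{sessions: make(map[string]*sessionState)}
}

// start records a new session. Nothing in this type ever undoes it.
func (c *cache) start(id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.sessions[id] = &sessionState{buf: make([]byte, 4096)}
}

// get looks a session up; ok is false when id is unknown.
func (c *cache) get(id string) (s *sessionState, ok bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	s, ok = c.sessions[id]
	return s, ok
}

// size reports how many entries the map is still holding.
func (c *cache) size() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.sessions)
}
```

As long as the map holds the pointer, the garbage collector has to treat the struct and its buffers as live, so size can only go up.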

It is such an ordinary mistake. The map was written by someone (me, the blame says me) who was thinking about the happy path of looking things up, and never about the fact that things end. A cache without an eviction policy is not a cache, it is a memory leak with good intentions. The session ended, the user went home, and their entry sat in our heap waiting for an OOM kill it would never personally witness.

The fix itself was three lines: delete the entry when the session closes. For safety against sessions that never close cleanly, I also added a periodic sweep that drops anything older than its TTL.

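// end drops a session's entry once the session has closed.
// delete is a no-op if id is not in the map.
func (c *cache) end(id string) {
    c.mu.Lock()
    defer c.mu.Unlock() // released even if a later change adds an early return
    delete(c.sessions, id)
}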
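The periodic sweep is not shown here. A minimal sketch of how one could look, assuming each entry carries a lastSeen timestamp and a single TTL applies to the whole cache; the lastSeen field and the sweep and runSweeper names are illustrative, not the service's actual code:

```go
package main

import (
	"sync"
	"time"
)

// sessionState here has only the one field the sweep needs (assumed).
type sessionState struct {
	lastSeen time.Time
}

type cache struct {
	mu       sync.Mutex
	sessions map[string]*sessionState
}

// sweep deletes every entry whose lastSeen is more than ttl before now
// and returns how many entries it removed.
func (c *cache) sweep(now time.Time, ttl time.Duration) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	removed := 0
	for id, s := range c.sessions {
		if now.Sub(s.lastSeen) > ttl {
			delete(c.sessions, id) // deleting while ranging over a map is allowed in Go
			removed++
		}
	}
	return removed
}

// runSweeper calls sweep once per interval until stop is closed.
func (c *cache) runSweeper(interval, ttl time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case now := <-t.C:
			c.sweep(now, ttl)
		case <-stop:
			return
		}
	}
}
```

Started once at startup with something like go c.runSweeper(time.Minute, 30*time.Minute, stop), where both durations are placeholders. Passing now into sweep rather than calling time.Now inside it keeps the expiry logic testable without waiting on a real clock.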

That was it. The sawtooth flattened into a line. Memory now sits at a steady few hundred megabytes that rises and falls with concurrent sessions, exactly as it should, and the pods have not been OOM-killed since.

The lesson is not "remember to delete from your maps", though, fine, do that. The real lesson is that any unbounded collection in a long-running process is a leak you have not noticed yet. If something can be added to and the only removal path is process restart, you have written a slow timer that ends in an OOM. Now when I review anything with a map that lives longer than a request, I ask one question: what deletes from this, and when? If the answer is a shrug, that is the bug, sat there, waiting for day four.