Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the leak was a map i kept adding to and never deleting from

Chasing a slow memory climb in a Go service down to a cache map that only ever grew, and the fix that took one line.

A terminal showing a process being debugged

The service did not crash. That is the first thing that made it annoying. It just got slowly, inexorably fatter over about four days, RSS climbing in a line so steady you could have set a watch by it, until the kernel's OOM killer stepped in and put it out of its misery. Then it restarted, served traffic happily, and began the same patient climb all over again. A sawtooth on the memory graph, four days a tooth.

A leak that takes four days to matter is the worst kind, because you cannot iterate against it. You make a change, you deploy, and then you wait the better part of a working week to find out whether it helped. By the third tooth I stopped guessing and started measuring properly.

the trap of "restarting fixes it"

The honest confession is that we had a cron job restarting the thing every night for a fortnight before I admitted it was a leak. The nightly bounce kept RSS under the ceiling, the graph looked like a tidy sawtooth, and nobody was paged. It is very easy to let a workaround become the fix, especially when the workaround is invisible and the real fix means actually understanding the problem. The day the nightly restart slipped and the OOM killer beat it to the punch was the day I had to stop pretending.

where the memory actually was

It is a Go service, so I had pprof available, which is one of the genuinely good things about the language. I wired up the heap profiler, hit the endpoint on a process that had been running about two days, and pulled the profile down:

go tool pprof http://localhost:6060/debug/pprof/heap

Then top inside the tool to rank allocations by inuse space. One line dominated, and it was not even close. A map[string]*entry that some long-ago version of me had introduced as a cache. The profile pointed straight at the function that populated it.

A small but important detail: I took two profiles, a couple of hours apart, and compared them with the -base flag. That comparison is the part people skip, and it is what turns "this map is big" into "this map is the thing growing." A snapshot tells you what is large right now, which might just be a legitimately large buffer. The diff between two snapshots tells you what is leaking, because a leak is by definition the thing that keeps getting bigger while everything else holds steady. The base map was already large on the first profile and visibly larger on the second, with nothing else moving. That was the smoking gun, and it took about thirty seconds once I knew to look for it.

A close-up of source code on a screen

The cache was keyed on a request identifier. Every distinct request added an entry. That was the entire bug, sitting there in plain sight once the profiler told me where to look. There was code to add to the map. There was code to read from the map. There was, anywhere, no code to ever remove anything from the map. The keys were unbounded by construction, because request identifiers are unique, so the map's job was effectively to remember every request the process had ever seen until the heat death of the host.

Here is the shape of it, lightly anonymised:

type cache struct {
    mu      sync.Mutex
    entries map[string]*entry
}

func (c *cache) put(id string, e *entry) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.entries[id] = e // grows forever, never trimmed
}

The original intent had been to deduplicate work within a request's lifetime, which is a sensible thing to want. The mistake was reaching for a plain map to do it and then never giving any thought to eviction. A cache without an eviction policy is not a cache. It is a memory leak with good manners.

Lines of code with a function highlighted

the fix, and the better fix

The one-line fix was to delete each entry when the request that owned it finished, which is all the cache was ever meant to live for. The map went from "every request ever" to "requests currently in flight," which is bounded by concurrency rather than by uptime. The sawtooth flattened into a line. I deleted the nightly restart cron the same afternoon, with no small amount of satisfaction.

The better fix, the one I actually shipped a week later once I trusted the diagnosis, was to stop hand-rolling caches at all and use a library with a real eviction policy and a size bound. If a cache cannot grow past N entries, the worst case is bounded and visible, and the OOM killer never gets a vote.

There was a subtler trap hiding underneath, which I only noticed because I went looking. The entries held pointers to a struct that itself held a slice. Even if I had cleared the top-level map, a stale reference to one of those structs held elsewhere would have kept its whole slice alive, because Go's collector keeps everything reachable, and reachability is transitive. The leak was a map, but it could just as easily have been a forgotten goroutine holding a channel, or a slice I resliced down without copying so the backing array stayed pinned. They all reduce to the same sentence: something is still reachable that I have mentally finished with. The garbage collector is not psychic. It cannot tell the difference between "I still need this" and "I forgot to let go."

That is the real reason I now reach for a bounded cache by default rather than a bare map. It is not that the map is wrong, it is that the map makes the unbounded case the easy case and the bounded case the thing you have to remember to write. Good defaults invert that. The library makes "this has a maximum size" the path of least resistance, and the failure mode becomes an eviction you can observe rather than a climb you have to wait four days to even see.

The lesson is not "Go leaks," because it does not, the garbage collector did exactly its job: that map was reachable, so it was kept, correctly and forever. The lesson is that the garbage collector cannot save you from a data structure you keep a live reference to and never prune. Reachable memory is not free memory. Any time I add something to a long-lived collection now, the very next question I make myself answer is who removes it, and when. If I cannot answer that, I have not written a cache. I have written next week's incident.