Ramblings of an aging IT geek
debugging

A Memory Leak That Was a Map I Never Cleared

A service that grew its heap by a few hundred megabytes a day turned out to be one map that only ever had entries added to it.


The symptom was boring, which is usually how the good ones start. One of our backend services would run fine for about four days, then get OOM-killed and restart. Restart, four days, OOM, restart. We had quietly papered over it with a higher memory limit and a nightly bounce, and it had sat there for months as a known-and-tolerated thing. That is the dangerous category. Nobody is on fire, so nobody looks.

I finally looked because the nightly bounce started landing in the middle of a batch job and corrupting its output. So the leak became my problem on a Tuesday morning, with coffee, which is the only honest way to debug.

The graph that gave it away

The first useful thing was just plotting RSS over a week. Not the smoothed dashboard version, the raw per-pod number. It was a staircase. Flat, flat, flat, small step up, flat. Every step lined up with traffic, but it never came back down. A leak that tracks traffic and never recovers is almost always something you are adding to and never removing from. Not a buffer, not GC pressure, an actual growing collection.

A heap profile climbing in steps

This is a Go service, so I reached for pprof. We already expose /debug/pprof behind an internal-only listener, which past-me deserves a small amount of credit for. Grabbing a heap profile in production and comparing it to one taken an hour later is the whole game:

go tool pprof -inuse_space -base heap-1300.pb.gz heap-1400.pb.gz

The -base flag is the bit people forget. It diffs the two profiles, so you see what grew in that hour rather than everything that exists. The top entry was unambiguous: a single map[string]*sessionState, sitting behind a struct called dedupeCache, accounting for nearly all the growth.

The map that only knew how to grow

The code was exactly what you would guess once you knew where to look. Someone (and git blame says it was me, in 2023) had added a map to deduplicate inbound events by ID. Write the ID in, check it next time, skip if seen. Sensible. The problem was that there was a path to put entries in and no path to ever take them out. It was a cache with no eviction, which is just a memory leak with good intentions.

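// seen reports whether id has been processed before, recording it if not.
func (c *dedupeCache) seen(id string) bool {
    c.mu.Lock()
    defer c.mu.Unlock()
    if _, ok := c.entries[id]; ok {
        return true
    }
    // The only write path. Nothing anywhere ever deletes from c.entries.
    c.entries[id] = &sessionState{at: time.Now()}
    return false
}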

Every unique event ID we had ever processed lived in that map until the process died. At our volumes that was a few hundred thousand entries a day, each one holding its key string plus a pointer to a small heap-allocated struct. It did not grow fast. It grew forever, which is worse.

The fix was not clever and did not need to be. The dedupe only needs to catch retries within a short window, minutes, not the entire history of the universe. So I gave the entries a TTL and added a janitor goroutine that sweeps expired keys on a ticker. A bounded structure with eviction would have been tidier, and I did consider a proper LRU, but the TTL matched the actual requirement and was ten lines instead of a dependency.

The thing I keep relearning is that "cache" is a word that lets you skip the hard question. A real cache has a bound and an eviction policy. The moment you have a map you only ever add to, you have not written a cache, you have written a slow leak and called it something flattering. Heap goes flat now. The nightly bounce is gone, and so is the corrupted batch job.