Ramblings of an aging IT geek
debugging

the leak was a map i kept adding to and never pruned

Tracking a slow Go memory leak down to a per-connection cache map that gained entries forever and never lost them, plus the pprof session that found it.

The service didn't crash. It just got slowly fatter, day after day, until the OOM killer took it out around the four-day mark, at which point it restarted and started the climb again. A sawtooth on the memory graph that anyone who's run a leaky daemon will recognise on sight. A leak with a clock on it is at least a predictable leak, but it's still a 3am restart waiting to happen.

Go has a garbage collector, so people are sometimes surprised it can leak at all. It can. The GC frees what's unreachable; it cannot free what you're still holding a reference to. If you keep adding entries to a long-lived map and never delete them, that memory is reachable forever, and the collector is doing exactly its job by leaving it alone. That's not a leak in the C sense; it's a logic bug that walks like one.
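To make the reachability point concrete, here's a minimal sketch rather than anything from the real service. The 1 KiB payload, the entry count and the helper names (fill, drain, liveHeap) are all made up for the demo. It forces a collection while a package-level map still holds its entries, then again after deleting them, and compares the live heap each time.

```go
package main

import (
	"fmt"
	"runtime"
)

// sessionState stands in for whatever each map entry points at.
// The 1 KiB payload is an arbitrary size chosen to make the effect visible.
type sessionState struct {
	payload [1024]byte
}

// held is package-level, so everything it references stays reachable
// for the life of the process.
var held = map[int]*sessionState{}

// liveHeap forces a full collection and reports the bytes still allocated.
func liveHeap() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

// fill adds n entries to the long-lived map.
func fill(n int) {
	for i := 0; i < n; i++ {
		held[i] = &sessionState{}
	}
}

// drain deletes every entry, which is what makes the values unreachable.
func drain() {
	for k := range held {
		delete(held, k)
	}
}

func main() {
	fill(20000)
	full := liveHeap() // the GC ran, but the entries were still reachable
	drain()
	empty := liveHeap() // same GC, and now the values can be freed
	fmt.Printf("while held: %d KiB, after delete: %d KiB\n", full/1024, empty/1024)
}
```

The first forced collection can't reclaim the payloads because the map still points at them. The second can, because delete removed the only references. The exact numbers will vary by Go version and platform.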

Finding it was a matter of turning on pprof, which in a long-running service costs almost nothing and is worth wiring in by default:

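import (
    "log"
    "net/http"
    _ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
)

// In main(), before the service starts its real work:
go func() {
    log.Println(http.ListenAndServe("localhost:6060", nil))
}()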

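Then grab two heap profiles a few hours apart and diff them. Save the first one to a file as the baseline, then point pprof at the live endpoint with -base:

curl -s -o heap.0.pb.gz http://localhost:6060/debug/pprof/heap
go tool pprof -base heap.0.pb.gz http://localhost:6060/debug/pprof/heap
(pprof) top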

The diff pointed straight at one allocation site. We had a map[connID]*sessionState that recorded some per-connection metadata, keyed by connection ID. Connections came and went all day. We inserted into the map when a connection opened. We never deleted from it when one closed. Every connection that had ever existed was still sat in that map, holding its little sessionState alive, and the GC was dutifully keeping all of it because the map was reachable from a package-level variable.

The fix is a single delete call, wrapped in a defer in the connection's teardown path:

defer func() {
    mu.Lock()
    delete(sessions, id)
    mu.Unlock()
}()

That's it. Memory went flat. The graph turned from a sawtooth into a boring horizontal line, which is the most satisfying graph in operations.
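For anyone who wants the whole lifecycle in one place, here's a self-contained sketch of the pattern. The map type and the mu, sessions and id names come from the snippets above. The sessionState field, the handleConn and liveSessions helpers and the address string are hypothetical stand-ins, since the real handler isn't shown.

```go
package main

import (
	"fmt"
	"sync"
)

type connID int

// sessionState holds per-connection metadata. The field is hypothetical.
type sessionState struct {
	remoteAddr string
}

var (
	mu       sync.Mutex
	sessions = map[connID]*sessionState{}
)

// handleConn registers the connection, runs serve, and always
// unregisters on the way out, including when serve panics.
func handleConn(id connID, addr string, serve func()) {
	mu.Lock()
	sessions[id] = &sessionState{remoteAddr: addr}
	mu.Unlock()

	defer func() {
		mu.Lock()
		delete(sessions, id)
		mu.Unlock()
	}()

	serve()
}

// liveSessions reports how many entries the map currently holds.
func liveSessions() int {
	mu.Lock()
	defer mu.Unlock()
	return len(sessions)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(id connID) {
			defer wg.Done()
			handleConn(id, "203.0.113.1:1234", func() {})
		}(connID(i))
	}
	wg.Wait()
	fmt.Println("live sessions after all connections closed:", liveSessions())
}
```

Putting the delete in a defer right next to the insert keeps the two halves of the lifecycle visibly paired, and it still runs if the handler returns early or panics. A gauge that exports liveSessions() would also have shown the map growing long before the OOM killer did.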

Two takeaways I keep relearning. First, "managed memory" only manages the memory you've genuinely let go of; a map you keep inserting into is an unbounded data structure dressed as a cache. If something lives for the life of the process and only ever grows, give it an eviction policy, a size cap, or a TTL, even a crude one. Second, build pprof in from the start. The difference between a five-minute diagnosis and a five-day one was that the endpoint was already there. Add it before you need it, because you'll only think to add it after.