Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the leak was a map, and the map was me

A Go service that grew RSS forever turned out to be holding every request key it had ever seen in a map nobody pruned.

A terminal full of profiling output

The service worked perfectly for about four days and then fell over. That was the whole shape of the bug. Deploy it, watch it serve traffic happily, and somewhere around day four the OOM killer would take it out, systemd would restart it, and the clock would reset. Nobody was paged because the restart was clean and fast. It just quietly ate memory until the kernel intervened, then started eating again.

A four-day cycle is a miserable thing to debug. You can't sit and watch it. You can't reproduce it on your laptop in an afternoon, at least not without faking the traffic, and faking traffic is exactly the thing that hides the bug, because the bug is about the shape of real traffic, not the volume.

ruling out the obvious

First I did the responsible thing and looked at the graphs. RSS was a straight line up. Not sawtooth, not spiky: a clean, almost insultingly linear climb from a couple of hundred megabytes at start to two gigabytes by the time it died. Linear growth against a roughly constant request rate is a strong hint. Something is accumulating in proportion to requests served, and nothing is ever letting go.

That rules out a lot. It's probably not a goroutine leak, because those tend to step up in chunks. It's probably not connection pool churn. Linear-with-traffic almost always means "we keep adding to a thing and never remove from it."

It also told me roughly where to look. A leak that scales with bytes processed would track throughput, not request count, and the curve would wobble with payload size. This one didn't care how big the requests were, only how many there had been, and specifically how many distinct ones. That's a different fingerprint, and it points away from buffers and towards bookkeeping: some structure indexed by a request property, growing one entry at a time. I didn't know which structure yet, but I knew its shape before I'd read a single line of the actual code, which is the part of debugging that feels like cheating when it works.

The service was Go, so I reached for the obvious tools. pprof was already wired into the admin port, which past me deserves a small amount of credit for. I grabbed a heap profile from a process that had been up for two days:

go tool pprof http://localhost:6060/debug/pprof/heap
(pprof) top

The top of the list was a single map's backing store. Not a surprising library, not a buffer, not some clever cache I'd forgotten. A plain map[string]*session that I had written, by hand, that was holding 1.9 million entries.

A flame graph of heap allocations

the map that only ever grew

Here is the offending pattern, cleaned up a little:

type tracker struct {
    mu       sync.Mutex
    sessions map[string]*session
}

func (t *tracker) seen(key string) *session {
    t.mu.Lock()
    defer t.mu.Unlock()
    s, ok := t.sessions[key]
    if !ok {
        s = &session{firstSeen: time.Now()}
        t.sessions[key] = s
    }
    s.lastSeen = time.Now()
    return s
}

Read it and the bug is right there in what's missing. There is an insert. There is a lookup. There is no delete. The session struct even has a lastSeen field, which is the tell: at some point past me had clearly intended to expire old entries, written the field to make that possible, and then never written the code that used it. The intention was in the type and nowhere else.

Every unique key got a permanent home in that map. And the keys were per-client identifiers, so for a service on the open internet the set of keys was effectively unbounded. It wasn't a leak in the C sense of unreachable memory. Every byte was perfectly reachable, which is worse, because a leak detector would never flag it. The map was doing exactly what a map does. I had just never told it to forget anything.

The four-day cadence finally made sense too. Two gigabytes divided by the steady rate of new unique clients came out to roughly four days. The bug wasn't intermittent at all. It was perfectly deterministic. It only looked random because the period was long enough to feel like bad luck rather than arithmetic.

the fix, and the fix to the fix

The naive repair is a background goroutine that walks the map and deletes anything whose lastSeen is older than some TTL:

func (t *tracker) reap(ttl time.Duration) {
    cutoff := time.Now().Add(-ttl)
    t.mu.Lock()
    defer t.mu.Unlock()
    for k, s := range t.sessions {
        if s.lastSeen.Before(cutoff) {
            delete(t.sessions, k)
        }
    }
}

That stopped the bleeding, and for an honest day's work that's where I'd have left it. But there's a subtlety worth knowing: deleting keys from a Go map shrinks the logical size, but it does not return the backing array's memory to the runtime. The map's internal buckets stay allocated at their high-water mark. So after a traffic spike pushed the map to four million entries and the reaper pruned it back to fifty thousand, RSS did not come down. It plateaued at the worst it had ever been.

For most services that's fine. A plateau is not a leak, and a plateau you can capacity-plan around. But if the high-water mark is genuinely too big, the only way to reclaim it is to periodically rebuild the map into a fresh one and swap it in, letting the old one be collected wholesale. I did that on a slow timer, guarded so it only fires when the live count is a small fraction of the allocated buckets. It's a bit ugly. It is also the kind of ugly that keeps RSS flat.

what I actually took away

The lesson isn't "use a cache library", though honestly, an off-the-shelf LRU with a size bound would have prevented the entire incident and I've since swapped one in. The lesson is that an unbounded collection fed by external input is a memory leak wearing a respectable suit. It passes review because every line is correct. It passes the race detector. It passes every test, because tests don't run for four days under real client diversity.

The lastSeen field that I'd added and never used was the real evidence. Past me knew this needed expiry. Past me wrote down the data to support it and then got distracted before writing the behaviour. When you find a struct field that nothing reads, treat it as a confession. Somebody meant to do something and didn't, and four days later it's the kernel's problem.