the leak was a map, and the map was me

A terminal full of profiling output

The symptom was the dullest kind: a service that worked perfectly, then was OOM-killed roughly every four days. No crash, no panic, no helpful stack. Just a process that grew, steadily, until the kernel ran out of patience and reaped it. Restart, and the clock started again.

I'd love to tell you I reached for a profiler straight away. I didn't. I spent the first evening convinced it was the runtime, or cgroup limits, or a leak in a dependency, because surely my own code was fine. It was not fine. It is never the runtime.

Watching it grow

The first useful thing was to stop guessing and start measuring. RSS over time, sampled once a minute and graphed, gave a beautifully straight line going up and to the right. Straight lines are good news in a way: a leak that grows linearly with traffic is usually one allocation site, not a hundred subtle ones. Something is being added to and never removed from.

This service was built against Go 1.7, so the obvious move was net/http/pprof. I'd already got it mounted on an internal port, which past-me deserves a small amount of credit for. Grabbing a heap profile is then just:

go tool pprof http://localhost:6060/debug/pprof/heap

The trick with a slow leak is to take two profiles some hours apart and diff them, rather than staring at a single snapshot. A single heap profile shows you what's big right now, which might just be a legitimately large buffer. The diff shows you what's growing, which is the thing you actually care about.

go tool pprof -base heap.0900 heap.1500

A flame of profiling boxes on screen

top inside pprof put one allocation at the very top of the diff, and it accounted for nearly all of the growth between the two samples. Not a library. Not the runtime. A function in my own package with my own name on the commits.

The map that ate the heap

Here is the offending shape, cleaned up a little so it fits on a slide:

type cache struct {
    mu      sync.Mutex
    entries map[string]*entry
}

func (c *cache) get(key string) *entry {
    c.mu.Lock()
    defer c.mu.Unlock()
    e, ok := c.entries[key]
    if !ok {
        e = newEntry(key)
        c.entries[key] = e
    }
    return e
}

Read that and it looks completely reasonable. It is a cache. The problem is that it is only ever a cache in the optimistic sense, the sense where you imagine the set of keys is small and stable. There was nothing here to remove entries. Nothing to bound the size. Nothing to expire.

And the keys were not small and stable. The key was derived from a per-request identifier that was, for all practical purposes, unique. Every request minted a fresh key, every fresh key minted a fresh entry, and that entry sat in the map forever holding a pointer to a chunk of associated state. The map didn't leak in any exotic Go sense. It leaked in the most basic sense imaginable: I put things in and never took them out. The garbage collector can't help you when you are still holding the reference. As far as the runtime is concerned, that memory is loved and wanted.

The reason it took four days rather than four minutes is just arithmetic. Each entry was small, the request rate was moderate, and the box had a lot of memory. Multiply a small leak by a few hundred thousand requests an hour and wait, and eventually you reach the limit. Slow leaks are the worst kind precisely because they survive every quick test you throw at them. The thing ran fine in staging for a week because staging never saw four days of production traffic.

The fix, and the better fix

The quick fix was to stop using the map as unbounded storage. The keys genuinely didn't need to live beyond the request, so the entry could just be a local value passed down the call stack, and the map deleted entirely. That removed the leak and a lock-contention point in one go, which was a pleasant bonus.

Where a cache really is wanted, the rule I now hold to is simple: a cache without an eviction policy is not a cache, it is a memory leak with good intentions. Bound it by size, or bound it by time, or both. Even a crude LRU with a fixed capacity would have turned this from an outage into a non-event, because the worst case is then "we cache slightly fewer things than ideal", not "the process dies on a schedule".

// at minimum, delete when you're done with it
defer delete(c.entries, key)

What stuck with me afterwards wasn't the bug, it was how long I spent assuming it couldn't be mine. The profiler had the answer in about ten minutes once I actually asked it. The four days of denial were entirely self-inflicted. These days the first thing I do with a slow leak is take two heap profiles and diff them, before I form a single opinion. The opinion is almost always wrong, and the diff almost never is.