The service didn't crash. That was the annoying part. It just slowly, politely, over about four days, ate all the memory on the box until the kernel got bored and killed it. Then it restarted, sat happily at a couple of hundred megabytes, and began the slow climb all over again. A sawtooth on the memory graph that you could set your watch by.
A clean OOM-and-restart every few days is the kind of bug that's very easy to ignore. The thing works. Users don't notice. The restart is graceful. So it lived on the "we should look at that" list for longer than I'm proud of, until the period shortened from four days to two and started waking someone up.
What it looked like
Resident memory climbing in a straight line, no plateau, no garbage collection ever clawing it back. That straight line is the tell. A normal service breathes: it allocates, the GC reclaims, it settles into a band. This one only ever went up. Something was holding references the garbage collector wasn't allowed to free, which means somewhere a live thing was pointing at memory we no longer cared about.
Go gives you the tools for this almost for free, and I should have reached for them on day one instead of day three. pprof is built in. You expose it, you take a heap profile, and you read it.
import _ "net/http/pprof"
func main() {
go func() {
log.Println(http.ListenAndServe("localhost:6060", nil))
}()
// ... the rest of the service
}
Then, against a process that had been running long enough to be visibly bloated:
go tool pprof http://localhost:6060/debug/pprof/heap
The trick with a leak is to take two heap profiles a while apart and diff them. A single snapshot tells you what's big. The diff tells you what's growing, which is the actual question. pprof will do this with -base, comparing a later profile against an earlier one and showing you only the delta.
# snapshot, wait an hour of real traffic, snapshot again
curl -s localhost:6060/debug/pprof/heap > heap1.pb.gz
sleep 3600
curl -s localhost:6060/debug/pprof/heap > heap2.pb.gz
go tool pprof -base heap1.pb.gz heap2.pb.gz
Inside pprof I went straight to top and then list on the function it named. One caveat worth knowing: by default the heap profile shows inuse_space, which is live memory, exactly what you want for a leak. There's also alloc_space, the total ever allocated, which is the right view for allocation churn but will mislead you here because everything churns. For a leak you want what's still resident and growing, so inuse_space it is.
What it actually was
The diff pointed at one allocation site, and it was embarrassing once I saw it. A cache. Of course it was a cache.
Somewhere early in the service's life, someone (me, the blame is mine, let's not pretend) had added a little in-memory map to avoid recomputing something per request:
var resultCache = map[string]*Result{}
func cachedLookup(key string) *Result {
if r, ok := resultCache[key]; ok {
return r
}
r := expensiveLookup(key)
resultCache[key] = r
return r
}
Read that and the bug is right there in plain sight. We only ever write to the map. There is no eviction, no expiry, no upper bound, nothing that ever deletes a key. Every distinct key we ever saw is held forever, along with whatever it points to.
It worked fine for ages because in testing, and early in production, the set of keys was small and bounded. Then the usage pattern shifted, the keys became effectively unbounded (they had a user-ish identifier baked in), and the map quietly became a log of everything the service had ever been asked about. A leak with extra steps. Not a leak in the C sense of lost pointers, but exactly the same symptom: memory that's reachable, so the GC won't touch it, but that we'll never actually use again.
This is the thing that makes these so easy to miss in code review. There's no obvious bug. Every line is correct in isolation. The map is declared correctly, the read is correct, the write is correct, the function returns the right thing. The bug isn't in any single statement, it's in the absence of a statement: the delete that nobody ever wrote, the bound nobody set. You can't spot a missing line by reading the lines that are there, which is why this class of problem survives review so cheerfully and only shows up as a shape on a graph weeks later.
It's also worth saying why this leaks in a garbage-collected language at all, because people sometimes assume the GC saves you from yourself here. It doesn't, and it can't. The garbage collector frees things that are no longer reachable. Everything in that map is reachable, by definition, because the map is a package-level variable that lives forever and holds a reference to every entry. The GC is doing its job perfectly. We're the ones holding the door open.
The fix, and the better fix
The quick fix was to stop pretending this was a cache I'd thought about and make it one. A bounded LRU with a sensible maximum:
cache, _ := lru.New[string, *Result](10000)
Swap the bare map for that, and the memory graph turned from a ramp into a flat line with a bit of texture, exactly the breathing I wanted to see. The OOM-restart loop simply stopped.
But the better fix was the conversation afterwards, which was: did we even need this cache? expensiveLookup turned out to be about three milliseconds, and we were caching it to save a database round trip that the database was perfectly happy to serve. For a chunk of the call sites we deleted the cache entirely and the latency was unmeasurably different. The fastest map is the one you never populate.
What I took from it
A few things, none of them new, all of them worth saying again.
- Any unbounded data structure that lives for the process lifetime is a leak waiting for the right traffic. A map you only ever insert into is the classic. If there's no
delete, no expiry and no size cap, treat it as a memory leak with a politer name. - A clean OOM-and-restart is still a bug. Graceful failure is a wonderful thing, but it makes real problems easy to ignore, and ignored problems get worse on their own schedule, not yours.
pprofwith a base profile would have told me this in twenty minutes. I spent two days staring at graphs first because the problem didn't feel urgent. The tooling was sitting right there the whole time.
The thing I keep relearning is that "cache" is a word that lets you skip thinking. The moment you call a map a cache, your brain files it as a clever optimisation and stops asking the obvious question, which is what gets removed from it and when. If the answer is "nothing, ever", you haven't built a cache. You've built a slow leak with good branding.